Abstract—This paper presents a study of the expected performance behavior of the present 3-level cache system for multi-core systems. For this, a queuing model of the present 3-level cache system for multi-core processors is developed and its possible performance is analyzed as the number of cores increases. Important performance parameters, such as the access time and utilization of the individual caches at each level and the overall average access time of the cache system, are determined. Results for up to 1024 cores are reported in this paper.

Keywords—Multi-Core, Memory Hierarchy, Cache Access Time, Queuing Analysis.

I. INTRODUCTION

THIS work is a continuation of research at NED University on the design of a new cache model for multi-core processors [1], [2], [3], [4], [5]. This paper presents a study of the possible behavior of the present 3-level cache system in multi-core systems as the number of cores increases. The performance analysis is done using queuing modeling. Queuing theory is a well-established mathematical modeling method for studying and analyzing queuing phenomena in a system. Because of its realistic behavior, it is used extensively to study various computer-related applications such as scheduling, user and process management, batch processing, multi-programming, virtualization, and uni- and multi-processor computer architecture design.

Current general-purpose multi-core processors [13], [14], [15] use the present 3-level (L1, L2 and L3) cache scheme. In this cache model, every core has its own private L1 cache. The L2 cache may be private, shared or split. In the first case, L2 is kept local to a core, as in the Intel Pentium 4D and Itanium 2. In the second, more common approach, each core has its own private L1 cache and shares L2 through L1; the Intel dual-core and Core Duo and the AMD dual- and quad-core processors have this configuration. In the third case, the L2 cache is usually shared by half of the cores, the remaining half of the cores share their own L2 cache, and all L2 caches are connected to L3. The Intel Xeon quad- and eight-core processors have two L2 caches, each shared by two cores. The Sun UltraSPARC T2 has four bi-shared L2 caches and one L3 cache for eight cores.

Dr. Muhammad Ali Ismail is an Assistant Professor in the Department of Computer and Information Systems Engineering, NED University of Engineering and Technology, Karachi - 75270, Pakistan. Tel.: +92-3002255856; fax: +92-21-99261255; [email protected].

With the present dual-, quad- and eight-core processors, the 3-level cache system works well. But as the number of cores increases, it may become a bottleneck and result in data traffic congestion and performance degradation [6], [7], [8].

Queuing theory is a mathematical method for analyzing a client-server system. It examines every component of a system in line to be served, including the arrival process, the service process, the number of servers, the number of system places and the number of customers. A queuing network analysis is useful for determining many performance parameters, such as mean response time, marginal probabilities, utilization, throughput, mean number of jobs, mean queue length, and mean waiting time, for any individual server and for the complete network [9], [10]. In queuing theory, any queuing system is described by Kendall's notation, first proposed by D. G. Kendall in 1953. This standard notation, 'A/B/c/K/N/D', is used to describe, characterize and classify the queuing model to which a queuing system corresponds. Each character defines a specific characteristic of the queue: 'A' defines the arrival process, 'B' the service time distribution, 'c' the number of parallel servers in the queuing system, 'K' the capacity of the queuing system (the maximum number of customers allowed in the system, including those in service), 'N' the size of the population from which the customers come, and 'D' the service discipline. M/D/c/K [11], [12] is one of the analytical models that generalize the solution of Markovian queues to the case of constant service time distributions: the arrival process is Poisson and the service rate is constant and deterministic, while c and K represent the number of parallel servers and the system capacity, respectively.
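As an illustration outside the paper, the M/D/1 behavior just described can be checked numerically. The following minimal Python sketch (all parameter values are arbitrary) compares the closed-form Pollaczek-Khinchine mean waiting time of an M/D/1 queue against a simulation based on Lindley's recursion:

```python
import random

def md1_mean_wait(lam, s):
    """Pollaczek-Khinchine mean queueing delay for M/D/1:
    Wq = rho * s / (2 * (1 - rho)), with utilization rho = lam * s."""
    rho = lam * s
    assert rho < 1, "queue must be stable (rho < 1)"
    return rho * s / (2 * (1 - rho))

def simulate_md1(lam, s, n=200_000, seed=1):
    """Estimate the mean wait empirically via Lindley's recursion:
    W_{k+1} = max(0, W_k + S - A_k), with A_k ~ Exp(lam) and S = s."""
    rng = random.Random(seed)
    w = total = 0.0
    for _ in range(n):
        total += w
        w = max(0.0, w + s - rng.expovariate(lam))
    return total / n

lam, s = 0.5, 1.0                  # Poisson arrivals, constant service time
print(md1_mean_wait(lam, s))       # 0.5
print(simulate_md1(lam, s))        # close to 0.5
```

For comparison, an M/M/1 queue with the same rates has exactly twice this queueing delay, which is why the deterministic-service assumption matters when modeling fixed-latency devices such as caches.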

Performance Behavior Analysis of the Present 3-Level Cache System for Multi-Core Systems using Queuing Modeling

Muhammad Ali Ismail

International Conference on Latest Computational Technologies (ICLCT'2012), March 17-18, 2012, Bangkok


II. QUEUING MODEL FOR 3-LEVEL CACHE SYSTEM

Any cache hierarchy may be analyzed using queuing theory by considering every cache as a server and a data request, whether from a CPU or from the lower level of cache, as a client. A complete cache hierarchy may be considered an open queuing network in which multiple servers (caches) are attached in a specific pattern. A request from a client is served by a specific server (cache); if the server (cache) fulfils the request, the client leaves the queue, otherwise the request is sent to the next server (upper-level cache). The probability of a request being fulfilled or not at any server (cache) is the same as the hit or miss ratio. Similarly, the mean response time of a server is the same as the average cache access time. Using the queuing network, performance parameters such as mean response time (average cache access time), marginal probabilities, utilization, throughput, mean number of jobs, mean queue length, and mean waiting time may be calculated for any individual server (cache) and for the complete network (cache hierarchy). Consider the present 3-level cache system for N cores, where the L1 cache is private to every core, each L2 is shared by half of the cores, and L3 is shared by the two L2 caches. For analyzing the L2 cache level, the M/D/1/N/2-FIFO queuing model can be applied, and the M/D/1/2 model for analyzing the L3 cache; L1 is simple to analyze as it is private to a core. The queuing model for the L2 cache is shown in fig. 1 and the complete queuing network is given in fig. 2. Let

λ be the average data request rate by a core,
λ_i be the arrival rate of requests at the i-th level of cache,
P_{i,h} be the probability of a cache hit at the i-th level, and
P_{i,m} = 1 - P_{i,h} be the probability of a cache miss at the i-th level, in which case the search proceeds to the upper (i+1)-th cache level.
Also let T_i and ρ_i be the average cache access time and the cache utilization of the i-th cache level, respectively.

Since λ_1, the data request rate for an L1 cache, is always equal to the request rate made by its core, we can write

λ_1 = λ    (1)

For the request rate at L2, each of the N/2 cores sharing an L2 cache forwards its L1 misses to it. Considering the definitions of the cache miss and hit probabilities and using fig. 2, we get

λ_2 = (N/2) P_{1,m} λ_1

Substituting for λ_1 from (1),

λ_2 = (N/2) (1 - P_{1,h}) λ    (2)

The request rate at the L3 cache, which receives the misses of the two L2 caches, can be given by

λ_3 = 2 P_{2,m} λ_2

Substituting for λ_2 from (2), we get

λ_3 = N (1 - P_{1,h}) (1 - P_{2,h}) λ    (3)
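As a minimal sketch outside the paper, equations (1)-(3) translate directly into code; the hit probabilities and request rate in the usage line are illustrative placeholders, not measured values:

```python
def arrival_rates(lam, p1_hit, p2_hit, n_cores):
    """Per-level request arrival rates for the 3-level cache model.

    lam     -- average data request rate of one core
    p1_hit  -- L1 hit probability; p2_hit -- L2 hit probability
    n_cores -- number of cores (each L2 is shared by n_cores/2 cores,
               and the L3 is shared by the two L2 caches)
    """
    lam1 = lam                                  # (1): L1 sees its core's rate
    lam2 = (n_cores / 2) * (1 - p1_hit) * lam1  # (2): misses of N/2 L1 caches
    lam3 = 2 * (1 - p2_hit) * lam2              # (3): misses of both L2 caches
    return lam1, lam2, lam3

# Illustrative values: 8 cores, 90% L1 hits, 80% L2 hits, unit request rate.
print(arrival_rates(1.0, 0.9, 0.8, 8))   # approximately (1.0, 0.4, 0.16)
```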

The cache utilization at each level may be calculated as the product of the arrival rate and the deterministic service time E[s_i] of that level. For the L1 cache

ρ_1 = λ_1 E[s_1]    (4)

For the L2 cache

ρ_2 = λ_2 E[s_2]    (5)

And for the L3 cache

ρ_3 = λ_3 E[s_3]    (6)

Fig. 1. Atomic model of the L2 cache for N cores in the 3-level cache system

For an M/D/1 queuing model, the average cache access time T_i of the i-th cache level can be calculated using the following equation (7) [5]:

T_i = E[s_i] + ρ_i E[s_i] / (2 (1 - ρ_i))    (7)
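Equations (4)-(7) combine into a small helper. This is a sketch under the assumption that E[s_i] is the level's deterministic service time; the numbers in the usage line are illustrative only:

```python
def cache_access_time(lam_i, service_time):
    """Mean access time of one cache level modeled as an M/D/1 server.

    rho_i = lam_i * E[s_i]                        -- utilization, eqs. (4)-(6)
    T_i = E[s_i] + rho_i*E[s_i]/(2*(1 - rho_i))   -- M/D/1 access time, eq. (7)
    """
    rho = lam_i * service_time
    if rho >= 1:
        raise ValueError("cache saturated: rho >= 1, waiting time unbounded")
    return service_time + rho * service_time / (2 * (1 - rho))

# A cache with 1 ns service time receiving requests at half its capacity:
print(cache_access_time(0.5, 1.0))   # 1.5
```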

[Fig. 1, atomic block model of the 2nd-level cache for N cores in the 3-level cache system: N/2 clients at cache level i-1 queue for service at cache level i, whose deterministic service time is E[s] = 2 Cs_i / N; misses are forwarded to cache level (i+1).]


The overall average cache access time of the 3-level cache queuing model can be calculated using Little's equation. For the 3-level cache system,

T_avg = T_1 + (1 - P_{1,h}) [T_2 + (1 - P_{2,h}) (T_3 + (1 - P_{3,h}) T_m)]    (8)

where T_m is the main memory access time.

Fig. 2. Queuing network model for 3-Level cache system

III. PERFORMANCE EVALUATION

For evaluating the performance of the cache system, the configuration of a presently available general-purpose processor is assumed: each core operates at 4.0 GHz, with a 4 MB L1 cache, a 64 KB L2 cache and 1 GB of main memory operating at 1333 MHz. For the detailed queuing network analysis, two more input parameters are required. The first is the initial data request rate made by the cores to their respective private L1 caches; it is taken as half the core speed. The second is the probability of data request generation. For this, an equal-probability method is used, that is, every cache block has an equal chance of being referenced. This is done for two reasons: first, two further important parameters, the number of programs running at any given time and their sizes, can be included in the calculations and their impact observed; second, it makes the queuing network more general. The probability of finding data at any cache level is calculated using (9).

P_{i,h} = C_i / (n_p S_p)    (9)

where C_i is the capacity of the i-th level cache, n_p is the number of running programs and S_p is their average size.

For the 3-level cache system, the average data request rate at each cache level can be calculated using (1), (2) and (3). Then the respective cache utilizations may be calculated using (4), (5) and (6), the individual cache access times using (7), and finally the overall average cache access time using (8).
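The evaluation chain just described, equations (1) through (8), can be sketched end-to-end in code. This is a hypothetical illustration, not the paper's program: all service times, hit probabilities and rates below are placeholder values, and in practice the hit probabilities would come from (9).

```python
def three_level_access_time(n_cores, lam, svc, hits, t_mem):
    """Overall average cache access time for an N-core 3-level hierarchy.

    n_cores -- number of cores
    lam     -- per-core request rate (requests per ns)
    svc     -- (s1, s2, s3): deterministic service times of L1/L2/L3 in ns
    hits    -- (p1, p2, p3): per-level hit probabilities
    t_mem   -- main memory access time in ns
    """
    p1, p2, p3 = hits
    # Arrival rates, eqs. (1)-(3).
    lam1 = lam
    lam2 = (n_cores / 2) * (1 - p1) * lam1
    lam3 = 2 * (1 - p2) * lam2

    def md1_time(l, s):
        rho = l * s                    # utilization, eqs. (4)-(6)
        if rho >= 1:
            return float("inf")        # saturated level: delay grows without bound
        return s + rho * s / (2 * (1 - rho))   # M/D/1 access time, eq. (7)

    t1 = md1_time(lam1, svc[0])
    t2 = md1_time(lam2, svc[1])
    t3 = md1_time(lam3, svc[2])
    # Overall average access time, eq. (8).
    return t1 + (1 - p1) * (t2 + (1 - p2) * (t3 + (1 - p3) * t_mem))

# Access time rises with the core count, mirroring the trend reported in Table I:
for n in (4, 8, 16, 32, 64):
    print(n, round(three_level_access_time(n, 0.5, (0.5, 0.6, 0.8),
                                           (0.9, 0.8, 0.7), 60.0), 2))
```

With these placeholder parameters, the L2 level saturates (rho_2 >= 1) beyond a certain core count and the model returns an unbounded access time, which is the same qualitative behavior the paper attributes to L2 over-utilization.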

Table I shows the overall average access time of the present 3-level cache system for different numbers of cores. Fig. 3 presents the same data graphically.

TABLE I
AVERAGE CACHE ACCESS TIME (NSEC) FOR THE 3-LEVEL CACHE SYSTEM FOR DIFFERENT NUMBERS OF CORES

Number of Cores    Access Time (nsec)
   4                  1.72
   8                  2.39
  16                  3.73
  32                  6.39
  64                 11.72
 128                 22.38
 256                 43.66
 512                 86.13
1024                170.6

Fig. 3. Average cache access time (nsec) for the 3-level cache system for different numbers of cores

[Fig. 2: cores 0 to N, each with a private L1 cache; two L2 caches, each shared by N/2 cores; one L3 cache; and main memory, with hit/miss paths between successive levels.]


Table II shows the utilization of all three cache levels for different numbers of cores. Similarly, fig. 4 shows the same data graphically.

TABLE II
CACHE UTILIZATION FOR THE 3-LEVEL CACHE SYSTEM FOR DIFFERENT NUMBERS OF CORES

No. of Cores    ρ1        ρ2        ρ3
   4           0.500     1.333     1.999
   8           0.500     2.667     1.999
  16           0.500     5.333     1.999
  32           0.500    10.665     1.999
  64           0.500    21.325     1.998
 128           0.500    42.633     1.998
 256           0.500    85.200     1.996
 512           0.500   170.133     1.993
1024           0.500   339.200     1.987

Fig. 4. Utilization of the L1, L2 and L3 caches in the 3-level cache system for different numbers of cores. Over-utilization of the L2 cache may be noted

IV. RESULT ANALYSIS

Table I and fig. 3 clearly show that as the number of cores increases, the overall average cache access time increases. This is due to the larger number of requests per unit time, which the present 3-level cache proves inefficient at handling.

At the same time, over-utilization of the L2 cache can be observed in Table II and fig. 4: the L2 cache becomes more congested as the number of cores increases. This is because all the private L1 caches communicate with L2 at the same time in case of a cache miss, and this request rate grows with the number of cores.

V. CONCLUSION

This study validates the idea that the present 3-level cache system works well for a small number of cores, but as the number of cores increases, this cache system will cause a memory bottleneck and result in data traffic congestion and performance degradation. To avoid this possible memory-wall problem, computer designers must work on a new cache memory hierarchy design for multi-core processors.

REFERENCES

[1] M. A. Ismail, S. H. Mirza, T. Altaf, "LogN+1 and LogN Cache System for Multi-Core Processors", application accepted for US patent through HEC, May 07, 2010; further processing is underway.
[2] M. A. Ismail, S. H. Mirza, T. Altaf, "Design of a Cache Hierarchy for LogN and LogN+1 Model for Multi-Level Cache System for Multi-Core Processors", in Proc. of the International Conference on Frontiers of Information Technology (FIT)-09, ACM, Dec 16-18, 2009, Pakistan.
[3] M. A. Ismail, S. H. Mirza, T. Altaf, "Binary Tree Based Multi-level Cache System for Multi-core Processors", in Proc. HPCNCS-09, July 13-16, 2009, Orlando, Florida, USA.
[4] welcome.maismail.info/bcm
[5] http://www.neduet.edu.pk/cise/MCPAResearch.html
[6] J. Tao, M. Kunze, W. Karl, "Evaluating the Cache Architecture of Multicore Processors", in Proc. of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, IEEE Computer Society, 2008.
[7] S. K. Moore, "Multicore is bad news for supercomputers", IEEE Spectrum, 2008.
[8] "The core of the issue: multi-core and you", Linux Magazine, November 2007.
[9] D. Gross, J. F. Shortle, J. M. Thompson, "Fundamentals of Queueing Theory", John Wiley & Sons, 2008.
[10] G. Bolch, S. Greiner, H. de Meer, K. S. Trivedi, "Queueing Networks and Markov Chains", Chapter 7, John Wiley & Sons, 2007.
[11] C. G. Cassandras, S. Lafortune, "Introduction to Discrete Event Systems", Springer, 2007.
[12] G. Giambene, "Queuing Theory and Telecommunications: Networks and Applications", Chapter 6, Springer, 2005.
[13] AMD: http://www.amd.com/uk/products/Pages/processors.aspx
[14] Intel: http://www.intel.com/products
[15] Sun Systems: http://www.sun.com/servers/index.jsp

Muhammad Ali Ismail received his M.Engg. in Computer Engineering with specialization in Computer Systems and Design in 2007 and his PhD in Computer Engineering in the field of multi-core computing in 2011. His areas of interest include serial and parallel processing computer architectures, distributed computing, cluster and grid computing, memory management and related algorithms, and parallel programming languages, models and algorithms. He is a member of the IEEE (USA) and the IEE (UK).
