
SIMULATING NON-UNIFORM

MEMORY ACCESS ARCHITECTURE

FOR CLOUD SERVER APPLICATIONS

Joakim Nylund

Master of Science Thesis
Supervisor: Prof. Johan Lilius
Advisor: Dr. Sébastien Lafond
Embedded Systems Laboratory
Department of Information Technologies
Åbo Akademi University

Autumn 2011


ABSTRACT

The purpose of this thesis is to evaluate and define architectural candidates for cloud-based servers. The research focuses on the interconnect and memory topology of multi-core systems. One specific memory design is investigated, and the Linux support for the architecture is tested and analyzed with the help of a full-system simulator with a modified memory architecture. The results demonstrate how available tools in Linux can be used to efficiently run tasks on separate CPUs in large systems with many processing elements.

Keywords: Interconnect, Cloud Computing, NUMA, Linux, Simics


LIST OF FIGURES

2.1 The Memory Hierarchy [35]. . . . . . . . . . . . . . . . . . . . . . 4

2.2 Quad-core AMD Opteron Processor [1]. . . . . . . . . . . . . . . 6

2.3 Simple SMP System. . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Cache Coherence in a Dual-core CPU. . . . . . . . . . . . . . . . 10

2.5 Write invalidate bus snooping protocol [2]. . . . . . . . . . . . . 10

2.6 Write broadcast bus snooping protocol [2]. . . . . . . . . . . . . 11

3.1 Common Network Topologies [3]. . . . . . . . . . . . . . . . . . 14

3.2 6D Mesh/Torus Architecture [25]. . . . . . . . . . . . . . . . . . 15

3.3 Open Compute Motherboard based on Intel CPU and QuickPath Interconnect [4]. . . . . . . . . . . . . . . . . . . . . . . . . 17

3.4 Open Compute Motherboard based on AMD CPU and HyperTransport Interconnect [4]. . . . . . . . . . . . . . . . . . . . . . . 18

3.5 Next Generation ARM SoC [5]. . . . . . . . . . . . . . . . . . . . 20

4.1 The Simplest NUMA System [42]. . . . . . . . . . . . . . . . . . . 22

4.2 Different Motherboard Topologies for the Quad-core AMD Opteron CPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.3 ACPI System Locality Information Table (SLIT) [34]. . . . . . . . 27

4.4 ACPI Static Resource Affinity Table (SRAT) [34]. . . . . . . . . . 28

5.1 Cache Read and Write of Core-0 and Core-1. . . . . . . . . . . . 34

5.2 Cluster NUMA Architecture. . . . . . . . . . . . . . . . . . . . . 37

5.3 A Two Level Cache System [43]. . . . . . . . . . . . . . . . . . . . 39

5.4 The hardware for Simics NUMA. . . . . . . . . . . . . . . . . . . 42

5.5 All to all message passing on one core. . . . . . . . . . . . . . . . 43

5.6 All to all message passing on four cores. . . . . . . . . . . . . . . 44

5.7 Comparison of one core and four cores (Big). . . . . . . . . . . . 45

5.8 All to one message passing on one core. . . . . . . . . . . . . . . 45

5.9 All to one message passing on four cores. . . . . . . . . . . . . . 46

5.10 Comparison of one core and four cores (Bang). . . . . . . . . . . 46

6.1 Future Work Illustrated. . . . . . . . . . . . . . . . . . . . . . . . 49

B.1 Emark Benchmark Results. . . . . . . . . . . . . . . . . . . . . . . 62


CONTENTS

Abstract i

List of Figures ii

Contents iv

1 Introduction 1

1.1 Cloud Software Program . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Memory Architecture 3

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 The Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Primary Storage . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2.2 Secondary Storage . . . . . . . . . . . . . . . . . . . . . . 7

2.3 Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Shared Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4.1 Symmetric Multiprocessing . . . . . . . . . . . . . . . . . 8

2.4.2 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Interconnect 12


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Multiprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 Network Topology . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3.1 High-performance Computing . . . . . . . . . . . . . . . 14

3.3.2 Intel QuickPath Interconnect and HyperTransport . . . . 15

3.3.3 Arteris Interconnect . . . . . . . . . . . . . . . . . . . . . 15

3.3.4 Open Compute . . . . . . . . . . . . . . . . . . . . . . . . 16

3.4 ARM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.4.1 Cortex-A Series . . . . . . . . . . . . . . . . . . . . . . . . 19

3.4.2 Advanced Microcontroller Bus Architecture (AMBA) . . 20

3.4.3 Calxeda 120 x Quad-core ARM Server Chip . . . . . . . . 20

4 Non-Uniform Memory Access (NUMA) 22

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.2 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.4 Advanced Configuration and Power Interface . . . . . . . . . . 25

4.4.1 System Locality Information Table . . . . . . . . . . . . . 26

4.4.2 Static Resource Affinity Table . . . . . . . . . . . . . . . . 27

4.5 The Linux Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.5.1 NUMA Aware Scheduler . . . . . . . . . . . . . . . . . . 28

4.5.2 Memory Affinity . . . . . . . . . . . . . . . . . . . . . . . 29

4.5.3 Processor Affinity . . . . . . . . . . . . . . . . . . . . . . . 29

4.5.4 Fake NUMA Nodes . . . . . . . . . . . . . . . . . . . . . 30

4.5.5 CPUSET . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.5.6 NUMACTL . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Implementation 33


5.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2 Simics Full-system Simulator . . . . . . . . . . . . . . . . . . . . 33

5.2.1 History and Features . . . . . . . . . . . . . . . . . . . . . 34

5.2.2 Working With ARM Architecture . . . . . . . . . . . . . . 35

5.2.3 gcache and trans-staller . . . . . . . . . . . . . . . . . . . . 38

5.3 Oracle VM VirtualBox . . . . . . . . . . . . . . . . . . . . . . . . 38

5.4 Erlang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.4.1 Distributed Programming . . . . . . . . . . . . . . . . . . 40

5.4.2 Erlang Nodes . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.4.3 Asynchronous Message Passing . . . . . . . . . . . . . . 40

5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.6 Analysis of Emark Benchmarks . . . . . . . . . . . . . . . . . . . 42

6 Conclusions and future work 48

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Bibliography 51

Swedish Summary 55

A Simics Staller Module Implementation 59

B Emark Results 62


CHAPTER

ONE

INTRODUCTION

As the core count in many architectures is constantly growing, the rest of the system also has to be updated to meet the higher demands the processing units set. One of these demands concerns the interconnect. As more processing power becomes available, the interconnect has to be able to move significantly more data than before. Another demand concerns the memory. With a traditional setup, there is only one memory available via one interconnect. When several processing elements are continuously asking for data, they are going to spend most of their time waiting, unless memory is available in several places through several paths. This work investigates the Non-Uniform Memory Access (NUMA) design, a memory architecture tailored for many-core systems, and presents a method to simulate this architecture for the evaluation of cloud-based server applications. The work also introduces and uses the NUMA capabilities found in the Linux kernel, and results from tests running on a simulated NUMA interconnect topology are presented and analyzed.

1.1 Cloud Software Program

The Cloud Software Program is a four-year (2010-2013) research program organized by Tieto- ja viestintäteollisuuden tutkimus TIVIT Oy [6]. The goal of the program is to increase the international competitiveness of Finnish cloud-based software. A Cloud Server From Finland could contribute with, among other things, a skillful implementation of a sustainable open system [7]. At the Embedded Systems Laboratory at Åbo Akademi University we contribute to


the Cloud project with research related to energy efficiency, with the most current interest in both power-smart scheduling and efficient interconnects between processing units and memory.

1.2 Thesis Structure

Chapter 2 presents to the reader the common memory architecture of modern computers and embedded systems, with descriptions of different factors affecting the memory in terms of speed and scalability. In Chapter 3 we explain and compare different interconnect topologies and present how the interconnect is implemented in different architectures. A deeper analysis of the Non-Uniform Memory Access system is included in Chapter 4, together with an examination of the Linux kernel's NUMA capabilities. Chapter 5 describes the work process of this thesis and presents the results acquired from Emark Erlang benchmarks running in Simics. Finally, Chapter 6 sums up the presented material and examines possible future work in this field.


CHAPTER

TWO

MEMORY ARCHITECTURE

2.1 Introduction

Memory is, and has always been, a fundamental part of every computer system. Today, an increasing number of electronic devices have become more advanced and therefore act more or less like a computer, consisting of a processing unit with a specific memory setup. At the same time, as devices designed to do simple tasks have become more complex, the already advanced computer systems have also progressed even further. Computer systems with the latest technology and fast processing elements have become more efficient in many ways over the years, and one of the components that has evolved and grown in complexity is the memory architecture. Still, the principal memory architecture has remained the same for all these years; already in the 1940s, von Neumann stated that a computer memory has to be engineered in a hierarchical manner [24].

2.2 The Memory Hierarchy

Even though the idea and usage of a memory hierarchy has remained the same for years, the size of the hierarchy has increased and the number of layers seems to be slowly but constantly growing. Most recently, new layers of cache memory [33] and even Non-Uniform Memory Access have become present in some general-purpose CPUs' memory hierarchy [35]. These new layers


of memory have been added to the hierarchy in order to manage the higher demands that the increasing speed and growing number of CPUs place on the whole memory subsystem.

Computer memory can basically be divided into two groups. The first one is primary storage, which consists of fast memory directly accessible by the CPU. This type of memory is volatile, meaning it cannot keep its state unless it is powered on. Consequently, when a computer system is powered off or rebooted, all data in the primary storage is lost. Besides being fast, primary storage is typically small and expensive [35]. Secondary storage is in many ways the opposite of primary storage. Firstly, it is non-volatile memory. This means it keeps its state even when powered off. Accordingly, when a computer system is powered off or rebooted, all the data stored in the secondary storage is preserved and accessible the next time the system is started. Secondary storage is also called external memory and is characterized by being large, slow and cheap. Secondary memory cannot be accessed by the CPU directly, which makes it even more complicated and slower to access than primary memory [35]. The memory hierarchy of a computer is illustrated in Figure 2.1.

Figure 2.1: The Memory Hierarchy [35].


2.2.1 Primary Storage

The purpose of the hierarchy is to make it possible for the CPU to quickly get access to data, so that the CPU can spend more time working, instead of waiting for data. A modern computer system moves data that is needed now, and in the future, from the slower memory up to the faster memory. This is why the small and expensive fast memory is located near the CPU, so that the processor can quickly access the needed data. The top half of the memory hierarchy in Figure 2.1 is composed of the primary storage, which consists of the following parts:

• CPU Registers

The CPU registers lie at the very top of the memory hierarchy. There are only a few processor registers in a CPU, but these are very fast. A computer moves data from other memory into the registers, where it is used for different calculations. Most new computers are based on the x86 instruction set architecture (ISA) and have both hardware and software support for the most recent 64-bit x86 instruction set. This architecture provides 16 64-bit general-purpose registers. In comparison, older computers using the 32-bit x86 instruction set only have 8 32-bit general-purpose registers [39]. The popular ARMv7 instruction set architecture, used in basically all smartphones and tablets, has 16 general-purpose registers, all 32 bits long, making it, in terms of CPU registers, basically a compromise between the 32- and 64-bit x86 architectures [23]. When writing a program for some architecture, the compiler usually takes care of which variables are to be kept in CPU registers, but to optimize a system manually, register variables can be declared by the developer in programming languages like C with:

register int n; [8].
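As a slightly fuller sketch (illustrative only; the function and variable names are invented here, and modern compilers typically make register allocation decisions themselves and may ignore the hint), the register keyword can be applied to a loop counter as follows:

/* Hint that the loop counter be kept in a CPU register. */
int sum_array(const int *data, int len)
{
    register int i;
    int sum = 0;

    for (i = 0; i < len; i++)
        sum += data[i];

    return sum;
}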

• Cache

Cache memory is the fast memory which lies between the main memory and the CPU. Cache memory stores the most recently accessed data and instructions, allowing future accesses by the processor to that particular data and those instructions to be handled much faster. Some architectures, like AMD's Quad-Core Opteron presented in Figure 2.2, have up to three different cache levels (L1-L2-L3). The lower the level, the smaller and faster the memory is. The Opteron processor has separate instruction (64 KB) and data (64 KB) L1


caches for every processing core. As illustrated in Figure 2.2, the 512 KB L2 caches are also private to every core, but the biggest, 8 MB L3 cache is a shared memory for all cores [1].

Figure 2.2: Quad-core AMD Opteron Processor [1].

• Main Memory

After the caches comes the main memory in the hierarchy. This is usually Dynamic Random-Access Memory (DRAM), which means it is relatively cheap, but still quite fast [35]. Existing mainstream smartphones and computers implementing either the ARM or x86 architecture typically have a main memory of between 1-4 GB [9][10]. The DRAM can easily be upgraded on most computer systems, making it a quick and cheap way of increasing the performance of the system. This may, for instance, allow older computers to be upgraded with modern operating systems, as the newer operating systems often have higher main memory size requirements.

• NUMA

This layer of the memory hierarchy is usually not present in a mainstream computer architecture, but will most likely be in the future, as the core count


is constantly growing also for home computers. Today, systems like the AMD Opteron (see Figure 2.2), which is designed for high-performance computing and servers, implement a Non-Uniform Memory Access (NUMA) architecture. This means that all cores have local main memory that is near the processing unit, which results in fast memory access times for all cores to their local memory. But all memory nodes are also accessible by the other, more distant cores. This makes the memory-core relationship remote, and the access times are much slower compared to accessing local memory. The speed of both local and remote accesses always depends on the particular setup, with the different physical locations of the CPUs and memories [42].

• Virtual Memory

Virtual memory makes use of both primary and secondary storage. It simulates a bigger main memory with the help of the slower and cheaper secondary storage. This makes it easier to develop applications, as the programs have access to one big chunk of virtual memory and all fragmentation is hidden [35].

2.2.2 Secondary Storage

The bottom half of the memory hierarchy in Figure 2.1 is composed of the secondary storage. We will only describe the next two layers of the hierarchy, because the rest of the layers are more distant memory systems that are actually not immediately accessible by the computer. So, the next parts in the hierarchy are:

• File Storage

The File Storage layer in computers usually consists of the Hard Disk Drive (HDD), which is a magnetic data storing device. The capacity of a file storage device is significantly higher than the capacity of the primary storage, but accessing the file storage is substantially slower than accessing any primary storage [35].

• Network Storage

The last layer is the Network Storage. Here data is stored on an entirely different system, but is still immediately accessible by the computer. Now the bandwidth of the network plays a significant role in the speed and throughput of data transfer [35].


2.3 Locality

The principle of locality is what makes the use of cache memory worthwhile, as the cache keeps the most recently used data in a fast memory close to the CPU [35]. There are two common types of locality of reference used in computer architectures: temporal locality and spatial locality. The concept of temporal locality is that if a value is referenced, it is probably going to be referenced again in the near future, as this is the standard case in most running programs. Spatial locality occurs when one arbitrary memory address is referenced: the physical locations close to it are probably also going to be referenced soon, because this, again, is usually the case in an executing program. Therefore, data is moved to the faster memory of a computer system when these conditions are met.
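As a minimal illustration (not taken from the thesis; the array size is arbitrary), the following C fragment exhibits both kinds of locality: the accumulator sum is reused on every iteration (temporal locality), and the matrix is traversed in row-major order so consecutive addresses are touched (spatial locality):

#include <stdio.h>

#define N 512

static int a[N][N];

int main(void)
{
    long sum = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];   /* walks memory sequentially: cache friendly */

    printf("%ld\n", sum);
    return 0;
}

Swapping the two loops would touch addresses N integers apart on every access and would typically cause far more cache misses.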

2.4 Shared Memory

Shared memory exists already at the cache level, as seen with the shared L3 cache of the Opteron processor in Figure 2.2. If we are dealing with multi- or many-core architectures, the main memory will also be a shared memory. Communication and synchronization between programs running on two or several CPU cores is done with the help of shared memories, but as there are several independent systems acting on one space, some issues occur, described next, involving both symmetric multiprocessing and coherence.

2.4.1 Symmetric Multiprocessing

If a computer hardware architecture is designed and built as a Symmetric Multiprocessing (SMP) system, one shared main memory is used, which is seen by all cores. All cores are equally connected to the shared memory, as seen in Figure 2.3, and as the number of CPU cores grows, the communication between the CPUs and the single shared memory will also grow. This leads to an interconnect bottleneck, as the CPUs have to wait for the memory and connection to be ready before they can continue working [42].


Figure 2.3: Simple SMP System.

2.4.2 Coherence

The other problematic part of shared memory is coherence. The memory is accessed and modified by several cores, which most likely cache data. This means that a CPU copies data to its cache memory, and when the data in the memory is later modified (see Figure 2.4), all the cache memories that have a copy of that memory location must also be updated in order to keep all the copies of the data the same and the whole system up-to-date. This is handled with the help of a separate system that makes sure that the consistency between all the memories is maintained. The system has a coherency protocol, and depending on the system, the protocol can be implemented in different ways.

The hardware coherency protocol found in some systems, like the ARM Cortex-A9 MPCore, uses a snoop control unit that is "sniffing" the bus to keep the cache memory lines updated with the correct values. The ARM CPU has a separate Snoop Control Unit (SCU) that handles and maintains the cache coherency between all the processors [27].

Figure 2.5 and Figure 2.6 exemplify a situation where the CPUs cache data from the main memory (X). The setup is a dual-core system with a separate unit snooping the bus, as illustrated in Figure 2.4. There are two common types of bus snooping methods: write invalidate and write broadcast (write update). The most common protocol is write invalidate, and a scenario with this method is described in Figure 2.5. The other method, write broadcast (see Figure 2.6), updates all the cached copies as a write occurs to the data that is cached. The figures describe the processor and bus activity, and the contents of the memory and the caches are shown after each step [2].


Figure 2.4: Cache Coherence in a Dual-core CPU.

1. CPU A reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU A's cache.

2. CPU B reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU B's cache.

3. CPU A writes 1. CPU A's cache is updated to 1. A cache invalidate is sent to CPU B's cached copy.

4. CPU B reads X. A cache miss occurs as the cache is invalidated. The content of CPU A's cache (1) is copied to memory X and CPU B's cache.

Figure 2.5: Write invalidate bus snooping protocol [2].


1. CPU A reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU A's cache.

2. CPU B reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU B's cache.

3. CPU A writes 1. CPU A's cache is updated to 1. A bus broadcast occurs and the content of CPU B's cache and memory X is updated to 1.

4. CPU B reads X. The data is located in the local CPU B cache.

Figure 2.6: Write broadcast bus snooping protocol [2].


CHAPTER

THREE

INTERCONNECT

3.1 Introduction

A computer system consists of many hardware parts that are physically connected to each other so they may exchange data. These electrical connections inside the circuits are called the interconnects. As the traffic between some of the hardware parts is constant, the interconnect needs to be able to move a huge amount of data quickly. Energy efficiency is also a common interconnect requirement, making the desired design even harder to accomplish, as it also has to be able to move all the data with a low amount of energy. The high-performance interconnect connecting the CPU and memory together is what we are mostly interested in, and the research concerning the interconnect is very important performance-wise. In fact, the interconnect is the single most important factor of a computer architecture when dealing with high-performance computers [31].

3.2 Multiprocessing

Multiprocessing became a natural step in the evolution of computer architecture, as the frequency of a single-core computer was reaching maximum performance, with extremely high frequencies hard to top. A high frequency also makes the CPU less power efficient, and heat dissipation harder and more expensive. The next step was simply to put several cores on one CPU in


order to achieve higher performance. The effect of many-core architecture can be seen as a stress on the interconnect and memory on a whole new scale, forcing the whole system to adapt technologically if the full advantage of the processing power of all the cores is to be utilized. Today, multi-core microprocessors have been used in desktop computers for some five years. The technology in mainstream computers has advanced from two to six cores on a single die. Recently, multi-core architecture has also been adopted in low-power embedded systems and mobile devices, like smartphones and tablets [5].

3.3 Network Topology

When discussing the distinct arrangement of interconnected computer parts, like memory and CPUs, the hardware parts are usually depicted as nodes and the physical relationships, i.e. the interconnect, between the nodes are drawn as branches. A specific setup of nodes and branches represents one network topology. Figure 3.1 shows some general topologies used in computer architectures, and depending on system requirements, one topology might be much more efficient and suitable than another [3].

The network topology of the memory hierarchy we discussed in Section 2.2 could be seen as the Linear topology illustrated in Figure 3.1, since the memory hierarchy connects different memory subsystems in a linear fashion. Another example of a network topology is the SMP system discussed and illustrated in Section 2.4.1. The SMP topology corresponds to the Bus topology in the figure.


Figure 3.1: Common Network Topologies [3].

3.3.1 High-performance Computing

The fastest computers in the world are ranked in the TOP500 project. Currently the most powerful computer is a Japanese supercomputer known as the K Computer [11]. The K Computer, produced by Fujitsu at the RIKEN Advanced Institute for Computational Science, implements a Tofu interconnect (see Figure 3.2), which is a 6-dimensional (6D) custom mesh/torus architecture [25]. This means that one node somewhere in the center of the topology is connected to the nodes on the left and right (2D), front and back (2D) and also to the ones that are above and under (2D), hence the 6D.


Figure 3.2: 6D Mesh/Torus Architecture [25].

3.3.2 Intel QuickPath Interconnect and HyperTransport

The world's largest semiconductor chip maker, Intel [12], relies on its own interconnect technology (see Figure 3.3), which it calls QuickPath Interconnect (QPI). This is the key technology maximizing the performance of Intel's x86 multi-core processors. The QuickPath interconnect uses fast point-to-point links between CPU cores inside the processor, and outside the processor the same links are used to connect memory and other CPUs [13].

Another similar interconnect using point-to-point links is HyperTransport (HT). The technology is developed and maintained by the HyperTransport Consortium. AMD's Opteron processor (Figure 3.4) uses HyperTransport bidirectional links to interconnect several CPUs and memory [31].

3.3.3 Arteris Interconnect

The Arteris interconnect is a NoC (Network-on-Chip) interconnect for SoCs (System-on-Chip). Its key characteristics are low power and high bandwidth, making it suitable for modern mobile devices implementing complex SoC dies with an increasing number of IP blocks. Therefore many companies have chosen Arteris as the interconnect for their devices. Also, ARM Holdings has invested in Arteris, making it perhaps even more interesting when taking, other


than architectural specifications, also the business and partner perspective into account. One of Arteris' big customers is Samsung, which has selected Arteris interconnect solutions for multiple chips. One of Arteris' interconnect products is called FlexNoC, which is designed for SoC interconnects with low-latency and high-throughput requirements, and it supports the broadly used ARM AMBA protocols [14]. Another vendor using the Arteris interconnect is Texas Instruments. The L3 interconnect inside the TI OMAP 4 processors connects high-throughput IP blocks, like the dual-core ARM Cortex-A9 CPU [41].

3.3.4 Open Compute

Some interesting server architectural work is done by Facebook under a project named Open Compute. They have designed and built an energy-efficient data center which is cheaper and more powerful than other data centers. The Open Compute project uses both Intel and AMD motherboards in the servers, and both motherboards are stripped of many features that are otherwise found in traditional motherboards. Still, perhaps the most exciting part is that this project is open source, meaning they share everything [4]. The functional block diagram and board placement of the Intel and AMD motherboards are illustrated in Figure 3.3 and Figure 3.4.

Both the Intel and AMD board diagrams show that the different processors have separate main memory located near the processing units, so they may quickly access the data. QuickPath (QPI) and HyperTransport (HT) interconnect technologies are used on the boards as the physical connection linking the separate processors and memory.


Figure 3.3: Open Compute Motherboard based on Intel CPU and QuickPath Interconnect [4].


Figure 3.4: Open Compute Motherboard based on AMD CPU and HyperTransport Interconnect [4].

3.4 ARM Architecture

ARM Holdings is a Cambridge-based company designing the popular 32-bit ARM instruction set architecture. The current instruction set is ARMv7, and it is implemented in most smartphones and tablets on the market today. One of the key features of the ARM architecture is its excellent power efficiency, which makes the architecture suitable for portable devices. ARM operates by licensing its designs as IP rather than manufacturing the processors itself. Today there are several companies building ARM processors, among others Nvidia, Samsung and Texas Instruments [15].


3.4.1 Cortex-A Series

The next version of ARM's popular Cortex-A series SoC is described on ARM's webpage as:

"The ARM Cortex-A15 MPCore processor is the highest-performance licensableprocessor the industry has ever seen. It delivers unprecedented processing capability,combined with low power consumption to enable compelling products in marketsranging from smartphones, tablets, mobile computing, high-end digital home, serversand wireless infrastructure. The unique combination of performance, functionality,and power-efficiency provided by the Cortex-A15 MPCore processor reinforces ARM’sclear leadership position in these high-value and high-volume application segments."[16]

Figure 3.5 shows an image of the upcoming processor from ARM Holdings. Some of the new features found in the processor are shown in the block diagram: the Snoop Control Unit (SCU) and the 128-bit AMBA 4 coherent bus it enables are perhaps some of the most interesting. Also, what is particularly exciting about the Cortex-A15 is that the 1.5 GHz - 2.5 GHz quad-core configuration of the architecture is specifically designed for low-power servers [16].


Figure 3.5: Next Generation ARM SoC [5].

3.4.2 Advanced Microcontroller Bus Architecture (AMBA)

All System-on-Chip (SoC) integrated circuits from ARM use the Advanced Microcontroller Bus Architecture (AMBA) as the on-chip bus interconnect. The most recent AMBA protocol specification is the Advanced eXtensible Interface (AXI), which is targeted at high-performance systems with high-frequency and low-latency requirements. The AMBA AXI protocol is backward-compatible with the earlier AHB and APB interfaces. The latest AMBA 4 version adds several new interface protocols. Some of these are the AMBA 4 Coherency Extension (ACE), which allows full cache coherency between processors, ACE-Lite for I/O coherency, and AXI 4, which is designed to increase performance and power efficiency [28].

3.4.3 Calxeda 120 x Quad-core ARM Server Chip

One promising attempt, other than the yet unreleased Cortex-A15 processor, to change the x86-dominated server market is being made by a company named Calxeda, in which ARM Holdings has shown interest by investing in it. They


are building a server chip based on ARM Cortex-A9 MPCore processors. The architecture is based on a standard 2 rack unit (2U) server with 120 quad-core ARM processors. Each ARM node will only consume 5 W of power, which is a lot less than any other x86 server [17].


CHAPTER

FOUR

NON-UNIFORM MEMORY ACCESS (NUMA)

4.1 Introduction

The Non-Uniform Memory Access architecture is designed for systems with many CPUs. The problem with traditional Symmetric Multiprocessing (SMP) is that it does not scale very well, because all traffic between the cores and memory goes through one place. NUMA is specifically designed to address this issue that occurs in large SMP systems, and solves it with an architecture where separate memory nodes are directly connected to all the CPUs [42]. A simple NUMA system is illustrated below in Figure 4.1.

Figure 4.1: The Simplest NUMA System [42].


4.2 Hardware

A full NUMA system consists of special hardware (CPU, motherboard) that supports NUMA. There are many different types of motherboard architectures for one CPU family. Below, in Figure 4.2, we can see four different topologies for the AMD Opteron CPU. The block diagram is an approximation of the real motherboards. The interconnect between the processors follows a pattern where all processors are connected to two other processors and to their local memory. For some architectures the interconnect is obvious (see Tyan K8QS Thunder Pro S4882), but for the other architectures with a more irregular setup, the interconnect could be manufactured in different ways. As the exact block diagrams of the interconnect were not found, the interconnect has been left out of the figure.

As the distance between the cores and their remote nodes varies a lot between the different architectures, the performance is also going to be different. Some motherboards might be suitable for a specific application that does not need a significant amount of memory, but has much traffic between the memory and the processing unit. Another architecture might be optimal for a specific server that has separate computing-intensive applications running on all the CPUs, where all the applications need a lot of shared memory.


Figure 4.2: Different Motherboard Topologies for the Quad-core AMD Opteron CPU.

A NUMA system without the NUMA hardware could basically be implemented with the help of virtual memory (see Section 2.2.1). Most systems have a Memory Management Unit (MMU), which is a hardware part that all memory transactions from the CPUs go through. The virtual addresses from the CPUs are translated by the MMU into physical addresses. This way, a computer cluster without NUMA hardware could take advantage of a programmable MMU and virtual memory to run a software NUMA system which uses both local and remote memory, where remote memory would be the memory of another computer connected to the same cluster.


4.3 Software

To achieve a fast NUMA system, the Operating System (OS) running the software part also has to be NUMA aware. It is equally important to have NUMA-aware software as it is to have the physical NUMA hardware. The kernel, which is the main component of an OS, has to allocate memory to a process in the most efficient way. To do this, the kernel needs to know the distances between all the nodes and then calculate and apply an efficient memory policy for all the processes. The scheduler is a software part of every OS kernel that handles accesses to different resources in the system. Some schedulers, like the most recent Linux scheduler, use different priority levels that are given to tasks. This way important and real-time tasks can access resources like the processor before other, less important tasks. The schedulers also use load balancing in order to evenly distribute the workload to all the processors. This basically means that for a NUMA-aware OS to work efficiently on NUMA hardware, the scheduler also needs to be able to parse the distance information of the underlying NUMA system. The tasks should then be distributed accordingly and executed with NUMA scheduling and an efficient memory placement policy [30].

As an example, if a fair scheduler were not aware of an underlying quad-core NUMA hardware with 2 GB of local main memory per CPU, the tasks would be evenly distributed to the four different processors, making them all work. As the OS would not be aware of the different physical locations of the main memory, the tasks would be executed on the processor with the most free time. This would result in inefficient memory usage: since only one of the four memory nodes is local to any given CPU, a random memory access would be remote, and thus inefficient, 75 % of the time.

4.4 Advanced Configuration and Power Interface

Some major companies (HP, Intel, Microsoft, Phoenix and Toshiba) have together engineered a standard specification called the Advanced Configuration and Power Interface (ACPI). It is an open standard, for the x86 architecture, which describes the computer hardware and power management to the operating system. The ACPI defines many different tables that are filled with useful information that the OS reads and uses. Some of these tables hold information about the memory and CPUs and the distances between these on a NUMA machine. These tables are what we are interested in, and they are described next [34].


4.4.1 System Locality Information Table

One of the two important tables in the ACPI specification concerning NUMA hardware is the System Locality Information Table (SLIT). The SLIT is an optional ACPI table that holds information about the distances between all the processors, memory controllers and host bridges. The table holds the distances in a matrix over all the nodes. The unit of distance is relative to the local node, which has the value 10. The distance between node 1 and node 4 could for instance be 34. This would mean it takes 3.4 times longer for node 1 to access node 4 than to access its own local memory. Figure 4.3 gives the SLIT specification [34].
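As an illustration (the values below are invented and not taken from any real firmware), the distance matrix of a four-node machine could look as follows, where the entry on row i and column j gives the relative cost for node i to access memory on node j:

          node 0   node 1   node 2   node 3
node 0        10       20       20       34
node 1        20       10       34       20
node 2        20       34       10       20
node 3        34       20       20       10

An off-diagonal value of 20 means that a remote access costs twice as much as a local access, and 34 means 3.4 times as much.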


Figure 4.3: ACPI System Locality Information Table (SLIT) [34].

4.4.2 Static Resource Affinity Table

The other vital ACPI table needed to export NUMA information is the Static Resource Affinity Table (SRAT). The SRAT (Figure 4.4) describes and stores a list of the physical location information about the CPUs and memory. As the


Figure 4.4: ACPI Static Resource Affinity Table (SRAT) [34].

SLIT holds the distances between all the nodes, the SRAT actually describes which and where these nodes physically are [34].

4.5 The Linux Kernel

The Linux kernel has, since 2004, had support for Non-Uniform Memory Access on some architectures [30]. The kernel uses the ACPI SLIT definition to get the correct NUMA distances and then applies a certain NUMA policy to achieve optimal performance [26].

4.5.1 NUMA Aware Scheduler

A fundamental part of every modern OS kernel is the scheduler. In a multi-core system the scheduler has to decide which process should run on which CPU, and dynamic load balancing is done by the scheduler, as it can migrate


processes from one core to another. On a NUMA system, where memory access times depend on which CPU accesses which memory, scheduling becomes even more complex, yet more important [18].

As of kernel 2.5, the scheduler has been a multi-queue scheduler which implements separate runqueues for every CPU core. It was called the O(1) scheduler, but it still had no awareness of NUMA nodes until later, when parts of the O(1) scheduler and parts of another NUMA-aware scheduler were combined into a new, improved scheduler [18].

The current Linux scheduler is named the Completely Fair Scheduler (CFS). The CFS scheduler maximizes CPU utilization and schedules tasks fairly among all the available cores in order to maximize performance. In situations where the number of running tasks is less than the number of logical processors, the scheduling can be tuned with a power saving option. The sched_mc_power_savings option is disabled by default, but can easily be enabled with:

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

This will change the scheduler behavior, so that new tasks are distributed to other processors only when all cores of the first processor are fully loaded and cannot handle any new tasks [19].
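The current value of the option can be read back from the same sysfs file, for example:

cat /sys/devices/system/cpu/sched_mc_power_savings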

4.5.2 Memory Affinity

Memory affinity is achieved when the memory is split into spaces, and these spaces are then made accessible to predefined CPUs. In a NUMA system this affinity is called node affinity, where the kernel tries to keep a process and its children running on a local node [26].

4.5.3 Processor Affinity

In Linux, a program called taskset can be used to retrieve or set a process's CPU affinity. Using the Process IDentifier (PID) of a process, the taskset utility can be used to bypass the default scheduling applied by the Linux scheduler. The program can also be used to run a command with a given CPU affinity [37]. As an example, the command below sets a CPU affinity for program1, forcing it


to use only CPU 3:

taskset -c 3 program1
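taskset can also be used to inspect or change the affinity of a process that is already running, through its PID. A brief sketch (the PID 1234 is only a placeholder, and the exact output format depends on the taskset version):

taskset -p 1234          # show the current affinity mask of process 1234
taskset -cp 0,2 1234     # restrict process 1234 to CPUs 0 and 2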

4.5.4 Fake NUMA Nodes

If a system lacks NUMA hardware, the 64-bit Linux kernel can be built with options that enable a fake NUMA configuration. The kernel does not have fake NUMA enabled by default, but users can manually compile the kernel with the two following options:

CONFIG_NUMA=y and CONFIG_NUMA_EMU=y

The final step to start the Linux kernel with a fake NUMA system is to modify the boot loader with: numa=fake=x, where x is the number of NUMA nodes. This way the kernel splits the memory into x equally large parts. Alternatively, the size of the NUMA nodes' memory can be specified with: numa=fake=x*y, where y is the size of the memory nodes in MB. As an example, we could start a system with 4 CPU cores and 3 GB of memory. If we want to split the memory into 2 nodes of 512 MB each and 2 other nodes with 1 GB each, we start the kernel with:

numa=fake=2*512,2*1024
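Assuming the numactl package is installed on the target system, the resulting fake node layout and distance matrix can be inspected after boot with, for example:

numactl --hardware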

4.5.5 CPUSET

The Linux kernel includes a feature called cpusets. Cpusets provide a useful mechanism for assigning a group of processors and memory nodes to certain defined tasks. A task has a cpuset which forces the CPU and memory placement policy to follow the current cpuset's resources. Cpusets are especially useful with large many-core systems with complex memory hierarchies and NUMA architecture, as scheduling and memory management become increasingly hard on these systems. The cpusets represent different-sized subsystems that are especially useful on web servers running several instances of the same application. The default kernel scheduler uses load balancing across all CPUs, which actually is not a good option for at least two specific types of systems [32]:


1. Large systems

"On large systems, load balancing across many CPUs is expensive. If thesystem is managed using cpusets to place independent jobs on separatesets of CPUs, full load balancing is unnecessary." [32]

2. Real-time systems

"Systems supporting realtime on some CPUs need to minimize systemoverhead on those CPUs, including avoiding task load balancing if thatis not needed." [32]

Below is an example from the documentation where a sequence of commands will set up a cpuset named "Charlie" containing CPUs 2 and 3 and Memory Node 1, and after that start a subshell 'sh' in that cpuset [32]:

mount -t cgroup -ocpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpus
/bin/echo 1 > mems
/bin/echo $$ > tasks
sh
# The subshell 'sh' is now running in cpuset Charlie
# The next line should display '/Charlie'
cat /proc/self/cpuset

4.5.6 NUMACTL

When running a NUMA-configured machine in Linux, the cpusets feature can be extended with another program called numactl. As one NUMA node typically consists of one CPU and one memory part, the separate NUMA policy feature is necessary, since with cpusets the CPU does not necessarily have local memory. Using numactl, one can set a certain NUMA policy for a file or process in the current cpuset and in that way extend the memory management to be more optimized for a certain application on a NUMA architecture [36].

An example of using numactl, where the process execute is run on node 3 with memory allocated on nodes 1 and 5:


numactl --cpubind=3 --membind=1,5 execute
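Another commonly used policy is interleaving, which spreads a process's memory pages round-robin over a set of nodes. As a sketch (whether this helps depends entirely on the application's memory access pattern), the same process could be started with its memory interleaved over all nodes with:

numactl --interleave=all execute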


CHAPTER

FIVE

IMPLEMENTATION

This chapter explains the work process of this thesis: a short description of what we tried to accomplish and which tools we used, ending with the obtained results. A conclusion with discussion is included in the next and final chapter.

5.1 Approach

The approach has from the beginning been to explore the capabilities of the Simics simulator, to set up a suitable target machine with Simics, and to evaluate the performance of certain multi-core architectures with a focus on the hardware interconnect and memory topology. Later, as the NUMA design was analyzed more closely, the work concentrated on the Linux kernel and its NUMA features (Section 4.5).

5.2 Simics Full-system Simulator

Wind River Simics is a simulator capable of fully emulating the behavior of computer hardware, meaning one can install and run unmodified operating systems and software on the supported architectures (x86, ARM, MIPS, ...) [20].


5.2.1 History and Features

Wind River, a subsidiary of Intel Corporation, has been producing software for embedded systems since 1981 and has its technology in more than one billion products. Simics was first developed at the Swedish Institute of Computer Science, and later, in 1998, Virtutech was established so that commercial development of Simics could be started. In 2010 Virtutech was acquired by Intel, and Simics is now marketed through Wind River Systems [21].

Simics is a fast simulator that has the capability to save the state of a simulation. This state can later be opened again and the simulation can be continued from the same moment. Every register and memory location can easily be viewed and modified, and the whole system can even be run backwards in order to find a bug. A system can conveniently be built with a script file which can include information like memory size, CPU frequency, core count and network definitions like IP and MAC address. A system with two cores and cache memory running a program in Linux is monitored (Figure 5.1) with the help of the data caches. Both cores have private data caches, and the read and write transactions to the caches are plotted. This way a user can visualize the system better and see which CPU core is doing all the work. In Figure 5.1 the program is single-threaded, so all calculations seem to first be executed on Core-0 and later, at approximately 62007 simulated seconds, the thread is migrated to Core-1.

Figure 5.1: Cache Read and Write of Core-0 and Core-1.

It is possible to run several targets at the same time in Simics, and these can, for instance, easily be interconnected with Ethernet. The minimum latency for a distributed system can be changed on the Simics Command Line, and if we want a system with 50 milliseconds latency, the following command should be


used:

running> set-min-latency 0.05

The traffic in a distributed system can be monitored in Simics by enabling pcapdump capture. At one point we looked at the traffic between two Erlang nodes running on different targets, communicating with message passing. The pcapdump was enabled on the service node which provided the virtual network. The traffic data was written to a file, which we analyzed on the host OS in a program called Wireshark.

The idea was to get a more accurate understanding of the traffic that occurs in a distributed Erlang system. In order to research and evaluate different interconnects, the whole system and its traffic need to be carefully examined. But as it would have required a lot more time and resources to fully achieve the task of correctly analyzing the traffic in a distributed Erlang system running cloud-based server applications, the work was canceled. Still, as the feature was found and tested to work, the knowledge could help give direction to future work.

5.2.2 Working With ARM Architecture

As earlier research in the Cloud Project has proposed a low-power server architecture implementing the ARMv7 instruction set, we initially used ARM targets in Simics. Some difficulties were met during the work with the ARM target machine; these are described below:

1. Full-featured Distribution

The need for a full-featured operating system running on the ARM target was apparent, as the benchmarks we are using in our research are implemented in the Erlang programming language, which requires, among other things, GNU components and Perl. The target machine available was a Single-Board Computer (SBC) from Technologic Systems (TS-7200) running a compact Linux distribution with some Unix tools available through BusyBox. The TS-7200 is based on an old ARM architecture with the ARM9 CPU. Booting a full Linux distribution is still possible from the CompactFlash (CF) card, and Debian 3.1 Sarge is available to download for the TS-7200. The installation of the Debian Sarge distribution was therefore attempted. The different phases of booting a full distribution on the Simics

35

Page 43: Simulating Non-Uniform Memory Access Architecture for ...users.abo.fi/Sebastien.Lafond/theses/thesis_Joakim.pdf · SIMULATING NON-UNIFORM MEMORY ACCESS ARCHITECTURE FOR CLOUD SERVER

ARM target, are described next. First step was to create a new empty rawfile. This was done in Linux with the dd tool:

$ dd if=/dev/zero of=/home/joaknylu/image bs=1k count=262144

Next, a file system has to be created:

$ /sbin/mkfs.ext3 -N 20480 /home/joaknylu/image

After this we loopback mount it with:

$ sudo mount -o loop,rw /home/joaknylu/image /home/joaknylu/mount-folder/

Now the downloaded distribution can be extracted into the mount:

$ sudo tar jxvf /home/joaknylu/debian-sarge-256MB-05-01-2006.tar.bz2 -C /home/joaknylu/mount-folder/

After this, we have to unmount the file system inside the file:

$ sudo umount /home/joaknylu/mount-folder/

And finally use the Simics craff tool to compress and create the image file needed:

$ ./craff /image -o debian-256mb-sarge.craff

In this way the downloaded distribution was compressed and converted to a file that should be bootable in Simics. However, the distribution could not be started in Simics; it turned out that more files were needed. The image used by Simics to boot the distribution correctly has to be made up of several parts, including not only the distribution image but also the kernel (bzImage) and the bootloader. This made a simple-sounding task hard, and it was never fully achieved.

2. Memory Management Unit (MMU)


The TS-7200 SBC target has an Ethernet interface, allowing several machines to be easily interconnected with Ethernet in Simics. As the ARM9 CPU has a Memory Management Unit (MMU), we looked at taking advantage of virtual memory (see section 4.2 for details), and the implementation of a cluster NUMA architecture in software was analyzed. Memory could quickly be shared among several interconnected targets running in Simics with the Simics Command Line utility. The following commands edit the memory-space and force one of the two targets to use both local and remote memory, where the remote memory is the one located on the other board:

running> board_cmp0.phys_mem.del-map base = 0x0 device = board_cmp0.sdram

running> board_cmp0.phys_mem.add-map base = 0x0 device = board_cmp1.sdram offset = 0x0 length = 0x800000

Running these commands makes the target crash, so a cluster NUMA architecture cannot be simulated this easily. The obvious problem is that the operating systems running on the two targets are not aware of each other, so the kernel crash is expected. In order for this to work, some sort of device would have to be created that acts as a physical device monitoring the traffic to the memory. Figure 5.2 below illustrates how a cluster NUMA system would work, but regardless of the issues with the hardware simulation, the problem of the missing full-featured distribution still remained.

Figure 5.2: Cluster NUMA Architecture.

3. Multiprocessing

Multiprocessing architecture is a logical part of every cloud-based system, so the need to simulate such a system is a necessity. As the ARM TS-7200 target only has one CPU, and no newer models with ARM SMP are available in the Simics (4.2) academic package, an own ARM SMP model would have to be built in order to simulate a multi-core architecture. This could be done, for instance, by including a custom control unit that handles the synchronization of the cores. The need for a full-featured distribution would still remain, and a Linux Board Support Package (BSP) would also have to be written in order to make the custom ARM multi-core architecture run. Such a system could behave like a newer regular ARM SMP, but the architecture would nevertheless not be the same.

5.2.3 g-cache and trans-staller

The g-cache module is a standard cache model used in Simics. It simulates the behavior of the cache memory with the help of the timing model interface. Simics has to be run in stall mode in order for the staller to have an effect. Stall mode is supported for some architectures in Simics, like x86, but it is not supported for the ARM instruction set. In Figure 5.3 the g-cache module represents a memory hierarchy with a two-level cache structure. The id-splitter is a separate module that splits instruction and data accesses and sends them to separate L1 caches. The splitters also prevent accesses from crossing a cache-line boundary by splitting such accesses into two. The L1 caches are connected to the L2 cache, which is finally connected to the trans-staller, which simulates the memory latency [43]. This module is important, as we are interested in the delays, or latency, that always occur when dealing with interconnects.
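
As a rough sketch, a single data cache in front of the staller could be configured in the following way. The attribute names follow the style of the Simics 4.2 g-cache examples, but the object paths, sizes and penalties are placeholders, and this is not the configuration used in this work:

@staller = pre_conf_object('staller', 'trans-staller')
@staller.stall_time = 200                  # main memory latency in cycles (placeholder)
@l1d = pre_conf_object('l1d', 'g-cache')
@l1d.cpus = conf.cpu0                      # hypothetical CPU object; depends on the target
@l1d.config_line_number = 256              # 256 lines of 64 bytes = 16 kB
@l1d.config_line_size = 64
@l1d.config_assoc = 4
@l1d.penalty_read = 3                      # cycles added for a read reaching this cache
@l1d.penalty_write = 3
@l1d.timing_model = staller                # misses continue to the trans-staller
@SIM_add_configuration([staller, l1d], None)
@conf.phys_mem0.timing_model = conf.l1d    # memory-space name is target-specific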

The source of the trans-staller module is available in Simics, so it can be modified. This allows us to change the standard trans-staller behavior into a more complex one. In order to simulate NUMA behavior, the trans-staller needs to be modified so that it returns different memory latencies depending on which core or CPU accesses which memory space, since the different memory spaces are physically located in different places.

5.3 Oracle VM VirtualBox

Figure 5.3: A Two Level Cache System [43].

Oracle VM VirtualBox is an x86 virtualization tool that allows users to install different guest operating systems inside the program. VirtualBox allowed us to quickly install a Linux system and recompile its kernel. We installed Ubuntu Server in VirtualBox and recompiled the kernel with NUMA emulation enabled; this meant that the standard configuration had to be modified so that NUMA and NUMA_EMU were set before compilation. Some programs and benchmarks, including Erlang and numactl, were also installed. One of the most important reasons to use a virtualization tool for this task is that the whole disk image is available and can easily be converted. In this case the VirtualBox Disk Image (.vdi) was converted to a raw disk image (.raw) with the VirtualBox command-line interface, vboxmanage. This makes it possible to use the Simics craff tool to convert and compress the raw disk image into a Simics .craff image file. A Simics script is then modified to open the new craff image file, allowing the simulated machine to be started with the desired distribution and kernel.
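
As an illustration, the conversion chain can be written as the two commands below; the file names are placeholders, not the ones used in this work:

$ vboxmanage clonehd ubuntu-server.vdi ubuntu-server.raw --format RAW
$ ./craff ubuntu-server.raw -o ubuntu-server.craff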

5.4 Erlang

This section presents the Erlang programming language and explains some of its features. Erlang was designed by Ericsson and is well suited for distributed concurrent systems running fault-tolerant applications. Erlang runs programs fast on multi-core computers and has the ability to modify a running application without stopping the program [29]. These features make Erlang suitable for cloud-based servers.


5.4.1 Distributed Programming

Distributed programming in Erlang is done with the help of libraries. Distributed programs run on a distributed system consisting of a number of Erlang runtime systems communicating with each other. The running processes use message passing to communicate and synchronize with each other [29].

5.4.2 Erlang Nodes

Distributed Erlang is implemented with nodes. One can start several Erlang nodes on one machine, or on different machines interconnected over Ethernet. As an example, two nodes are started on the same machine with two terminals open. Each node runs in its own terminal and both nodes have their own name [29]:

Terminal-1: $ erl -sname node1

Terminal-2: $ erl -sname node2

With this setup, the user can use the Remote Procedure Call service (rpc) to call, from one node, a function located on the other node. The rpc call obtains the answer on the remote node and returns it to the calling node.
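
As a small example, assuming the two nodes above run on a machine whose short host name is host (a placeholder), node1 can ask node2 which node it is running on:

(node1@host)1> rpc:call('node2@host', erlang, node, []).
'node2@host'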

5.4.3 Asynchronous Message Passing

As Erlang processes share no memory, they communicate by message passing. The message passing is asynchronous: the sender does not wait for the receiver to be ready. Each process has a mailbox queue where incoming messages are stored until received by the process. In Erlang, the operator "!" is used to send messages. The syntax of "!" is [22]:

Pid ! Message
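
As a small, self-contained illustration (not taken from the benchmarks used in this work), the following module spawns an echo process and exchanges one message with it:

-module(echo).
-export([start/0, loop/0]).

%% Spawn an echo process, send it one message and wait for the reply.
start() ->
    Pid = spawn(echo, loop, []),
    Pid ! {self(), hello},        %% "!" is asynchronous; execution continues immediately
    receive
        {Pid, Reply} -> Reply     %% the reply is read from this process' mailbox
    end.

%% The echo process waits for {From, Msg} and sends the message back.
loop() ->
    receive
        {From, Msg} ->
            From ! {self(), Msg},
            loop()
    end.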


5.5 Results

In our simulations the inclusion of cache memory was irrelevant, so the full g-cache module with separate timing models was not used. Instead, a simpler script was written that uses the trans-staller directly for all memory accesses:

# Connect phys_mem with NUMA staller
@staller = pre_conf_object('staller', 'trans-staller')
@SIM_add_configuration([staller], None)
@conf.cosmo.motherboard.phys_mem.timing_model = conf.staller

The last line connects the staller to the memory space. Simics is then started in stall execution mode to make it possible to stall memory accesses. The trans-staller has been modified to simulate the NUMA behavior of the system: an algorithm was added that checks which CPU id accesses which memory space (see Appendix A). The stall times returned by the trans-staller are thereby adjusted to the corresponding latency of a given NUMA hardware configuration.

In our simulations the system is a 64-bit x86 machine with four CPUs. The operating system is 64-bit Ubuntu Server 10.04 with a recompiled Linux kernel 2.6.32.41 with NUMA emulation enabled. The Linux kernel is started with 4 × 128 MB NUMA nodes. The NUMA scheduling and memory placement policy is set with numactl, and in order to ensure memory locality the policy is simply set to --localalloc, so that only the fastest, local memory is used by the CPUs. For comparison, the tests are also run without a NUMA policy, where allocations are made on an arbitrary node. The hardware is a symmetric NUMA system with three different node latencies. Figure 5.4 below shows the NUMA hardware with a SLIT-like table giving the distance information between the nodes.


Figure 5.4: The hardware for Simics NUMA.

5.6 Analysis of Emark Benchmarks

A set of benchmarks called Emark was used to show how changes in the Linux NUMA memory allocation policies impact the computational performance of a computer system. The Emark benchmarks are used to evaluate the performance of different versions of the Erlang Run-Time System (ERTS) and to see how it performs on different machines. The benchmarks have also been used to evaluate the performance of different platforms running the same version of ERTS. In our simulations two of the Emark benchmarks are used: Bang and Big. The first implements all-to-one message passing and the second all-to-all message passing. The benchmarks measure performance as elapsed time (ms) [40].
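
To make the all-to-one pattern concrete, the sketch below spawns N processes that each send one message to a single receiver and measures the elapsed time. It is an illustration only, not the actual Emark code, and the module and function names are made up:

-module(bang_sketch).
-export([run/1, do_run/1, sender/2]).

%% Returns the elapsed wall-clock time in microseconds for N senders
%% each delivering one message to the calling process.
run(N) ->
    {Micros, ok} = timer:tc(?MODULE, do_run, [N]),
    Micros.

do_run(N) ->
    Parent = self(),
    [spawn(?MODULE, sender, [Parent, I]) || I <- lists:seq(1, N)],
    collect(N).

sender(To, I) ->
    To ! {msg, I}.                %% asynchronous send to the single receiver

collect(0) ->
    ok;
collect(N) ->
    receive
        {msg, _} -> collect(N - 1)
    end.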

The results seen below show the performance of ERTS running on our particular simulated hardware. All figures include three different simulations. The first, fast simulation is run in Simics under normal execution mode, meaning that no timing model is used for the memory accesses. The two other simulations are run under Simics stall mode, where all memory accesses are stalled for a certain number of cycles, depending on which CPU accesses which memory space. The difference between the two simulations run under stall mode is that the first is run without any NUMA policy, while the other uses the Linux user-space shared library interface libnuma, and in particular the command-line utility numactl, to control the NUMA policy. The NUMA policy used for the second simulation is to allocate on the local node. The benchmarks have been run several times in order to identify odd or inconsistent behavior. The results presented here are the data acquired from one randomly chosen run, as the different runs did not show any significant variance between themselves.
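
For reference, the kinds of invocations used to set these policies are shown below, where <benchmark> stands for the actual benchmark command:

$ numactl --hardware                 # list the (emulated) NUMA nodes and distances
$ numactl --localalloc <benchmark>   # always allocate memory on the local node
$ taskset -c 0 <benchmark>           # pin the benchmark to core 0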

Below, in Figure 5.5, we can see the results from the Emark Big benchmark. The benchmark is run on only one core by using the Linux taskset CPU affinity utility, and we can see that the benchmark finishes much faster under normal Simics mode, just as expected. Under stall mode the difference between the two executions is clear. Without any NUMA policy the benchmark takes approximately twice as long to finish, as it allocates memory from slower nodes, whereas the run that uses the NUMA policy and allocates only from the fast local node finishes much faster.

Figure 5.5: All to all message passing on one core.

The second figure shows how fast the benchmark finishes with all four CPUs in use. We can see a similar ratio between the speeds of the two executions as in the previous figure with only one core.


Figure 5.6: All to all message passing on four cores.

A comparison between the first two figures can be seen below in Figure 5.7. The results indicate that the Big benchmark is about five to six times faster when run in stall execution mode on the quad-core machine using all cores, compared to using only one. With the local memory allocation policy set, using four cores is about four times faster than using only one core. This is almost exactly the same as in normal execution mode, where memory accesses happen without any stalling. The lines are almost identical because the stalling factor cancels out of the performance ratio:

Performance Ratio = run1_single_core / run1_quad_core
                  ≈ (run2_single_core × stall_local) / (run2_quad_core × stall_local)
                  = run2_single_core / run2_quad_core

Similar results can be seen with the Bang benchmark. On one core the result is clear and consistent: the benchmark executes twice as fast when allocations are made on the local node.

Again, we can see in Figure 5.9 that local allocation gives the same performance advantage over random allocation when using all four cores.


Figure 5.7: Comparison of one core and four cores (Big).

Figure 5.8: All to one message passing on one core.

Comparing the speed of the single-core setup with the quad-core setup shows a performance increase by a factor of between 1.9 and 2.8.


Figure 5.9: All to one message passing on four cores.

Figure 5.10: Comparison of one core and four cores (Bang).

From these results we can see that the modifications made to the function returning stall times in the trans-staller module work largely as expected. The NUMA policy set with the command-line utility numactl also works as expected, as does the Linux taskset utility. This shows that the Linux kernel has some very useful features that are easily accessible and available.

The simulation environment provided by the Simics full-system simulator is very useful, and despite the lack of built-in support for interconnect simulation, this work shows that Simics can be modified and used for the evaluation of systems with complex memory architectures.


CHAPTER

SIX

CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

Earlier research has suggested that low-power multi-core processors implementing the ARMv7 instruction set architecture could also be used as computational resources in cloud computing. The benefit of using the ARMv7 architecture in cloud servers is its efficient use of energy. The disadvantage of the ARM architecture, compared to other architectures used today in server-based computing, is that in order to serve the same amount of incoming requests, more than ten times as many processing units are needed [40].

This has led the research towards the interconnect and especially the memory architecture. In order to achieve good performance in large systems, the setup and topology of the processing units and memory have to be examined. This work has concentrated on the NUMA architecture and looked into the Linux kernel's support for it. The Simics simulator has been used as the environment for setting up the memory architecture, allowing the evaluation of the performance of different processor and memory affinity features found in Linux. The Erlang Emark benchmarks were used to show the effect the modified Simics timing model has on performance, in combination with the use of different memory and CPU policies in Linux.


As the job of migrating tasks between CPU cores is expensive [32], the use of simpler methods is worth considering. It is also notable that cache misses occur when tasks are migrated. If a task were placed on one core and stayed there, the cache usage would be much more efficient and, hence, the task would execute faster [38]. This is why, in terms of optimal power efficiency for large systems, the use of cheap memory and processor affinity instead of more expensive load balancing should be considered.

6.2 Future Work

In order to optimize the use of memory and processing units, in terms of power consumption and performance, in large systems with Non-Uniform Memory Access (NUMA) architecture, this work has analyzed the different techniques available in Linux to control and modify the default memory and processor placement policies. For this purpose the Simics simulator has been modified and used. Other ongoing research in the Cloud project aims to optimize the power efficiency of the default Linux scheduler. As the next step, that work could be combined with parts of this work (see Figure 6.1).

Figure 6.1: Future Work Illustrated.


The modified power-efficient Linux kernel has currently only been tested in a scheduler simulator. That kernel could be tested in Simics with different timing models and also be compared to, or used in parallel with, other Linux processor and memory placement policies, in order to find an optimal solution for executing cloud server applications efficiently.


BIBLIOGRAPHY

[1] Quad-Core AMD Opteron Processor. http://www.amd.com/us/products/server/processors/opteron/Pages/3rd-gen-server-product-brief.aspx. Online. Read 3 October 2011.

[2] Bus Snooping Protocols. http://www.ece.unm.edu/~jimp/611/slides/chap8_2.html. Online. Read 21 January 2012.

[3] Network Topology. http://www.atis.org/glossary/definition.aspx?id=3516. Online. Read 4 October 2011.

[4] Open Compute Project. http://opencompute.org/. Online. Read 5 October 2011.

[5] ARM Cortex-A15. http://www.arm.com/products/processors/cortex-a/cortex-a15.php. Online. Read 4 July 2011.

[6] TIVIT Oy Homepage. http://www.tivit.fi/en/. Online. Read 27 September 2011.

[7] Cloud Software Program. http://www.cloudsoftwareprogram.org/cloud-program. Online. Read 27 September 2011.

[8] C Programming Expert.com. http://www.cprogrammingexpert.com. Online. Read 30 September 2011.

[9] Nokia N9. http://europe.nokia.com/find-products/devices/nokia-n9/specifications. Online. Read 3 October 2011.


[10] Apple iMac. http://www.apple.com/imac/specs.html. Online. Read 3 October 2011.

[11] TOP500 Fastest Computer. http://www.top500.org/lists/2011/06. Online. Read 4 October 2011.

[12] iSuppli Top-25 Semiconductor Suppliers in 2010. http://www.isuppli.com/PublishingImages/Presspg. Online. Read 5 October 2011.

[13] Intel QuickPath Technology. http://www.intel.com/content/www/us/en/io/quickpath-technology/quickpath-technology-general.html. Online. Read 5 October 2011.

[14] Arteris Homepage - Customers. http://www.arteris.com/customers#Samsung_Mobile_Wireless_NoC. Online. Read 13 December 2011.

[15] Wikipedia: ARM Holdings. http://en.wikipedia.org/wiki/ARM_Holdings. Online. Read 21 January 2012.

[16] ARM Cortex-A15 Processor. http://www.arm.com/products/processors/cortex-a/cortex-a15.php. Online. Read 5 October 2011.

[17] Calxeda Server Node. http://www.calxeda.com/products.php. Online. Read 5 October 2011.

[18] NUMA Aware Scheduler Extensions. http://lse.sourceforge.net/numa/scheduler/. Online. Read 10 October 2011.

[19] Less Watts, Saving Power With Linux, Processor. http://www.lesswatts.org/tips/cpu.php. Online. Read 17 October 2011.

[20] Wind River Simics Homepage. http://www.windriver.com/products/simics/. Online. Read 17 October 2011.

[21] Wikipedia: Simics. http://en.wikipedia.org/wiki/Simics. Online. Read 17 October 2011.

[22] Erlang User's Guide, 3.2 Message Passing. http://www.erlang.org/doc/getting_started/conc_prog.html. Online. Read 18 October 2011.


[23] Cortex-A8 Revision: r2p0 Technical Reference Manual.

[24] Mostafa Abd-El-Barr and Hesham El-Rewini. Fundamentals of Computer Organization and Architecture. 2005.

[25] Yuichiro Ajima, Shinji Sumimoto, Toshiyuki Shimizu, and Fujitsu. Tofu: A 6D mesh/torus interconnect for exascale computers. Technical report, 2009.

[26] Andi Kleen, SUSE Labs. A NUMA API for Linux. Technical report, 2004.

[27] ARM. Cortex-A9 MPCore Technical Reference Manual, revision r2p0 edition.

[28] ARM. AMBA AXI Protocol Specification, 1.0 edition, 2004.

[29] Joe Armstrong. Programming Erlang: Software for a Concurrent World. 2007.

[30] Christoph Lameter, Ph.D., Silicon Graphics, Inc. Local and remote memory: Memory in a Linux/NUMA system. Technical report, 2006.

[31] HyperTransport Consortium. HyperTransport Technology - General Overview. Technical report, 2008.

[32] Simon Derr. CPUSETS. Linux kernel documentation, 2004.

[33] Peter Grun, Niki Dutt, and Alexandru Nicolau. Memory Architecture Exploration for Programmable Embedded Systems. 2003.

[34] Hewlett-Packard Corporation, Intel Corporation, Microsoft Corporation, Phoenix Technologies Ltd., Toshiba Corporation. Advanced Configuration and Power Interface Specification, 4.0a edition, April 2010.

[35] Randall Hyde. The Art of Assembly Language, September 2011. http://webster.cs.ucr.edu/AoA/Windows/HTML/MemoryArchitecture.html. Online. Read 28 September 2011.

[36] Linux numactl man page. NUMACTL.

[37] Linux taskset man page. TASKSET.

[38] J. Mottok, M. Deubzer. Migration Overhead in Multiprocessor Scheduling. University Politehnica.

[39] Michael Matz, Jan Hubicka, Andreas Jaeger, and Mark Mitchell. System V Application Binary Interface: AMD64 Architecture Processor Supplement, Draft Version 0.99.5, 2010.


[40] Olle Svanfeldt-Winter. Energy efficiency of ARM architectures for cloud computing applications. Master's thesis, 2011.

[41] Texas Instruments. OMAP4430 Multimedia Device Silicon Revision 2.x, July 2010 - revised September 2010 edition.

[42] Tony Luck, Principal Engineer, Intel Corporation. Linux scalability in a NUMA world. Technical report.

[43] Virtutech. Advanced Memory and Timing in Simics, Simics version 4.2, 3102 edition, 2011.


SWEDISH SUMMARY

SIMULATING NON-UNIFORM MEMORY ACCESS ARCHITECTURE FOR CLOUD SERVER APPLICATIONS

Introduction

In recent years, processor architecture development has been characterized by a constantly growing number of processor cores. This has resulted in an increase in computational capacity, but in order to fully exploit the available processor capacity, the rest of the system, both hardware and software, has to be updated as well. One of the main factors affecting system performance is the memory, and above all how it is interconnected with multi-core processors. This work examines the NUMA architecture, a particular way of interconnecting multi-core processors and memory, and evaluates the Linux operating system's functionality and support for this architecture. These are tested in a simulated environment that has been modified to behave like a real NUMA system. The work evaluates not only the functionality of the Linux kernel, but also analyses the applicability of the simulation environment itself. The thesis is carried out within the Cloud Software Program, a four-year (2010-2013) research programme led by Tieto- ja viestintäteollisuuden tutkimus TIVIT Oy [6].

Interconnecting memory and processors

Conventionally, memory and processors are interconnected by some kind of bus. This component is found, for example, in ordinary SMP (Symmetric Multiprocessing) systems, i.e. systems with more than one processor core, where it handles, among other things, the traffic between processor and memory. Problems with the SMP architecture arise when the number of processing units grows too large, which means that the traffic between the processors and the memory also increases. Since all data transfers travel over the same bus to and from one and the same memory, neither the memory nor the bus can keep up, and the processors have to wait for data instead of working [42]. One solution to this is the NUMA (Non-Uniform Memory Access) architecture. NUMA systems implement a non-uniform memory architecture in which every processor has its own local memory. This means that each processor can transfer information quickly to and from its local memory, and these transfers can moreover be performed in parallel, since every processor has its own bus to its local memory. If the local memory is not large enough, or the data needed is not located there, the processor can simply use a remote memory that is local to some other processor. That memory will be slower, but can nevertheless be used as usual. When a remote memory is accessed, the latency becomes higher, because the physical distance is longer and the traffic has to pass through several memory controllers [42].

Tools and results

During this work mainly one tool has been used: Simics. It is a full-system simulator originally developed by the Swedish Institute of Computer Science, which later became part of Wind River [21]. The program supports several processor architectures (x86, ARM, MIPS, ...), which makes it possible to run unmodified operating systems and software directly on different simulated architectures [20]. Our interest has been to work with the ARM architecture, since earlier research within the project has shown that this architecture is very energy efficient and therefore well suited for the large, inexpensive data centers used to run cloud applications. One challenge, however, is that in order to reach the same computational performance as a modern server processor used in most servers today, more than ten times as many processors must be used [40]. Therefore the focus has now been above all on the physical interconnect, and we have been interested in examining different ways of connecting several processors and memory.

During the work with the ARM architecture in Simics, however, several problems arose:

• Lack of a full-featured Linux distribution

• Support for multiprocessor architecture

• Lack of a memory management unit

It was important to be able to run a full-featured Linux distribution on the simulated machine in order to run programs and examine the NUMA functionality. These could therefore not be tested, since the ARM machine available in Simics was an SBC (Single-Board Computer). An SBC is a kind of embedded system, a simple computer on a small circuit board, with a less powerful processor and an operating system with few features. Because of this we decided to simulate another architecture in Simics with more possibilities, and by using a simulated machine with x86 architecture it became possible for us to run a full Linux distribution with support for several processors and NUMA emulation.

In order to simulate a NUMA system in Simics, all memory accesses must be delayed. Furthermore, it is not enough to delay every access by a fixed value; the delay should correspond to the waiting time that would occur in a real system, where the physical length of the interconnect varies. We achieved this by modifying a standard Simics module that exists for returning delay values. An algorithm was added that checks which processor accesses which memory, and depending on their simulated placement, different delay values are returned.

Test programs were then run on the simulated architecture in order to evaluate the behavior of the simulated memory architecture. Different tools available in Linux were also used to evaluate its performance. The results showed that the tools available in Linux can be used effectively in a NUMA system to correctly exploit the underlying hardware. In addition, the modified Simics module worked as expected, which may prove useful in the future if we test how the hardware affects different software configurations. Another ongoing work within the project attempts to optimize the default scheduler of the Linux operating system by making it more energy efficient. That scheduler could, for example, be tested in Simics using the modified module.

Conclusion

Today, migration is a common approach in multi-core systems for distributing the work that the processors must carry out. It is used either to spread the work evenly in order to increase performance, or to pack all work onto as few processors as possible in order to increase energy efficiency. The problem with migration and work distribution is that they require a lot of work and resources and are thus expensive [38]. It is therefore also worth analyzing systems where migration is minimized and tasks are instead placed on predetermined processors, where they are allowed to use only local memory that is physically close. As systems grow, migration becomes ever more expensive, and systems with simple processor and memory management become increasingly worthwhile; their use should therefore be considered [32].


APPENDIX

A

SIMICS STALLER MODULE IMPLEMENTATION

The C code below shows parts of the modified Simics staller module implementing the NUMA behavior for four CPUs:


///////////////////////////////////////////////////////////////////////////////////////

#define LOCAL_ACCESS_LATENCY    10  // One hop (10 cycles)
#define NEIGHBOR_ACCESS_LATENCY 20  // Two hops (20 cycles)
#define REMOTE_ACCESS_LATENCY   30  // Three hops (30 cycles)
#define OFF_CHIP_ACCESS_LATENCY 40  // Four hops (40 cycles)

#define MEM_START_0 0           // Start Node0 (0 MB)
#define MEM_START_1 134217728   // Start Node1 (128 MB)
#define MEM_START_2 268435456   // Start Node2 (256 MB)
#define MEM_START_3 402653184   // Start Node3 (384 MB)
#define MEM_START_4 536870912   // Start Off-chip (512 MB)

cycles_t
st_operate(conf_object_t *mem_hier, conf_object_t *space,
           map_list_t *map, generic_transaction_t *mem_op)
{
    simple_staller_t *st = (simple_staller_t *) mem_hier;

    physical_address_t pa = mem_op->physical_address;

    mem_op->reissue = 0;
    mem_op->block_STC = 1;

    int processor_id;

    if (mem_op->may_stall) {                               // Memory accesses stallable?
        if ((pa >= MEM_START_0) && (pa < MEM_START_4)) {   // DRAM (0-512 MB)
            if (SIM_mem_op_is_from_cpu(mem_op)) {          // Memory transaction from CPU
                processor_id = SIM_get_processor_number(mem_op->ini_ptr);

                if (processor_id == 0) {                   // CPU_0
                    // Local access (Node0)
                    if ((pa >= MEM_START_0) && (pa < MEM_START_1))
                        return LOCAL_ACCESS_LATENCY;
                    // Neighbor access (Node1, Node2)
                    else if ((pa >= MEM_START_1) && (pa < MEM_START_3))
                        return NEIGHBOR_ACCESS_LATENCY;
                    // Remote access (Node3)
                    else
                        return REMOTE_ACCESS_LATENCY;
                } else if (processor_id == 1) {            // CPU_1
                    // Local access (Node1)
                    if ((pa >= MEM_START_1) && (pa < MEM_START_2))
                        return LOCAL_ACCESS_LATENCY;
                    // Neighbor access (Node0, Node3)
                    else if (((pa >= MEM_START_0) && (pa < MEM_START_1)) ||
                             ((pa >= MEM_START_3) && (pa < MEM_START_4)))
                        return NEIGHBOR_ACCESS_LATENCY;
                    // Remote access (Node2)
                    else
                        return REMOTE_ACCESS_LATENCY;
                } else if (processor_id == 2) {            // CPU_2
                    // Local access (Node2)
                    if ((pa >= MEM_START_2) && (pa < MEM_START_3))
                        return LOCAL_ACCESS_LATENCY;
                    // Neighbor access (Node0, Node3)
                    else if (((pa >= MEM_START_0) && (pa < MEM_START_1)) ||
                             ((pa >= MEM_START_3) && (pa < MEM_START_4)))
                        return NEIGHBOR_ACCESS_LATENCY;
                    // Remote access (Node1)
                    else
                        return REMOTE_ACCESS_LATENCY;
                } else {                                   // CPU_3
                    // Local access (Node3)
                    if ((pa >= MEM_START_3) && (pa < MEM_START_4))
                        return LOCAL_ACCESS_LATENCY;
                    // Neighbor access (Node1, Node2)
                    else if ((pa >= MEM_START_1) && (pa < MEM_START_3))
                        return NEIGHBOR_ACCESS_LATENCY;
                    // Remote access (Node0)
                    else
                        return REMOTE_ACCESS_LATENCY;
                }
            } else {                                       // Memory transaction is not issued by CPU
                return 0;
            }
        } else {                                           // Off-chip memory reference
            return OFF_CHIP_ACCESS_LATENCY;
        }
    } else {                                               // Not stallable
        return 0;
    }
}

///////////////////////////////////////////////////////////////////////////////////////


APPENDIX

B

EMARK RESULTS

Below are the Emark benchmark results:

Figure B.1: Emark Benchmark Results.
