Jennifer Moore
Pipeline
Pipelining is a technique used in the Xeon Phi processor that divides the fetch and
execute cycle into several steps. Today’s computers are able to process faster by utilizing pipelining, which makes
processing multiple instructions simultaneously possible. Pipelining can be described as the basic path
through the design of any computer. Shantanu Dutt explains a pipeline as a “concept in which the entire
processing flow is broken up into multiple stages, and a new data/instruction is processed by a stage
potentially as soon as it is done with the current data/instruction, which then goes onto the next stage
for further processing” (Dutt, 2001). Pipelining is similar to the Very Long Instruction Word approach in
that both use parallelism: different steps work together in parallel, so the user can have more than one
instruction in progress at the same time.
These steps describe how a fetch and execute cycle is carried out. For example, each step can be
compared to the instructions given to the Little Man Computer so it can execute an operation. Each of
these steps is timed and can introduce a delay depending on its result (Englander, 2009). Pipelining is
similar to the Little Man Computer; however, the LMC uses different operation codes to determine the
outcome of a number input by the user, whereas pipelining sets up a path that requests data on a timed
basis (Englander, 2009).
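The stage-by-stage flow described above can be sketched in a few lines of Python. The four stage names here are generic textbook stages, not the Xeon Phi's actual pipeline, and the timeline model assumes an ideal pipeline with no stalls.

```python
# A minimal sketch of pipelined instruction flow (illustrative only; the
# stage names are assumptions, not the Xeon Phi's real pipeline stages).

STAGES = ["fetch", "decode", "execute", "store"]

def pipeline_timeline(instructions):
    """Return {cycle: [(stage, instruction), ...]} for an ideal pipeline
    with no stalls: each instruction advances one stage per cycle."""
    timeline = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            timeline.setdefault(i + s, []).append((stage, instr))
    return timeline

timeline = pipeline_timeline(["LOAD", "ADD", "STORE"])
# With 4 stages and 3 instructions, cycles run 0..5: 4 + 3 - 1 = 6 in total.
print(len(timeline))  # 6
```

Note that on cycle 2 all three instructions are in flight at once, one per stage, which is exactly the simultaneity the Dutt quotation describes.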
Disk Cache
A disk cache is made up of main memory or memory integrated within most newer disk
drives. The disk cache makes it possible to access information from the disk faster by storing frequently
used data in temporary memory where it is promptly accessible. Englander explains, “when a disk read or
write request is made, the system checks the disk cache first. If the required data is present, no disk
access is necessary; otherwise, a disk cache line made up of several adjoining disk blocks is moved from
the disk into the disk cache area of memory” (Englander, 2009). Caching allows the system to
temporarily store commonly used data where it can be quickly retrieved without accessing the disk.
The diagram below shows the server accessing data from the disk cache. If the server
finds the requested data in the disk cache, it does not have to access the disk. When the data is
accepted and stored, the Vickovic, Celar, and Mudnic article explains that “when the request is stored, the
amount of free space on Disk Cache is decreased and it is pushed on cache queue” (Vickovic, Celar &
Mudnic, 2011). Data from the disk cache in the Xeon Phi processor can be transmitted faster
than data read directly from the drive itself.
(Vickovic, Celar & Mudnic, 2011)
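The check-cache-first behavior Englander describes can be sketched as follows. The class, the block numbering, and the four-block line size are illustrative assumptions, not an actual disk-driver interface.

```python
# Sketch of a disk cache: on a read, the cache is consulted first; on a
# miss, a cache line of several adjoining blocks is copied in from disk.

LINE_SIZE = 4  # adjoining disk blocks fetched per miss (assumed)

class DiskCache:
    def __init__(self, disk):
        self.disk = disk          # block number -> data
        self.cache = {}
        self.disk_accesses = 0

    def read(self, block):
        if block in self.cache:   # hit: no disk access necessary
            return self.cache[block]
        self.disk_accesses += 1   # miss: one physical disk access
        start = (block // LINE_SIZE) * LINE_SIZE
        for b in range(start, start + LINE_SIZE):  # move the whole line in
            if b in self.disk:
                self.cache[b] = self.disk[b]
        return self.cache[block]

disk = {b: f"data{b}" for b in range(16)}
dc = DiskCache(disk)
dc.read(5); dc.read(6); dc.read(7)   # blocks 4-7 loaded on the first miss
print(dc.disk_accesses)  # 1
```

Three reads, but only one disk access: the adjoining blocks came in with the first cache line, which is why frequently used data becomes promptly accessible.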
Very Long Instruction Word (VLIW)
A very long instruction word is a feature used in the Xeon Phi processor that allows
programs to execute efficiently. According to the Englander text, the main purpose of
this architecture is “to increase execution speed by processing instruction operations in parallel”
(Englander, 2009). VLIW consists of numerous execution units that enhance a program's
processing so it can run faster. Binu Mathew explains VLIW as “one particular style of processor design that tries to
achieve high levels of parallelism by executing long instruction words composed of multiple operations”
(Philips, 2008). A CPU that runs programs quickly and efficiently is therefore likely to use a Very Long
Instruction Word design.
VLIW can be illustrated by the Transmeta Crusoe, a processor built around this design.
The Transmeta Crusoe uses, as Englander explains, “a 128 bit
instruction word called molecule. The molecule is divided into four 32-bit atoms. Each atom represents
an operation similar to those of a normal 32-bit instruction word” (Englander, 2009). The diagram
below demonstrates the 128-bit instruction Englander explains in the text. Compared to the LMC, both
perform a fetch and execute cycle; each can add, load, branch on condition, and store numbers.
The four atoms in the instruction word collaborate to complete the execution cycle, and by using
parallelism, multiple operations proceed simultaneously.
(Englander, 2009).
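The molecule-to-atoms division Englander describes can be sketched with simple bit arithmetic. The specific bit layout (most significant atom first) is an assumption for illustration.

```python
# Sketch of splitting a Crusoe-style 128-bit "molecule" into four 32-bit
# "atoms". The ordering (most significant atom first) is an assumption.

def split_molecule(molecule_128):
    """Divide a 128-bit integer into four 32-bit atoms."""
    mask = (1 << 32) - 1
    return [(molecule_128 >> shift) & mask for shift in (96, 64, 32, 0)]

molecule = (0xAAAAAAAA << 96) | (0xBBBBBBBB << 64) | (0xCCCCCCCC << 32) | 0xDDDDDDDD
atoms = split_molecule(molecule)
print([hex(a) for a in atoms])
# ['0xaaaaaaaa', '0xbbbbbbbb', '0xcccccccc', '0xdddddddd']
```

Each of the four atoms then behaves like a normal 32-bit instruction word that an execution unit can process in parallel with the others.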
References
Pipeline
Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed., p.
253). Wiley.
Dutt, S. (2001). Pipeline basics: Lecture notes #14. Retrieved from
http://www.ece.uic.edu/~dutt/courses/ece366/lect-notes.html
Disk Cache
Vickovic, L., Celar, S., & Mudnic, E. (2011). Disk array simulation model development.
Retrieved from http://ehis.ebscohost.com.proxygsu-
sct1.galileo.usg.edu/eds/pdfviewer/pdfviewer?sid=1eb5e2fd-ad53-4fad-8dc2-
8357a74e92b8@sessionmgr14&vid=6&hid=101
Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed.,
p. 263). Wiley.
Very Long Instruction Word
Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed.,
p. 244). Wiley.
Philips. (2008). An introduction to very-long instruction word (vliw) computer architecture. Retrieved
from http://twins.ee.nctu.edu.tw/courses/ca_08/literature/11_vliw.pdf
IT5200 Kornchai Anujhun
Ring Bus
Ring bus is a substation switching arrangement that may consist of four, six, or
more breakers connected in a closed loop, with the same number of connection points.
Figure 1 depicts the layout of a ring bus configuration, which is an extension of the
sectionalized bus. In the ring bus a sectionalizing breaker has been added between the
two open bus ends. In other words, there is a closed loop on the bus with each section
separated by a circuit breaker. This provides greater reliability and allows for flexible
operation.
Figure 1 Ring bus
Figure 2 4-Breaker Ring Bus in ATI Graphic Card
USB
Universal Serial Bus, also known as USB, is a standard type of connection for
many different kinds of devices. Generally, USB refers to the types of cables and
connectors used to connect these many types of external devices to computers.
The Universal Serial Bus standard has been extremely successful. USB ports and cables
are used to connect hardware such as printers, scanners, keyboards, mice, flash drives,
external hard drives, joysticks, cameras, and more to computers of all kinds, including
desktops, tablets, and laptops.
In fact, USB has become so common that you'll find the connection available on
nearly any computer-like device such as video game consoles, home audio/visual
equipment, and even in many automobiles.
Many portable devices, like smartphones, eBook readers, and small tablets, use
USB primarily for charging. USB charging has become so common that it's now easy to
find replacement electrical outlets at home improvement stores with USB ports built in,
negating the need for a USB power adapter.
Figure 3 USB Connection
Memory Interleaving
Memory interleaving is a method to increase the speed of high-end
microprocessors. It is a memory access technique that divides the system memory into
a series of equal-sized banks. These banks are expressed in terms of n-way interleaving:
2-way interleaving uses two complete address buses, 4-way interleaving uses four
complete address buses, and 8-way interleaving uses eight complete address buses.
While one section is busy processing a word at a particular location, the other section
accesses the word at the next location.
Figure 4 2-way Interleaved Memory
In a 2-way interleaved memory system, there are two physical banks of DRAM,
but logically the system sees one bank of memory that is twice as large. In the interleaved
bank, the first long word of bank 0 is followed by the first long word of bank 1, which is
followed by the second long word of bank 0, which is followed by the second long word
of bank 1, and so on. Figure 4 shows this organization for two physical banks of N long
words. All even long words of the logical bank are located in physical bank 0 and all odd
long words are located in physical bank 1.
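The even/odd mapping described above can be sketched as a small address-translation function; the function name is illustrative.

```python
# Sketch of the address mapping in a 2-way interleaved memory: even long
# words live in physical bank 0, odd long words in physical bank 1.

def interleave_2way(logical_word):
    bank = logical_word % 2       # which physical bank holds the word
    offset = logical_word // 2    # position of the word within that bank
    return bank, offset

print(interleave_2way(0))  # (0, 0)  first long word of bank 0
print(interleave_2way(1))  # (1, 0)  first long word of bank 1
print(interleave_2way(2))  # (0, 1)  second long word of bank 0
```

Because consecutive logical words land in alternating banks, one bank can be servicing an access while the other begins the next one.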
References
Schuette, M. (2011, Jan 02). Intel’s Sandy Bridge I. Architecture&CPU Performance.
One Ring Bus to Master Them All. Retrieved from
http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view
&id=98&Itemid=1&limit=1&limitstart=6
Wikipedia. (2011, Dec). Network Topology. Retrieved from
http://en.wikipedia.org/wiki/Network_topology
Shimpi, A. (2010, Sep 14). Intel’s Sandy Bridge Architecture Exposed. Retrieved from
http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/4
Wikipedia. (2012, Dec). Universal Serial Bus. Retrieved from
http://en.wikipedia.org/wiki/Universal_Serial_Bus
PCMag. USB Definition. Retrieved from
http://www.pcmag.com/encyclopedia_term/0,2542,t%3DUSB&i%3D53531,00.as
p
Matloff, N. (2003, Nov 05). Memory Interleaving. Retrieved from
http://heather.cs.ucdavis.edu/~matloff/154A/PLN/Interleaving.pdf
ORNL Physics Division. Interleaved Memory. Retrieved from
http://www.phy.ornl.gov/csep/ca/node19.html
Execution Unit
An execution unit, also called a functional unit, is a part of the CPU that performs the
operations and calculations called for by the Branch Unit, which receives data from
the CPU. It may have its own internal control sequence unit, some registers, and
other internal units such as a sub-ALU. It is commonplace for modern CPUs to
have multiple execution units, referred to as a superscalar design. The simplest
arrangement is to use one unit, the bus manager, to manage the interface and the others to
perform calculations. Modern CPUs' execution units are usually pipelined. To
execute an instruction, the execution unit must first fetch the object code byte
from the instruction queue and then execute the instruction. If the queue is
empty when the execution unit is ready to fetch an instruction byte, the execution
unit waits for the Bus Interface Unit to fetch the instruction byte.
In a coprocessor, the instructions operate on data read from the coprocessor load
data queue. Data is written back, for example to memory or a register file, by
inserting the data into the out-of-order execution pipeline, either directly or via
the coprocessor store data queue, which writes back the data.
Ja’afar Bilal
IT5200: Definitions Assignment
Class: Tuesday & Thursday
Due: February 17, 2013
Multi-core Processor: A computer chip that contains two or more cores within the same
framework; each core can also be referred to as a central processing unit (CPU), which reads and
executes instructions from programs.
Within any computer system, a processor is considered to be the “brain” of the
computer. Figure 7.1 from the textbook highlights four main components within the CPU:
the Arithmetic Logic Unit (ALU), Control Unit (CU), registers, and an Input/Output
(I/O) interface.1 The ALU performs arithmetic and logic operations that calculate and manipulate
bits represented using binary code. The CU controls the instructions and data flow within
the CPU. The memory within the CPU consists of registers (not shown in Figure 7.1), which
store binary code used for computation. Finally, the I/O interface allows for the transfer of data
to peripheral devices.
The processor controls the computer by fetching and executing program instructions. In
brief, to execute an instruction, the Program Counter (PC), located within the CU, sends
the address of the instruction it holds to the Memory Address Register (MAR). The MAR
then sends that address over the address bus to the RAM. The RAM puts the
instruction on the data bus, which delivers it to the Memory Data Register (MDR). The MDR
holds the instruction, which is copied over to the Instruction Register (IR). Now the instruction is
decoded and executed, and the PC is incremented to point to the next instruction.
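The fetch steps above (PC to MAR, across the buses to the MDR, then into the IR) can be sketched with plain variables standing in for the registers. The instruction contents are made up for illustration.

```python
# Sketch of the fetch sequence: PC -> MAR -> RAM -> MDR -> IR.
# The RAM contents and instruction mnemonics are invented examples.

ram = {0: "LOAD 5", 1: "ADD 3", 2: "STORE 7"}
pc = 0

def fetch(pc):
    mar = pc           # PC sends the instruction's address to the MAR
    data = ram[mar]    # address bus -> RAM -> data bus
    mdr = data         # the RAM's reply lands in the MDR
    ir = mdr           # the instruction is copied into the IR for decoding
    return ir, pc + 1  # the PC advances to the next instruction

ir, pc = fetch(pc)
print(ir)  # LOAD 5
```

After the fetch, the control unit would decode and execute the contents of the IR while the PC already points at the next instruction.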
In a multi-core processor, each core consists of the same architecture. Each core has
access to the same input and output interface, programs, and memory.1 There is a wide range of
multi-core processors used in computer systems, including dual-core (2), quad-core (4), hexa-core (6),
and octa-core (8). See Figure 1.1 below. With multi-core processors, the workload is distributed
among the cores, improving the overall speed and efficiency of the computer system; it also allows
the computer to run multiple programs with ease. For example, if a computer has a quad-core
processor, it would be able to simultaneously run programs such as uTorrent, Skype, and
Microsoft Word, all while the anti-virus system is running in the background.
Technology companies such as Intel produce multi-core processors that contain 50 or
more cores within its framework; the Xeon Phi Knights Corner coprocessor is such a device.
Large businesses and companies such as CNN or Google may use Knights Corner to manage
large amounts of information and data within their server center.
1. The Architecture of Computer Hardware, Systems Software, & Networking. Englander,
Irv. Pg: 265.
Direct Memory Access (DMA): Allows devices to engage in block data transfer by directly
accessing the computer's main memory, bypassing the CPU. It is important to know that
DMA is the third method of data transfer; the first is programmed I/O and the second is interrupt-driven I/O.
The Xeon Phi Knights Corner coprocessor allows for DMA. DMA is designed to ease the
workload of the CPU during data transfer. This results in faster CPU processing and allows for
an easier method of accessing the main memory.
The only time the CPU is involved in DMA is when it initiates and ends the data transfer. During
DMA, the DMA Controller takes command until the CPU is interrupted when the data transfer is
complete.1 The DMA Controller acts as the I/O module because it is an interface between the CPU
and the I/O device.
These three conditions must be met for DMA to occur2:
1. The I/O device and the main memory must have a line of communication, which will be a bus.
2. The DMA Controller must read and write to the main memory.
3. The DMA Controller and CPU must avoid conflict.
In order for the DMA Controller to control the data transfer, it must know2:
1. The data location of I/O device.
2. The location of data in main memory.
3. The size of the data transferred.
4. The direction of transfer: from the I/O device to memory, or from memory to the I/O device.
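The four pieces of information above can be sketched as a simple transfer descriptor. The field names and the run_dma helper are hypothetical, not an actual Intel interface.

```python
# Sketch of a DMA transfer descriptor holding the four items a DMA
# controller must know. All names here are illustrative inventions.

from dataclasses import dataclass

@dataclass
class DMADescriptor:
    device_address: int   # 1. data location on the I/O device
    memory_address: int   # 2. location of the data in main memory
    length: int           # 3. size of the transfer, in bytes
    to_memory: bool       # 4. direction: device->memory or memory->device

def run_dma(desc, device, memory):
    """Copy bytes without CPU involvement (the CPU only built `desc`)."""
    if desc.to_memory:
        for i in range(desc.length):
            memory[desc.memory_address + i] = device[desc.device_address + i]
    else:
        for i in range(desc.length):
            device[desc.device_address + i] = memory[desc.memory_address + i]

device = bytearray(b"hello world")
memory = bytearray(16)
run_dma(DMADescriptor(0, 4, 5, True), device, memory)
print(memory[4:9])  # bytearray(b'hello')
```

In hardware the copy loop is what the controller performs on its own; the CPU's role ends once the descriptor is handed over, and it is interrupted only when the transfer completes.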
1. “Direct Memory Access (DMA),” www.ece.ubc.ca/~edc/379.jan99/lectures/lec13.pdf,
Date Accessed: February 13, 2013
2. The Architecture of Computer Hardware, Systems Software, & Networking. Englander,
Irv. Pg: 298-299.
Control Line: A type of conductor that is associated with buses.
The Xeon Phi Knights Corner coprocessor is connected to other parts of the computer
system (i.e., main memory, I/O modules, etc.) through what is known as a bus. A bus acts as a
“data highway” that makes data transfer possible. Within a bus, there are several conductors
that carry electrical signals; the signals represent bits of data. The control line is one of these
conductors; the three other types of conductors are the address, data, and power lines.
Control lines within a bus carry the read and write instructions for the data transfer and
specify the number of bytes to be transferred. A control line is necessary because many
different components communicate using the same bus.1
1. The Architecture of Computer Hardware, Systems Software, & Networking. Englander,
Irv. Pg: 214-215.
Dean Griffiths
IT5200 – Intro Platforms & OS
Xeon Phi Definition
Mask register
The Interrupt Mask Register, or mask register, is a read/write register within the Xeon Phi
coprocessor. It enables or masks interrupts from being triggered on the external pins of the cache
controller and selects the appropriate address depending upon which interrupt controller is to
be used. The register is an eight-bit register that lets you individually enable and disable
interrupts from devices on the system. Writing a zero to the corresponding bit enables that
device's interrupts; writing a one disables interrupts from the affected device.
Instruction set architecture (ISA)
The ISA, also called the CPU architecture, is the part of the computer architecture that forms a well-
designed hardware/software interface. Its characteristics include the number and types of
registers, the methods of addressing memory, and the basic design and layout of the instruction set. It
describes the operations, modes, and storage locations supported by the hardware, plus how to invoke
and access them. The ISA includes a specification of the machine language and the
commands implemented by a particular processor (such as the Xeon Phi). Though the ISA
interconnects hardware and software, it is part of the Application Binary
Interface (ABI), which provides a program with access to the hardware resources and services
available in a system.
Scalar processing
Scalar processing represents a class of computer processors that process one data item at a time,
typically an integer or floating-point number; this is also classified as “single instruction stream,
single data stream” (SISD).
The diagram shows the difference between scalar and superscalar processing. Where does this fit in
the Xeon Phi? With scalar execution, a single execution unit is used; provided that
different branch conditions are executed simultaneously, the CPU can average an
instruction execution rate approximately equal to the clock speed of the machine. With
multiple execution units (superscalar), it is possible to process instructions at an average
rate of more than one instruction per clock cycle.
References:
The Art of Assembly Language. (2006). Programming. Interrupts Traps and Exceptions (Part 3):
http://www.oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_17/CH17-
3.html
Irv Englander. (2009). The Architecture of Computer Hardware, Systems Software, and
Networking. Fourth edition
Wikipedia. Scalar processor: http://en.wikipedia.org/wiki/Scalar_processor
loc-nguyen. Intel® Xeon Phi™ Coprocessor Developer's Quick Start Guide:
http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide
ARM. The Architecture for the Digital World. 3.3.10. Register 2, Interrupt Mask Register:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0329l/Beiihfcc.html
Allan Snavely. Instruction Set Architecture: http://www.sdsc.edu/~allans/cs141/L2.ISA.pdf
Wikipedia. Instruction set: http://en.wikipedia.org/wiki/Instruction_set
Luke Varner
February 17, 2012
Xeon Phi Terminology
Cluster Environment
A cluster takes the classic definition of the word and simply applies it to computers.
Simply put, the book describes a cluster as a collection of "loosely coupled computers configured
to work together as a unit" (Englander, 2009). The concept of a cluster environment encourages
combining your computing power to distribute a workload across multiple devices. The
individual computers within the cluster are referred to as nodes and can be used to increase
overall processing power almost indefinitely. Cluster environment computing is primarily
"used for computation-intensive purposes, rather than handling I/O-oriented operations such as
web service or databases"("Computer cluster").
In the context of the Xeon Phi, clustering is intended for uses in supercomputing, where
grouping individual Xeon Phi coprocessors together achieves superior processing power
while reducing power consumption and energy costs. The latest commercially available version
of the Xeon Phi has 60 cores. Using just 10 individual Xeon Phi nodes together in a cluster
environment can increase processing power tremendously while taking up minimal space. With
just ten cluster units you would be harnessing the power of 600 cores! In fact, you
could describe the Xeon Phi processor as a cluster within a cluster due to the number of cores it
possesses. In the figure below we can see how the Xeon Phi could be used in a cluster
environment: Figure A shows a Xeon Phi processor on its own, and Figure B shows how it
could be combined with other processors.
Xeon Phi in a Cluster Environment
[Intel Phi Inner] Retrieved from: http://www.amax.com/hpc/images/intel_phi_inner.jpg
Hit Ratio
The hit ratio, also commonly known as the hit rate, is used in measuring cache performance
within a CPU. The hit ratio is the proportion of records found in the cache when executing a particular
process out of the total number of records requested ("Hit rate"). A high hit ratio (>90%) is
equated with better CPU cache performance, whereas a lower hit ratio can indicate problems with
your system configuration. When a miss is recorded, it leads to a stall in the execution of
instructions, which produces visible effects on performance.
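The ratio itself is simple arithmetic, sketched below; the sample hit and miss counts are made up for illustration.

```python
# Sketch of the hit-ratio calculation: hits divided by total accesses.
# A ratio above roughly 90% generally indicates the cache is effective.

def hit_ratio(hits, misses):
    total = hits + misses
    return hits / total if total else 0.0

print(hit_ratio(95, 5))   # 0.95 -- healthy cache behaviour
print(hit_ratio(60, 40))  # 0.6  -- many stalls while misses are serviced
```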
The Xeon Phi product description of cache performance mentions certain features
designed to reduce the number of misses, which effectively keeps the hit ratio high. It
works by recognizing when a miss has occurred and then querying tag directories in order to locate the
data and return it to the correct location. Because the Xeon Phi possesses a large
number of individual cores, in order to reduce the number of cache misses it has the ability to
access all of the individual cores to locate the missing data. These features in the Xeon
Phi reduce data redundancy and illustrate an intuitive piece of technology that can recognize and
fix its own mistakes. In the figure below, part A demonstrates how the L2 caches are all
interconnected so that requested data can be tracked on each core. Part B shows how the tag
data is searched to find the relevant data before moving on to the next core cache. Using this
feature should result in a high hit ratio for all of the cores and the system as a whole.
Xeon Phi Cache and Hit Ratio Performance
[Distributed Tag Directories] Retrieved from: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-
codename-knights-corner
Write Through
Write through is a process for storing data in which information is written to
both the main memory and the cache at the same time. This method allows for quick retrieval of
information and ensures the integrity of the data against power outages or system downtime. On
the other hand, due to the redundant nature of having to write to two locations before executing
another step, a sacrifice in system speed must be made (Gibilisco & Rouse,
2012). The write-through method emphasizes data consistency and integrity over speed.
The system developer's guide for the Xeon Phi states that only
uncacheable and write-back methods can be used, and that "the other three memory forms
[write-through, write-combining, and write-protect] are mapped internally to microcontroller
behavior" (Intel Corp, 2012). What is being described here is that those three memory
forms are automatically controlled by each of the 60 individual cores in a Xeon Phi coprocessor
and cannot be altered or changed. In the diagram below, you could say that A represents one of
the cores in the Xeon Phi; it shows the CPU and cache. Using the write-through method, the
data is written to the cache but also to the RAM, which is illustrated in part B.
Write Through in a Xeon Phi Core
[Write-through cache] Retrieved from: http://www.brainbell.com/tutors/A+/Hardware/Cache_Memory.htm
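The write-through policy described above can be sketched as a tiny cache model; the class name and addresses are illustrative only.

```python
# Sketch of write-through: every store updates both the cache and main
# memory in the same operation, so memory is never stale.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def write(self, addr, value):
        self.cache[addr] = value    # update the cache...
        self.memory[addr] = value   # ...and main memory immediately

mem = {}
c = WriteThroughCache(mem)
c.write(0x10, 42)
print(mem[0x10])  # 42 -- memory already consistent with the cache
```

The double update is exactly the redundancy the text mentions: consistency is guaranteed at the cost of performing two writes per store.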
Sources
Cache memory. (n.d.). Retrieved from http://www.brainbell.com/tutors/A+/Hardware/Cache_Memory.htm
Chrysos, G. (n.d.). Intel® xeon phi™ coprocessor (codename knights corner). Retrieved from
http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
Computer cluster. (n.d.). Retrieved from http://en.wikipedia.org/wiki/Computer_cluster
Englander, I. (2009). The architecture of computer hardware, systems software, & networking.
(4th ed.). Hoboken, NJ: John Wiley & Sons Inc.
Gibilisco, S., & Rouse, M. (2012, July). Write through. Retrieved from
http://whatis.techtarget.com/definition/write-through
Hit rate. (n.d.). Retrieved from http://www.answers.com/topic/hit-rate
Intel Corp. (2012, November 8). Intel® xeon phi™ coprocessor system software developers
guide. Retrieved from
Intel phi inner. (n.d.). Retrieved from http://www.amax.com/hpc/images/intel_phi_inner.jpg
IT 5200 Platforms & OS Lloyd Middlebrooks Xeon Phi Definitions Assignment February 4, 2013
Xeon Phi Definitions
Vector Processing Unit (VPU)

The VPU executes instructions that take the form of vectors. Vectors are one-dimensional arrays of data. In a vector processor, a single instruction operates simultaneously on multiple data items (arrays), whereas a scalar processor processes a single data element (integers and floating-point numbers) at a time. VPUs are used to perform essential numerical calculations associated with High Performance Computing (HPC).

Within the Xeon processor is the Xeon Phi coprocessor. The VPU is located inside each core of the coprocessor, as shown in Figure 1. The Xeon Phi's VPU features a 512-bit SIMD (Single Instruction, Multiple Data) instruction set and can execute 8 double-precision or 16 single-precision operations per cycle in its Arithmetic Logic Unit (ALU), as shown in Figure 2. The VPU benefits the Xeon processor by reducing the amount of fetch and decode operations that can incur higher energy costs and increase bandwidth.
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 1
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 2
Clock

The internal clock refers to an internal signal that operates at a constant frequency in a microprocessor and regulates the rate at which instructions are executed. Essentially, the clock controls the time when each step in the instruction cycle takes place and synchronizes various computer components. The clock rate is usually measured in megahertz (MHz) or gigahertz (GHz). The Xeon Phi has a clock speed of approximately 1.1 GHz, meaning that it can process 1.1 billion clock cycles per second. A unique characteristic of the Xeon Phi is that it has a gated clock. This means that when the Xeon processor's workload does not demand assistance from the Xeon Phi coprocessor (all four threads on a core are halted), the clock is gated, shown in red within Figure 3. Subsequently, the core is powered down after the programmed amount of time. This preserves power and ensures the Xeon processor is operating efficiently.
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 3
Thread

Englander (2009) defines a thread as an individually executable part of a process (executable program). Although threads can be scheduled to run separately from other threads in the same process, they can also share memory and other resources with those associated threads. Each core of the Xeon Phi coprocessor can support four threads in hardware; this is referred to as multithreading. During multithreading, the coprocessor often switches between threads (context switching). The threads are passed over the bus highlighted between the red lines in Figure 4.
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 4

Sources:
Clock Speed. (2013). Retrieved February 17, 2013, from http://www.webopedia.com/TERM/C/clock_speed.html
Englander, Irv. (2009). The Architecture of Computer Hardware, Systems Software & Networking. Process Control Management (p. 493). Hoboken: John Wiley & Sons, Inc.
Gilbert, H. (2004, December 22). Clock Speed: Tell Me When It Hurts. Retrieved February 17, 2013, from http://www.yale.edu/pclt/PCHW/clockidea.htm
Intel Xeon Phi Coprocessor (codename Knights Corner). Retrieved February 17, 2013, from http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
Pratyusa Manadhata, Vyas Sekar. (2013). Vector Processors. Carnegie Mellon University. Retrieved February 17, 2013, from http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/vector.pdf
Product and Performance Information. Retrieved February 17, 2013, from http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
Thread. (2013). Retrieved February 17, 2013, from http://en.wikipedia.org/wiki/Thread_(computing)
Vector Processor. (2013). Retrieved February 17, 2013, from http://en.wikipedia.org/wiki/Vector_processor
SIMD
Single Instruction, Multiple Data is one of the classifications defined by Flynn. It is
based upon the number of concurrent instruction and data streams available in
the architecture. A SIMD computer exploits multiple data streams against a single
instruction stream to perform operations which may be naturally parallelized. An
array processor or GPU is a good example. The first use of SIMD instructions was
in vector supercomputers. Today, most commodity CPUs implement
architectures that feature instructions for a form of vector processing on multiple
data sets, typically known as SIMD.
The Xeon Phi coprocessor has 512-bit registers for SIMD operations, considered
one of its key features. High-performance codes on the Xeon Phi coprocessor
utilize these wide SIMD instructions to extract the desired performance level. The
best performance is achieved when the number of cores, threads, and SIMD
or vector operations are used effectively. The best method to take advantage of
the 512-bit wide SIMD instructions is to write code in an array notation style. When
array notation is used, the compiler will utilize the SIMD instruction set.
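The one-instruction-many-lanes idea can be sketched in plain Python, modeling a 512-bit register as 16 single-precision lanes. This only illustrates the concept; it is not how the compiler's array notation actually works.

```python
# Sketch of SIMD: one "instruction" (here, add) applied across all lanes
# of a wide register at once. A 512-bit register holds sixteen 32-bit
# single-precision lanes, modelled here as 16-element lists.

LANES = 16  # 512 bits / 32-bit single precision

def simd_add(a, b):
    """One vector 'instruction' producing all 16 results in a single step."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

a = list(range(16))
b = [10.0] * 16
print(simd_add(a, b)[:4])  # [10.0, 11.0, 12.0, 13.0]
```

A scalar (SISD) processor would need sixteen separate add instructions for the same result, which is where the bandwidth and fetch/decode savings come from.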
SMP
Symmetric Multiprocessing involves a multiprocessor computer hardware
architecture where two or more identical processors are connected to a single
shared Main Memory (MM) and are controlled by a single operating system. Most
common multiprocessor systems today use the SMP architecture. Usually each
processor has an associated private high-speed memory, known as cache memory,
to speed up MM data access and to reduce system bus traffic. Processors
may be interconnected using buses, crossbar switches, or an on-chip mesh network.
SMP systems allow any processor to work on any task no matter where the data
for that task is located in memory, provided that the same task is not in
execution on two processors at the same time. With proper operating system
support, SMP systems can easily move tasks between processors to balance the
workload efficiently. Software programs have been developed to schedule jobs so
that processor utilization reaches its maximum potential.
The Xeon Phi coprocessor runs Linux; it really is an x86 SMP-on-a-chip running Linux.
Every card has its own IP address. vSMP Foundation supports the Xeon Phi
coprocessor. This revolutionary coprocessor is based on Intel architecture and is
designed for highly parallel workloads. It supports the Xeon Phi coprocessor in
two modes: processor-virtualized mode and coprocessor-aggregated mode.
Technology Definitions
GDDR Memory
GDDR memory is used primarily for graphics processing and is a part of the
circuitry on a graphics board (PCI Express). GDDR3 and GDDR4 are
the two current standards in use and are differentiated by performance and price,
with GDDR4 being the higher performer along with the higher price. This is not
related in any way to the numbering of processor storage.
AMD has developed a new graphics memory chip named GDDR5, keeping in line
with its last two offerings. GDDR5 has twice the speed of GDDR3 due to a wider
bandwidth. It was originally developed by AMD for use with graphics processing,
giving more speed and higher quality. As you can see from the screen shot below,
Intel not only uses it for traditional graphics processing but has created a high-speed
PCI ring interconnect to use for connecting each coprocessor and each core directly.
References:
Intel: Intel® Xeon Phi™ Coprocessor
AMD Corp.
ExtremeTech: GDDR5 Memory – Under the Hood
Branch History Table
A branch prediction circuit was developed by HP to try to predict which way an IF-
THEN-ELSE branch will go prior to its execution. The circuit fetches the instruction
it predicts before the location counter actually points there, and that instruction is
"speculatively executed". If the choice turns out to be wrong, the speculative work is
discarded and the correct instruction is put back into the pipeline. The results of
these operations are kept in the Branch History Table (frequently called something
similar, but not necessarily the same). The larger and more populated the table, the
better the hit ratio of predictions.
I did not find a reference to this in the Xeon Phi documentation, but Intel probably has
something similar, as the article leads one to believe that this circuit is very commonly
used. I found another reference in an IBM publication which referred to the whole
process as the Branch History Table; it is probably the best source to read if you have
further interest.
Sources
Wikipedia: Branch predictor
T. J. Watson Research Center: Branch History Table Prediction of Moving Target
Branches Due to Subroutine Returns
Write-Back
This is a performance technique where, when data is updated, the most current copy is updated in cache rather than in main memory; the modified line is written back to main memory only later, when it is evicted. It is used when the risk of losing current data does not outweigh the gains in performance. This can be contrasted with the "write-through" technique, where the data is updated in both cache and main memory. Write-through is obviously safer but also significantly slower.
Sources
Microsoft: Write Back - Part of the Backup and recovery glossary
Xeon Phi Definitions
Extended Math Unit (EMU): the EMU is a shared function that handles more
complex math operations on behalf of several Execution Units (EUs). It provides a
faster implementation of the single-precision transcendental functions, such as
reciprocal, reciprocal square root, base-2 logarithm, and base-2 exponential, using
lookup tables, and it performs these operations inside the vector processing unit. It
is implemented in hardware and can achieve a high throughput of one or two cycles
for other transcendental functions that can be derived from these elementary
functions. In the Xeon Phi coprocessor it performs the same role: it allows these
operations to be done in a vector fashion with high bandwidth, and it also calculates
the polynomial approximations of these functions. For example, the figure below
shows its operation.
Function      Math       Throughput (cycles)
Reciprocal    1/x        1
Recip. Sqrt   1/√x       1
Exp2          2^x        2
Log2          log₂ x     1
Figure 1: A simple EMU Calculation
References
http://software.intel.com/en-us/articles/achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
http://software.intel.com/en-us/articles/case-study-achieving-superior-performance-on-black-scholes-valuation-computing-using
Instruction Unit: implements the basic instruction pipeline, fetching instructions from
the memory subsystem, dispatching them to available execution units, and maintaining a
state history to ensure that the operations finish in order. It is also used in executing
conditional branch and unconditional jump instructions. The Instruction Unit is part of
the Control Unit, which in turn is part of the CPU; it handles all the preparation of
instructions for execution and is responsible for organizing the program instructions
that are to be fetched and executed in an appropriate order. The instruction unit
performs these operations in both mainstream Intel processors and the Xeon Phi
coprocessor.
Figure 2: Shows how the Instruction Unit operates
This diagram was retrieved from http://openrisc.net/or1200-spec.html
References
http://openrisc.net/or1200-spec.html
http://en.wikipedia.org/wiki/Instruction_unit
Miss: this term is used with cache memory; the event is usually called a cache miss.
A miss is a situation in which the requested data is not already present in cache
memory, i.e., the processor is accessing a memory location that is not in the cache.
When this happens, the processor must wait for the data to be fetched from the next
cache level or from main memory before continuing its execution; for this reason,
cache misses directly influence an application's performance. In the Xeon Phi, a
cache miss causes the core to generate an address request on the address ring,
which then queries the tag directories. If no data is found in the tag directories, the
core has to generate another address request and query memory for the data. The
diagram below shows the miss rate versus cache size on the integer portion of
SPEC CPU2000.
Figure 3: Miss Rate versus cache size on the Integer portion of SPEC CPU2000
This diagram was retrieved from http://en.wikipedia.org/wiki/CPU_cache#Cache_miss
References
http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/miss_ratio.html