Jennifer Moore
Pipeline
Pipelining is a technique used in the Xeon Phi processor that divides the fetch and
execute cycle into several steps. Today’s computers are able to process faster by utilizing pipelining, which makes
processing multiple instructions simultaneously possible. Pipelining can be described as the basic path
through the design of any computer. Shantanu Dutt explains a pipeline as a “concept in which the entire
processing flow is broken up into multiple stages, and a new data/instruction is processed by a stage
potentially as soon as it is done with the current data/instruction, which then goes onto the next stage
for further processing” (Dutt, 2001). Pipelining is similar to the Very Long Instruction Word approach in
that both use parallelism: different steps work together in parallel, so the user can have more than one
instruction in progress at the same time.
These steps describe how a fetch and execute cycle is carried out. For example, each step can be
compared to the instructions given to the Little Man Computer so it can execute an operation. Each of
these steps is timed and can introduce a delay depending on its result (Englander, 2009). Pipelining is
similar to the Little Man Computer; however, the LMC uses different operation codes to determine the
outcome of a number input by the user, whereas pipelining sets up a path that requests data on a timed
basis (Englander, 2009).
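The stage-by-stage flow described above can be sketched in a few lines of Python. The four stage names here are generic textbook stages, not the Xeon Phi's actual pipeline, and the timeline model assumes an ideal pipeline with no stalls.

```python
# A minimal sketch of pipelined instruction flow (illustrative only; the
# stage names are assumptions, not the Xeon Phi's real pipeline stages).

STAGES = ["fetch", "decode", "execute", "store"]

def pipeline_timeline(instructions):
    """Return {cycle: [(stage, instruction), ...]} for an ideal pipeline
    with no stalls: each instruction advances one stage per cycle."""
    timeline = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            timeline.setdefault(i + s, []).append((stage, instr))
    return timeline

timeline = pipeline_timeline(["LOAD", "ADD", "STORE"])
# With 4 stages and 3 instructions, cycles run 0..5: 4 + 3 - 1 = 6 in total.
print(len(timeline))  # 6
```

Note that on cycle 2 all three instructions are in flight at once, one per stage, which is exactly the simultaneity the Dutt quotation describes.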
Disk Cache
A disk cache is made up of main memory or memory integrated within most newer disk
drives. The disk cache makes it possible to access information from the disk faster by storing frequently
used data in temporary memory where it is promptly accessible. Englander explains, “when a disk read or
write request is made, the system checks the disk cache first. If the required data is present, no disk
access is necessary; otherwise, a disk cache line made up of several adjoining disk blocks is moved from
the disk into the disk cache area of memory” (Englander, 2009). Caching allows the system to
temporarily store commonly used data where it can be quickly retrieved without accessing the disk.
The diagram below shows the server accessing data from the disk cache. If the server
finds the requested data in the disk cache, it does not have to access the disk. When the data is
accepted and stored, the Vickovic, Celar, and Mudnic article explains that “when the request is stored, the
amount of free space on Disk Cache is decreased and it is pushed on cache queue” (Vickovic, Celar &
Mudnic, 2011). Data from the disk cache in the Xeon Phi processor can be transmitted faster
than data read directly from the drive itself.
(Vickovic, Celar & Mudnic, 2011)
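The check-cache-first behavior Englander describes can be sketched as follows. The class, the block numbering, and the four-block line size are illustrative assumptions, not an actual disk-driver interface.

```python
# Sketch of a disk cache: on a read, the cache is consulted first; on a
# miss, a cache line of several adjoining blocks is copied in from disk.

LINE_SIZE = 4  # adjoining disk blocks fetched per miss (assumed)

class DiskCache:
    def __init__(self, disk):
        self.disk = disk          # block number -> data
        self.cache = {}
        self.disk_accesses = 0

    def read(self, block):
        if block in self.cache:   # hit: no disk access necessary
            return self.cache[block]
        self.disk_accesses += 1   # miss: one physical disk access
        start = (block // LINE_SIZE) * LINE_SIZE
        for b in range(start, start + LINE_SIZE):  # move the whole line in
            if b in self.disk:
                self.cache[b] = self.disk[b]
        return self.cache[block]

disk = {b: f"data{b}" for b in range(16)}
dc = DiskCache(disk)
dc.read(5); dc.read(6); dc.read(7)   # blocks 4-7 loaded on the first miss
print(dc.disk_accesses)  # 1
```

Three reads, but only one disk access: the adjoining blocks came in with the first cache line, which is why frequently used data becomes promptly accessible.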
Very Long Instruction Word (VLIW)
A very long instruction word is a feature used in the Xeon Phi processor that allows
programs to execute efficiently. According to the Englander text, the main purpose of
this architecture is “to increase execution speed by processing instruction operations in parallel”
(Englander, 2009). VLIW consists of numerous execution units that enhance a program's
processing so it can run faster. Binu Mathew explains VLIW as “one particular style of processor design that tries to
achieve high levels of parallelism by executing long instruction words composed of multiple operations”
(Philips, 2008). A CPU that runs programs quickly and efficiently is therefore likely to use a Very Long
Instruction Word design.
VLIW can be illustrated by the Transmeta Crusoe, a processor built around this design.
The Transmeta Crusoe uses, as Englander explains, “a 128 bit
instruction word called molecule. The molecule is divided into four 32-bit atoms. Each atom represents
an operation similar to those of a normal 32-bit instruction word” (Englander, 2009). The diagram
below demonstrates the 128-bit instruction Englander explains in the text. Compared to the LMC, both
perform a fetch and execute cycle; each can add, load, branch on condition, and store numbers.
The four atoms in the instruction word collaborate to complete the execution cycle, and by using
parallelism, multiple operations proceed simultaneously.
(Englander, 2009).
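The molecule-to-atoms division Englander describes can be sketched with simple bit arithmetic. The specific bit layout (most significant atom first) is an assumption for illustration.

```python
# Sketch of splitting a Crusoe-style 128-bit "molecule" into four 32-bit
# "atoms". The ordering (most significant atom first) is an assumption.

def split_molecule(molecule_128):
    """Divide a 128-bit integer into four 32-bit atoms."""
    mask = (1 << 32) - 1
    return [(molecule_128 >> shift) & mask for shift in (96, 64, 32, 0)]

molecule = (0xAAAAAAAA << 96) | (0xBBBBBBBB << 64) | (0xCCCCCCCC << 32) | 0xDDDDDDDD
atoms = split_molecule(molecule)
print([hex(a) for a in atoms])
# ['0xaaaaaaaa', '0xbbbbbbbb', '0xcccccccc', '0xdddddddd']
```

Each of the four atoms then behaves like a normal 32-bit instruction word that an execution unit can process in parallel with the others.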
References
Pipeline
Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed., p.
253). Wiley.
Dutt, S. (2001). Pipeline basics: Lecture notes #14. Retrieved from
http://www.ece.uic.edu/~dutt/courses/ece366/lect-notes.html
Disk Cache
Vickovic, L., Celar, S., & Mudnic, E. (2011). Disk array simulation model development.
Retrieved from http://ehis.ebscohost.com.proxygsu-
sct1.galileo.usg.edu/eds/pdfviewer/pdfviewer?sid=1eb5e2fd-ad53-4fad-8dc2-
8357a74e92b8@sessionmgr14&vid=6&hid=101
Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed.,
p. 263). Wiley.
Very Long Instruction Word
Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed.,
p. 244). Wiley.
Philips. (2008). An introduction to very-long instruction word (vliw) computer architecture. Retrieved
from http://twins.ee.nctu.edu.tw/courses/ca_08/literature/11_vliw.pdf
IT5200 Kornchai Anujhun
Ring Bus
Ring bus is a substation switching arrangement that may consist of four, six, or
more breakers connected in a closed loop, with the same number of connection points.
Figure 1 depicts the layout of a ring bus configuration, which is an extension of the
sectionalized bus. In the ring bus a sectionalizing breaker has been added between the
two open bus ends. In other words, there is a closed loop on the bus with each section
separated by a circuit breaker. This provides greater reliability and allows for flexible
operation.
Figure 1 Ring bus
Figure 2 4-Breaker Ring Bus in ATI Graphic Card
USB
Universal Serial Bus, also known as USB, is a standard type of connection for
many different kinds of devices. Generally, USB refers to the types of cables and
connectors used to connect these many types of external devices to computers.
The Universal Serial Bus standard has been extremely successful. USB ports and cables
are used to connect hardware such as printers, scanners, keyboards, mice, flash drives,
external hard drives, joysticks, cameras, and more to computers of all kinds, including
desktops, tablets, and laptops.
In fact, USB has become so common that you'll find the connection available on
nearly any computer-like device such as video game consoles, home audio/visual
equipment, and even in many automobiles.
Many portable devices, like smartphones, eBook readers, and small tablets, use
USB primarily for charging. USB charging has become so common that it's now easy to
find replacement electrical outlets at home improvement stores with USB ports built in,
negating the need for a USB power adapter.
Figure 3 USB Connection
Memory Interleaving
Memory interleaving is a method to increase the speed of high-end
microprocessors. It is a memory access technique that divides the system memory into
a series of equal-sized banks. These banks are expressed in terms of n-way interleaving:
2-way interleaving uses two complete address buses, 4-way interleaving uses four
complete address buses, and 8-way interleaving uses eight complete address buses.
While one section is busy processing a word at a particular location, the other section
accesses the word at the next location.
Figure 4 2-way Interleaved Memory
In a 2-way interleaved memory system, there are two physical banks of DRAM,
but logically the system sees one bank of memory that is twice as large. In the interleaved
bank, the first long word of bank 0 is followed by the first long word of bank 1, which is
followed by the second long word of bank 0, which is followed by the second long word
of bank 1, and so on. Figure 4 shows this organization for two physical banks of N long
words. All even long words of the logical bank are located in physical bank 0 and all odd
long words are located in physical bank 1.
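The even/odd mapping described above can be sketched as a small address-translation function; the function name is illustrative.

```python
# Sketch of the address mapping in a 2-way interleaved memory: even long
# words live in physical bank 0, odd long words in physical bank 1.

def interleave_2way(logical_word):
    bank = logical_word % 2       # which physical bank holds the word
    offset = logical_word // 2    # position of the word within that bank
    return bank, offset

print(interleave_2way(0))  # (0, 0)  first long word of bank 0
print(interleave_2way(1))  # (1, 0)  first long word of bank 1
print(interleave_2way(2))  # (0, 1)  second long word of bank 0
```

Because consecutive logical words land in alternating banks, one bank can be servicing an access while the other begins the next one.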
References
Schuette, M. (2011, Jan 02). Intel’s Sandy Bridge I. Architecture&CPU Performance.
One Ring Bus to Master Them All. Retrieved from
http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view
&id=98&Itemid=1&limit=1&limitstart=6
Wikipedia. (2011, Dec). Network Topology. Retrieved from
http://en.wikipedia.org/wiki/Network_topology
Shimpi, A. (2010, Sep 14). Intel’s Sandy Bridge Architecture Exposed. Retrieved from
http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/4
Wikipedia. (2012, Dec). Universal Serial Bus. Retrieved from
http://en.wikipedia.org/wiki/Universal_Serial_Bus
PCMag. USB Definition. Retrieved from
http://www.pcmag.com/encyclopedia_term/0,2542,t%3DUSB&i%3D53531,00.as
p
Matloff, N. (2003, Nov 05). Memory Interleaving. Retrieved from
http://heather.cs.ucdavis.edu/~matloff/154A/PLN/Interleaving.pdf
ORNL Physics Division. Interleaved Memory. Retrieved from
http://www.phy.ornl.gov/csep/ca/node19.html
Execution Unit
An execution unit, also called a functional unit, is a part of the CPU that performs the
operations and calculations called for by the Branch Unit, which receives data from
the CPU. It may have its own internal control sequence unit, some registers, and
other internal units such as a sub-ALU. It is commonplace for modern CPUs to
have multiple execution units, referred to as a superscalar design. The simplest
arrangement is to use one unit, the bus manager, to manage the interface and the others to
perform calculations. Modern CPUs' execution units are usually pipelined. To
execute an instruction, the execution unit must first fetch the object code byte
from the instruction queue and then execute the instruction. If the queue is
empty when the execution unit is ready to fetch an instruction byte, the execution
unit waits for the Bus Interface Unit to fetch the instruction byte.
In a coprocessor, the instructions operate on data read from the coprocessor load
data queue. Data is written back, for example to memory or a register file, by
inserting the data into the out-of-order execution pipeline, either directly or via
the coprocessor store data queue, which writes back the data.
Ja’afar Bilal
IT5200: Definitions Assignment
Class: Tuesday & Thursday
Due: February 17, 2013
Multi-core Processor: A computer chip that contains two or more cores within the same
framework; each core can also be referred to as a central processing unit (CPU), which reads and
executes instructions from programs.
Within any computer system, a processor is considered to be the “brain” of the
computer. Figure 7.1 from the textbook highlights four main components within the CPU:
the Arithmetic Logic Unit (ALU), Control Unit (CU), registers, and an Input/Output
(I/O) interface.1 The ALU performs arithmetic and logic operations that calculate and manipulate
bits represented using binary code. The CU controls the instructions and data flow within
the CPU. The memory within the CPU consists of registers (not shown in Figure 7.1), which
store binary code used for computation. Finally, the I/O interface allows for the transfer of data
to peripheral devices.
The processor controls the computer by fetching and executing program instructions. In
brief, to execute an instruction, the Program Counter (PC), located within the CU, sends
the address of the instruction it holds to the Memory Address Register (MAR). The MAR
then sends that address over the address bus to the RAM. The RAM puts the
instruction on the data bus, which delivers it to the Memory Data Register (MDR). The MDR
holds the instruction, which is copied over to the Instruction Register (IR). Now the instruction is
decoded and executed, and the PC is incremented to point to the next instruction.
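The fetch steps above (PC to MAR, across the buses to the MDR, then into the IR) can be sketched with plain variables standing in for the registers. The instruction contents are made up for illustration.

```python
# Sketch of the fetch sequence: PC -> MAR -> RAM -> MDR -> IR.
# The RAM contents and instruction mnemonics are invented examples.

ram = {0: "LOAD 5", 1: "ADD 3", 2: "STORE 7"}
pc = 0

def fetch(pc):
    mar = pc           # PC sends the instruction's address to the MAR
    data = ram[mar]    # address bus -> RAM -> data bus
    mdr = data         # the RAM's reply lands in the MDR
    ir = mdr           # the instruction is copied into the IR for decoding
    return ir, pc + 1  # the PC advances to the next instruction

ir, pc = fetch(pc)
print(ir)  # LOAD 5
```

After the fetch, the control unit would decode and execute the contents of the IR while the PC already points at the next instruction.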
In a multi-core processor, each core consists of the same architecture. Each core has
access to the same input and output interface, programs, and memory.1 There is a wide range of
multi-core processors used in computer systems, including dual-core (2), quad-core (4), hexa-core (6),
and octa-core (8). See Figure 1.1 below. With multi-core processors, the workload is distributed
among the cores, improving the overall speed and efficiency of the computer system; it also allows
the computer to run multiple programs with ease. For example, if a computer has a quad-core
processor, it would be able to simultaneously run programs such as uTorrent, Skype, and
Microsoft Word, all while the anti-virus system is running in the background.
Technology companies such as Intel produce multi-core processors that contain 50 or
more cores within its framework; the Xeon Phi Knights Corner coprocessor is such a device.
Large businesses and companies such as CNN or Google may use Knights Corner to manage
large amounts of information and data within their server center.
1. The Architecture of Computer Hardware, Systems Software, & Networking. Englander,
Irv. Pg: 265.
Direct Memory Access (DMA): Allows devices to engage in block data transfer by directly
accessing the computer's main memory, bypassing the CPU. It is important to know that
DMA is the third method of data transfer; the first is programmed I/O and the second is interrupt-driven I/O.
The Xeon Phi Knights Corner coprocessor allows for DMA. DMA is designed to ease the
workload of the CPU during data transfer. This results in faster CPU processing and allows for
an easier method of accessing the main memory.
The only time the CPU is involved in DMA is when it initiates and ends the data transfer. During
DMA, the DMA Controller takes command until the CPU is interrupted when the data transfer is
complete.1 The DMA Controller acts as the I/O module because it is an interface between the CPU
and the I/O device.
These three conditions must be met for DMA to occur2:
1. The I/O device and the main memory must have a line of communication, which will be a bus.
2. The DMA Controller must read and write to the main memory.
3. The DMA Controller and CPU must avoid conflict.
In order for the DMA Controller to control the data transfer, it must know2:
1. The data location of I/O device.
2. The location of data in main memory.
3. The size of the data transferred.
4. The direction of transfer: from the I/O device to memory, or from memory to the I/O device.
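The four pieces of information above can be sketched as a simple transfer descriptor. The field names and the run_dma helper are hypothetical, not an actual Intel interface.

```python
# Sketch of a DMA transfer descriptor holding the four items a DMA
# controller must know. All names here are illustrative inventions.

from dataclasses import dataclass

@dataclass
class DMADescriptor:
    device_address: int   # 1. data location on the I/O device
    memory_address: int   # 2. location of the data in main memory
    length: int           # 3. size of the transfer, in bytes
    to_memory: bool       # 4. direction: device->memory or memory->device

def run_dma(desc, device, memory):
    """Copy bytes without CPU involvement (the CPU only built `desc`)."""
    if desc.to_memory:
        for i in range(desc.length):
            memory[desc.memory_address + i] = device[desc.device_address + i]
    else:
        for i in range(desc.length):
            device[desc.device_address + i] = memory[desc.memory_address + i]

device = bytearray(b"hello world")
memory = bytearray(16)
run_dma(DMADescriptor(0, 4, 5, True), device, memory)
print(memory[4:9])  # bytearray(b'hello')
```

In hardware the copy loop is what the controller performs on its own; the CPU's role ends once the descriptor is handed over, and it is interrupted only when the transfer completes.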
1. “Direct Memory Access (DMA),” www.ece.ubc.ca/~edc/379.jan99/lectures/lec13.pdf,
Date Accessed: February 13, 2013
2. The Architecture of Computer Hardware, Systems Software, & Networking. Englander,
Irv. Pg: 298-299.
Control Line: A type of conductor that is associated with buses.
The Xeon Phi Knights Corner coprocessor is connected to other parts of the computer
system (i.e., main memory, I/O modules, etc.) through what is known as a bus. A bus acts as a
“data highway” that makes data transfer possible. Within a bus, there are several conductors
that carry electrical signals; the signals represent bits of data. The control line is one of these
conductors; the three other types of conductors are the address, data, and power lines.
Control lines within a bus carry the read and write instructions for the data transfer and
specify the number of bytes to be transferred. A control line is necessary because many
different components communicate using the same bus.1
1. The Architecture of Computer Hardware, Systems Software, & Networking. Englander,
Irv. Pg: 214-215.
Dean Griffiths
IT5200 – Intro Platforms & OS
Xeon Phi Definition
Mask register
The Interrupt Mask Register, or mask register, is a read/write register within the Xeon Phi
coprocessor. It enables or masks interrupts from being triggered on the external pins of the cache
controller and selects the appropriate address depending upon which interrupt controller is to
be used. The register is an eight-bit register that lets you individually enable and disable
interrupts from devices on the system. Writing a zero to the corresponding bit enables that
device's interrupts; writing a one disables interrupts from the affected device.
Instruction set architecture (ISA)
The ISA, also called the CPU architecture, is the part of the computer architecture that forms a well-
designed hardware/software interface. Its characteristics include the number and types of
registers, the methods of addressing memory, and the basic design and layout of the instruction set. It
describes the operations, modes, and storage locations supported by the hardware, plus how to invoke
and access them. The ISA includes a specification of the machine language and the
commands implemented by a particular processor (such as the Xeon Phi). Though the ISA
interconnects hardware and software, it is part of the Application Binary
Interface (ABI), which provides a program with access to the hardware resources and services
available in a system.
Scalar processing
Scalar processing represents a class of computer processors that process one data item at a time,
typically an integer or floating-point number; this is also classified as “single instruction stream,
single data stream” (SISD).
The diagram shows the difference between scalar and superscalar processing. Where does this fit in
the Xeon Phi? With scalar execution, a single execution unit is used; provided that
different branch conditions are executed simultaneously, the CPU can average an
instruction execution rate approximately equal to the clock speed of the machine. With
multiple execution units (superscalar), it is possible to process instructions at an average
rate of more than one instruction per clock cycle.
References:
The Art of Assembly Language. (2006). Programming. Interrupts Traps and Exceptions (Part 3):
http://www.oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_17/CH17-
3.html
Irv Englander. (2009). The Architecture of Computer Hardware, Systems Software, and
Networking. Fourth edition
Wikipedia. Scalar processor: http://en.wikipedia.org/wiki/Scalar_processor
loc-nguyen. Intel® Xeon Phi™ Coprocessor Developer's Quick Start Guide:
http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide
ARM. The Architecture for the Digital World. 3.3.10. Register 2, Interrupt Mask Register:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0329l/Beiihfcc.html
Allan Snavely. Instruction Set Architecture: http://www.sdsc.edu/~allans/cs141/L2.ISA.pdf
Wikipedia. Instruction set: http://en.wikipedia.org/wiki/Instruction_set
Luke Varner
February 17, 2012
Xeon Phi Terminology
Cluster Environment
A cluster takes the classic definition of the word and simply applies it to computers.
Simply put, the book describes a cluster as a collection of "loosely coupled computers configured
to work together as a unit" (Englander, 2009). The concept of a cluster environment encourages
combining your computing power to distribute a workload across multiple devices. The
individual computers within the cluster are referred to as nodes and can be used to increase
overall processing power almost indefinitely. Cluster environment computing is primarily
"used for computation-intensive purposes, rather than handling I/O-oriented operations such as
web service or databases"("Computer cluster").
In the context of the Xeon Phi, clustering is intended for uses in supercomputing, where
grouping individual Xeon Phi coprocessors together achieves superior processing power
while reducing power consumption and energy costs. The latest commercially available version
of the Xeon Phi has 60 cores. Using just 10 individual Xeon Phi nodes together in a cluster
environment can increase processing power tremendously while taking up minimal space. With
just ten cluster units you would be harnessing the power of 600 cores! In fact, you
could describe the Xeon Phi processor as a cluster within a cluster due to the number of cores it
possesses. In the figure below we can see how the Xeon Phi could be used in a cluster
environment: Figure A shows a Xeon Phi processor on its own, and Figure B shows how it
could be combined with other processors.
Xeon Phi in a Cluster Environment
[Intel Phi Inner] Retrieved from: http://www.amax.com/hpc/images/intel_phi_inner.jpg
Hit Ratio
The hit ratio, also commonly known as the hit rate, is used in measuring cache performance
within a CPU. The hit ratio is the proportion of records found in the cache when executing a particular
process out of the total number of records requested ("Hit rate"). A high hit ratio (>90%) is
equated with better CPU cache performance, whereas a lower hit ratio can indicate problems with
your system configuration. When a miss is recorded, it leads to a stall in the execution of
instructions, which produces visible effects on performance.
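The ratio itself is simple arithmetic, sketched below; the sample hit and miss counts are made up for illustration.

```python
# Sketch of the hit-ratio calculation: hits divided by total accesses.
# A ratio above roughly 90% generally indicates the cache is effective.

def hit_ratio(hits, misses):
    total = hits + misses
    return hits / total if total else 0.0

print(hit_ratio(95, 5))   # 0.95 -- healthy cache behaviour
print(hit_ratio(60, 40))  # 0.6  -- many stalls while misses are serviced
```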
The Xeon Phi product description of cache performance mentions certain features
designed to reduce the number of misses, which effectively keeps the hit ratio high. It
works by recognizing when a miss has occurred and then querying tag directories in order to locate the
data and return it to the correct location. Because the Xeon Phi possesses a large
number of individual cores, in order to reduce the number of cache misses it has the ability to
access all of the individual cores to locate the missing data. These features in the Xeon
Phi reduce data redundancy and illustrate an intuitive piece of technology that can recognize and
fix its own mistakes. In the figure below, part A demonstrates how the L2 caches are all
interconnected so that requested data can be tracked on each core. Part B shows how the tag
data is searched to find the relevant data before moving on to the next core cache. Using this
feature should result in a high hit ratio for all of the cores and the system as a whole.
Xeon Phi Cache and Hit Ratio Performance
[Distributed Tag Directories] Retrieved from: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-
codename-knights-corner
Write Through
Write through is a process for storing data in which information is written to
both the main memory and the cache at the same time. This method allows for quick retrieval of
information and ensures the integrity of the data against power outages or system downtime. On
the other hand, due to the redundant nature of having to write to two locations before executing
another step, a sacrifice in system speed must be made (Gibilisco & Rouse,
2012). The write-through method emphasizes data consistency and integrity over speed.
The system developer's guide for the Xeon Phi states that only
uncacheable and write-back methods can be used, and that "the other three memory forms
[write-through, write-combining, and write-protect] are mapped internally to microcontroller
behavior" (Intel Corp, 2012). What is being described here is that those three memory
forms are automatically controlled by each of the 60 individual cores in a Xeon Phi coprocessor
and cannot be altered or changed. In the diagram below, you could say that A represents one of
the cores in the Xeon Phi; it shows the CPU and cache. Using the write-through method, the
data is written to the cache but also to the RAM, which is illustrated in part B.
Write Through in a Xeon Phi Core
[Write-through cache] Retrieved from: http://www.brainbell.com/tutors/A+/Hardware/Cache_Memory.htm
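The write-through policy described above can be sketched as a tiny cache model; the class name and addresses are illustrative only.

```python
# Sketch of write-through: every store updates both the cache and main
# memory in the same operation, so memory is never stale.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def write(self, addr, value):
        self.cache[addr] = value    # update the cache...
        self.memory[addr] = value   # ...and main memory immediately

mem = {}
c = WriteThroughCache(mem)
c.write(0x10, 42)
print(mem[0x10])  # 42 -- memory already consistent with the cache
```

The double update is exactly the redundancy the text mentions: consistency is guaranteed at the cost of performing two writes per store.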
Sources
Cache memory. (n.d.). Retrieved from http://www.brainbell.com/tutors/A+/Hardware/Cache_Memory.htm
Chrysos, G. (n.d.). Intel® xeon phi™ coprocessor (codename knights corner). Retrieved from
http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
Computer cluster. (n.d.). Retrieved from http://en.wikipedia.org/wiki/Computer_cluster
Englander, I. (2009). The architecture of computer hardware, systems software, & networking.
(4th ed.). Hoboken, NJ: John Wiley & Sons Inc.
Gibilisco, S., & Rouse, M. (2012, July). Write through. Retrieved from
http://whatis.techtarget.com/definition/write-through
Hit rate. (n.d.). Retrieved from http://www.answers.com/topic/hit-rate
Intel Corp. (2012, November 8). Intel® xeon phi™ coprocessor system software developers
guide. Retrieved from
Intel phi inner. (n.d.). Retrieved from http://www.amax.com/hpc/images/intel_phi_inner.jpg
IT 5200 Platforms & OS Lloyd Middlebrooks Xeon Phi Definitions Assignment February 4, 2013
Xeon Phi Definitions
Vector Processing Unit (VPU)

The VPU executes instructions that take the form of vectors. Vectors are one-dimensional arrays of data. In a vector processor, a single instruction operates simultaneously on multiple data items (arrays), whereas a scalar processor processes a single data element (integers and floating-point numbers) at a time. VPUs are used to perform essential numerical calculations associated with High Performance Computing (HPC).

Within the Xeon processor is the Xeon Phi coprocessor. The VPU is located inside each core of the coprocessor, as shown in Figure 1. The Xeon Phi's VPU features a 512-bit SIMD (Single Instruction, Multiple Data) instruction set and can execute 8 double-precision or 16 single-precision operations per cycle in its Arithmetic Logic Unit (ALU), as shown in Figure 2. The VPU benefits the Xeon processor by reducing the amount of fetch and decode operations that can incur higher energy costs and increase bandwidth.
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 1
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 2
Clock

The internal clock refers to an internal signal that operates at a constant frequency in a microprocessor and regulates the rate at which instructions are executed. Essentially, the clock controls the time when each step in the instruction cycle takes place and synchronizes various computer components. The clock rate is usually measured in megahertz (MHz) or gigahertz (GHz). The Xeon Phi has a clock speed of approximately 1.1 GHz, meaning that it can process 1.1 billion clock cycles per second. A unique characteristic of the Xeon Phi is that it has a gated clock. This means that when the Xeon processor's workload does not demand assistance from the Xeon Phi coprocessor (all four threads on a core are halted), the clock is gated, shown in red within Figure 3. Subsequently, the core is powered down after the programmed amount of time. This preserves power and ensures the Xeon processor is operating efficiently.
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 3
Thread

Englander (2009) defines a thread as an individually executable part of a process (executable program). Although threads can be scheduled to run separately from other threads in the same process, they can also share memory and other resources with those associated threads. Each core of the Xeon Phi coprocessor can support four threads in hardware; this is referred to as multithreading. During multithreading, the coprocessor often switches between threads (context switching). The threads are passed over the bus highlighted between the red lines in Figure 4.
software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 4

Sources:
Clock Speed. (2013). Retrieved February 17, 2013, from http://www.webopedia.com/TERM/C/clock_speed.html
Englander, Irv. (2009). The Architecture of Computer Hardware, Systems Software & Networking. Process Control Management (p. 493). Hoboken: John Wiley & Sons, Inc.
Gilbert, H. (2004, December 22). Clock Speed: Tell Me When It Hurts. Retrieved February 17, 2013, from http://www.yale.edu/pclt/PCHW/clockidea.htm
Intel Xeon Phi Coprocessor (codename Knights Corner). Retrieved February 17, 2013, from http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
Pratyusa Manadhata, Vyas Sekar. (2013). Vector Processors. Carnegie Mellon University. Retrieved February 17, 2013, from http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/vector.pdf
Product and Performance Information. Retrieved February 17, 2013, from http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
Thread. (2013). Retrieved February 17, 2013, from http://en.wikipedia.org/wiki/Thread_(computing)
Vector Processor. (2013). Retrieved February 17, 2013, from http://en.wikipedia.org/wiki/Vector_processor
SIMD
Single Instruction, Multiple Data is one of the classifications defined by Flynn. It is
based upon the number of concurrent instruction and data streams available in
the architecture. A SIMD computer exploits multiple data streams against a single
instruction stream to perform operations which may be naturally parallelized. An
array processor or GPU is a good example. The first use of SIMD instructions was
in vector supercomputers. Today, most commodity CPUs implement
architectures that feature instructions for a form of vector processing on multiple
data sets, typically known as SIMD.
The Xeon Phi coprocessor has 512-bit registers for SIMD operations, considered
one of its key features. High-performance codes on the Xeon Phi coprocessor
utilize these wide SIMD instructions to extract the desired performance level. The
best performance is achieved when the number of cores, threads, and SIMD
or vector operations are used effectively. The best method to take advantage of
the 512-bit wide SIMD instructions is to write code in an array notation style. When
array notation is used, the compiler will utilize the SIMD instruction set.
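The one-instruction-many-lanes idea can be sketched in plain Python, modeling a 512-bit register as 16 single-precision lanes. This only illustrates the concept; it is not how the compiler's array notation actually works.

```python
# Sketch of SIMD: one "instruction" (here, add) applied across all lanes
# of a wide register at once. A 512-bit register holds sixteen 32-bit
# single-precision lanes, modelled here as 16-element lists.

LANES = 16  # 512 bits / 32-bit single precision

def simd_add(a, b):
    """One vector 'instruction' producing all 16 results in a single step."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

a = list(range(16))
b = [10.0] * 16
print(simd_add(a, b)[:4])  # [10.0, 11.0, 12.0, 13.0]
```

A scalar (SISD) processor would need sixteen separate add instructions for the same result, which is where the bandwidth and fetch/decode savings come from.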
SMP
Symmetric Multiprocessing involves a multiprocessor computer hardware
architecture where two or more identical processors are connected to a single
shared Main Memory (MM) and are controlled by a single operating system. Most
common multiprocessor systems today use the SMP architecture. Usually each
processor has an associated private high-speed memory, known as cache memory,
to speed up MM data access and to reduce system bus traffic. Processors
may be interconnected using buses, crossbar switches, or an on-chip mesh network.
SMP systems allow any processor to work on any task no matter where the data
for that task is located in memory, provided that the same task is not in
execution on two processors at the same time. With proper operating system
support, SMP systems can easily move tasks between processors to balance the
workload efficiently. Software programs have been developed to schedule jobs so
that processor utilization reaches its maximum potential.
The Xeon Phi coprocessor runs Linux; it really is an x86 SMP-on-a-chip running Linux.
Every card has its own IP address. vSMP Foundation supports the Xeon Phi
coprocessor. This revolutionary coprocessor is based on Intel architecture and is
designed for highly parallel workloads. It supports the Xeon Phi coprocessor in
two modes: processor-virtualized mode and coprocessor-aggregated mode.
Technology Definitions
GDDR Memory
GDDR memory is used primarily for graphics processing and is a part of the
circuitry on a graphics board (PCI Express). GDDR3 and GDDR4 are
the two current standards in use and are differentiated by performance and price,
with GDDR4 being the higher performer along with the higher price. This is not
related in any way to the numbering of processor storage.
AMD has developed a new graphics memory chip named GDDR5, keeping in line
with its last two offerings. GDDR5 has twice the speed of GDDR3 due to a wider
bandwidth. It was originally developed by AMD for use with graphics processing,
giving more speed and higher quality. As you can see from the screen shot below,
Intel not only uses it for traditional graphics processing but has created a high-speed
PCI ring interconnect to use for connecting each coprocessor and each core directly.
References:
Intel: Intel® Xeon Phi™ Coprocessor
AMD Corp.
ExtremeTech: GDDR5 Memory – Under the Hood
Branch History Table
A branch prediction circuit was developed by HP to try to predict which way an IF-
THEN-ELSE branch will go prior to its execution. The circuit fetches the instruction
it predicts before the location counter actually points there, and that instruction is
"speculatively executed". If the choice turns out to be wrong, the speculative work is
discarded and the correct instruction is put back into the pipeline. The results of
these operations are kept in the Branch History Table (frequently called something
similar, but not necessarily the same). The larger and more populated the table, the
better the hit ratio of predictions.
I did not find a reference to this in the Xeon Phi documentation, but Intel probably has
something similar, as the article leads one to believe that this circuit is very commonly
used. I found another reference in an IBM publication which referred to the whole
process as the Branch History Table; it is probably the best source to read if you have
further interest.
Sources
Wikipedia: Branch predictor
T. J. Watson Research Center: Branch History Table Prediction of Moving Target
Branches Due to Subroutine Returns
Write-Back
This is a performance technique where, when data is updated, the most current copy is updated in cache rather than in main memory; the modified line is written back to main memory only later, when it is evicted. It is used when the risk of losing current data does not outweigh the gains in performance. This can be contrasted with the "write-through" technique, where the data is updated in both cache and main memory. Write-through is obviously safer but also significantly slower.
Sources
Microsoft: Write Back - Part of the Backup and recovery glossary
Xeon Phi Definitions
Extended Math Unit (EMU): the EMU is a shared function that handles more
complex math operations on behalf of several Execution Units (EUs). It provides a
faster implementation of the single-precision transcendental functions, such as
reciprocal, reciprocal square root, base-2 logarithm, and base-2 exponential, using
lookup tables, and it performs these operations inside the vector processing unit. It
is implemented in hardware and can achieve a high throughput of one or two cycles
for other transcendental functions that can be derived from these elementary
functions. In the Xeon Phi coprocessor it performs the same role: it allows these
operations to be done in a vector fashion with high bandwidth, and it also calculates
the polynomial approximations of these functions. For example, the figure below
shows its operation.
Function      Math       Throughput (cycles)
Reciprocal    1/x        1
Recip. Sqrt   1/√x       1
Exp2          2^x        2
Log2          log₂ x     1
Figure 1: A simple EMU Calculation
References
http://software.intel.com/en-us/articles/achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
http://software.intel.com/en-us/articles/case-study-achieving-superior-performance-on-black-scholes-valuation-computing-using
Instruction Unit: implements the basic instruction pipeline, fetching instructions from
the memory subsystem, dispatching them to available execution units, and maintaining a
state history to ensure that the operations finish in order. It is also used in executing
conditional branch and unconditional jump instructions. The Instruction Unit is part of
the Control Unit, which in turn is part of the CPU; it handles all the preparation of
instructions for execution and is responsible for organizing the program instructions
that are to be fetched and executed in an appropriate order. The instruction unit
performs these operations in both mainstream Intel processors and the Xeon Phi
coprocessor.
Figure 2: Shows how the Instruction Unit operates
This diagram was retrieved from http://openrisc.net/or1200-spec.html
References
http://openrisc.net/or1200-spec.html
http://en.wikipedia.org/wiki/Instruction_unit
Miss: this term is used with cache memory; the event is usually called a cache miss.
A miss is a situation in which the requested data is not already present in cache
memory, i.e., the processor is accessing a memory location that is not in the cache.
When this happens, the processor must wait for the data to be fetched from the next
cache level or from main memory before continuing its execution; for this reason,
cache misses directly influence an application's performance. In the Xeon Phi, a
cache miss causes the core to generate an address request on the address ring,
which then queries the tag directories. If no data is found in the tag directories, the
core has to generate another address request and query memory for the data. The
diagram below shows the miss rate versus cache size on the integer portion of
SPEC CPU2000.
Figure 3: Miss Rate versus cache size on the Integer portion of SPEC CPU2000
This diagram was retrieved from http://en.wikipedia.org/wiki/CPU_cache#Cache_miss
References
http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/miss_ratio.html