CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
PCIE CONFIGURATION FOR DATA TRANSFER AT RATE OF 2.5-GIGA BYTES PER
SECOND (GBPS): FOR DATA ACQUISITION SYSTEM
A graduate project submitted in partial fulfillment of the requirements
For the degree of Master of Science in
Electrical Engineering
By
Avani Dave
May 2013
The graduate project of Avani Dave is approved:
Dr. Emad El Wakil Date
Dr. Somnath Chattopadhyay Date
Dr. Nagi El Naga, Chair Date
California State University, Northridge
ACKNOWLEDGEMENT
I wish to express my appreciation to those who have served on my graduate project committee.
First, I would like to thank Dr. Nagi El Naga for his valuable advice and guidance throughout
my project. His constant support and encouragement helped me learn a great deal over the course of
the project. He was instrumental in providing not only guidance but also the inspiration that I
needed.
I also wish to express special appreciation to Mr. Emil Henry and Mr. Armando Tellez of the
Information and Technology Department for their help and support with driver installation.
Special thanks to Dr. Somnath Chattopadhyay and Dr. Emad El Wakil for their valuable
comments on my work.
I also wish to express special appreciation and thanks to the engineers of
Xilinx support, who helped me throughout my project. Their input, comments
and guidance helped immensely in learning the finer details of the subject, grasping new ways of
learning concepts and successfully completing this project.
Finally, the patience and support of my parents, family and friends have been enormously
important to me while I have been engaged in the graduate project. Thanks to all of you.
TABLE OF CONTENTS
SIGNATURE PAGE
ACKNOWLEDGEMENT
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Basic concept of data acquisition system
1.2 Objective
1.3 Project Outline
CHAPTER 2 DATA ACQUISITION SYSTEM FOR 32 X 32 PHOTO DETECTOR ARRAY
2.1 Top Level Architecture
2.2 Detailed Design Implementation
2.3 Specifications and Features
CHAPTER 3 INTRODUCTION TO PCIE
3.1 Background of Computer Bus Systems
3.2 Why to choose PCIE
3.3 PCIE Basics
3.3.1 Working Principle
3.3.2 Differential Signaling
3.3.3 Lanes
3.3.4 Transmission Rates
3.4 LogiCore PCIE Block
3.5 Protocol Layers
CHAPTER 4 IMPLEMENTATION OF PCIE
4.1 Hardware Setup
4.2 Driver’s Installation
4.3 Software Installation
4.4 Logic Core PCIE Generation
4.5 Programming ML605
CHAPTER 5 TESTING AND VERIFICATION
5.1 PCIE Functional Testing
5.2 Simulation
CHAPTER 6 CONCLUSION AND FUTURE SCOPE
REFERENCES
LIST OF FIGURES
Figure 1.1 Block diagram of Data Acquisition System
Figure 2.1 Top Level Design of Data Acquisition System
Figure 2.2 Detailed Design Implementation of data Acquisition system
Figure 2.3 Pulse Data Transfer
Figure 3.1 Various buses connected to the CPU
Figure 3.2 Comparison of technology and data transfer rate
Figure 3.3.1 PCIE and motherboard socket connector
Figure 3.3.2 A Differential signal pair and subtractor
Figure 3.3.2 B Signal pulse and noise subtraction
Figure 3.3.3 A PCIE x1 four-wire lane configuration
Figure 3.3.3 B PCIE lane connectors with speed grade
Figure 3.4 Functional Block Diagram and interfaces for logiCORE IP
Figure 3.5 Protocol Layers
Figure 4.1 PCIE x8 slot connected to ML605 (Lab server)
Figure 4.2 Driver’s installation and com-port opening
Figure 4.3 A Tera-term connections
Figure 4.3 B DIP switch S1 setting (1000)
Figure 4.3 C Compact Flash Card Insertion
Figure 4.3 D Built In System Test menu (BIST)
Figure 4.4.1 A Core Generator-New Project
Figure 4.4.1 B Select Device Parameter
Figure 4.4.1 C Customizing PCIE Core
Figure 4.4.1 D Clock Parameter Setting
Figure 4.4.1 F Base Address Register (BAR) setting
Figure 4.4.1 G Vendor ID Setting
Figure 4.4.1 H Xilinx development Board ML605 selection
Figure 4.4.1 I Reference Clock Frequency Select
Figure 4.4.1 J PCIE core generate
Figure 4.4.1 K Project IP
Figure 4.4.1 L Script to Generate Core
Figure 4.4.1 M Script to Make routed.bit
Figure 4.5 A S1 and S2 switch settings
Figure 4.5 B iMPACT
Figure 4.5 C PROM
Figure 4.5 D Setting PROM Parameters
Figure 4.5 E Setting PROM Parameters
Figure 4.5 F Setting PROM Parameters
Figure 4.5 G Loading .bit file
Figure 4.5 H BPI settings
Figure 4.5 I Generate File
Figure 4.5 J Boundary scans
Figure 4.5 K Initialize chain
Figure 4.5 L Select Device xc6vlx240t
Figure 4.5 M SPI/BPI PROM setting
Figure 4.5 N Programming Flash
Figure 5.1 A PCI TREE
Figure 5.1 B Intel’s x8086 PCIE bridge device
Figure 5.1 C Xilinx’s ML-605 device
Figure 5.1 D Configuration registers
Figure 5.1 E Configuration registers editing
Figure 5.1 F Configuration Register Read
Figure 5.1 G Memory read (BAR)
Figure 5.1 H Memory read (Content)
Figure 5.1 I Memory write
Figure 5.1 J Memory write using file
Figure 5.1 K Memory writes using count value
Figure 5.1 L Image Transfer
Figure 5.2 A Design compiling
Figure 5.2 B Detailed Analysis Report
Figure 5.2 C Configuration Register Read
Figure 5.2 D BAR Data Read
Figure 5.2 E Data TLP
Figure 5.2 F Physical Layer Memory Read Packet
Figure 5.2 G Physical Layer Data TLP
ABSTRACT
PCIE CONFIGURATION FOR DATA TRANSFER AT RATE OF 2.5-GIGA BYTES PER
SECOND (GBPS): FOR DATA ACQUISITION SYSTEM
By
Avani Dave
Master of Science in Electrical Engineering
In the modern era, most consumer and industrial electronic devices use high speed
data transfer. High speed data transfer can be achieved with different protocols such as I2C, SPI,
AGP, PCI, and PCIE. A 32 x 32 photo detector data acquisition system is a group project which
needs high speed data transfer at an approximate rate of 2.5 Giga Bytes per Second (GBPS).
This project is part of that group project. It includes the research work of selecting the right
protocol, hardware and software for such high speed data transfer, followed by the
detailed configuration and implementation of Xilinx’s Virtex 6 FPGA integrated block for PCI
Express v1.3 with x8 Gen 1. Testing and verification of the PCIE core were performed using Xilinx’s
ChipScope Pro, ISE simulation, the DMA performance demo application xapp1052 and a third party
tool, PCITREE. The core is generated and simulated in Verilog. The PCIE
core performance achieved is 2.3 GTPS for read and 2.8 GTPS for write using the DMA
performance demo xapp1052. This module and the instantiated core can be used at five different
levels for data transfer at a rate of ~2.5 GTPS for the main project after making the EZDMA module
and main memory mapping.
CHAPTER 1
INTRODUCTION
High speed, controlled data processing and communication are in demand by today’s users.
The process of collecting, processing, controlling and transferring data is, in simple terms,
called a “Data Acquisition System”. Speed is of the essence when it comes to transferring data in
any electronic device today. Users also want more controlled data at the fastest
possible communication speed. Many high speed data transfer protocols are available today, such
as Fibre Channel, Peripheral Component Interconnect (PCI), Peripheral Component Interconnect
Express (PCIE), Accelerated Graphics Port (AGP), Serial Peripheral Interface (SPI),
Universal Serial Bus (USB), wireless, and Ethernet. The user can select the protocol for communication
based on the application and the speed needed for data transfer. Selecting the right protocol is one of
the essential building blocks for high speed data transfer, which in turn is one of the
major requirements for any data acquisition system.
Our group project is a 32 x 32 data acquisition system, which needs high speed data
transfer at an approximate rate of 2.5 Giga Bytes per second (GBPS). By taking part in this group
project, we had a chance to learn the new technologies available in the market for high speed data
transfer and to make a practical implementation work for our project. We also had an
opportunity to apply the digital fundamentals, programming knowledge, and problem solving
skills that we gained during our master’s course work in engineering at California State
University, Northridge.
1.1 Basic concept of data acquisition system
The data acquisition system has three main functions. The first is to collect data. The data
could be manmade or naturally occurring, such as temperature, humidity, wind speed, or an image. Data is
usually analog in nature. Transducers or sensors are used to obtain the data in the form of electrical
signals. The second function is to convert the data to the appropriate format. The main advantages of
digital data are better immunity to noise and ease of storage, reproduction, and high speed processing
and transfer, so the analog electrical signal needs to be converted into digital form. This is done by
analog to digital converters (ADC) of different bit widths and speed grades. The third function is to
control, process and transfer the data. Data control and processing is in most cases
done by personal computers, servers or controller units. Data transfer is done by various
methods or protocols such as I2C, SPI, USB, Wi-Fi, Ethernet, PCI, and PCI-Express (PCIE).
Figure 1.1 Block diagram of Data Acquisition System (blocks: transducer, ADC, data processing, server)
Figure 1.1 is a simplified block diagram of a data acquisition system. Sensors or transducers
convert the physical phenomena to electrical signals, i.e. voltage or current, or into an
electrical property such as resistance or capacitance. The electrical signal is then converted to digital
form by an Analog to Digital Converter (ADC), after which it can be used for further
processing and transferred at high speed. The digitized multi-bit data is stored in
memory or passed to a control and processing unit such as a computer.
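The analog to digital conversion step can be illustrated with a minimal sketch of ideal quantization. The 12-bit resolution matches the converter described later in this report; the 2.0 V full-scale reference and the function name are assumptions made only for illustration and are not taken from the project.

```python
def quantize(voltage, v_ref=2.0, bits=12):
    """Map an analog voltage in [0, v_ref) to an unsigned ADC code.

    v_ref is a hypothetical full-scale reference voltage, not a project value.
    """
    levels = 1 << bits                      # 4096 codes for a 12-bit converter
    code = int(voltage / v_ref * levels)    # ideal, noise-free conversion
    return max(0, min(levels - 1, code))    # clamp out-of-range inputs


if __name__ == "__main__":
    for v in (0.0, 0.5, 1.0, 2.5):
        print(f"{v:.2f} V -> code {quantize(v)}")
```

A real converter adds sampling, noise and nonlinearity on top of this ideal mapping; the sketch only shows how a wider code (more bits) gives a finer representation of the same voltage range.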
Depending on the application, various algorithms are used to process and represent the data in a
meaningful way. The size and complexity of data acquisition systems vary with the application
and cost. For applications such as medical, military or space systems, accuracy
is extremely critical. Therefore, these systems have complex signal conditioning and use
a higher resolution to represent analog data. In other applications, such as consumer
appliances or vehicles, cost matters more than accuracy. In addition to the above mentioned
applications, data acquisition systems are found in areas such as monitoring weather or other
geological phenomena, photography, sport, testing, verifying and controlling prototypes,
robotics, communication monitoring, and controlling industrial machines.
1.2 Objective
The objective of this graduate project is to design and implement a Xilinx Virtex-6 FPGA based
PCIE interface that achieves a data transfer rate of 2.5 GBPS on the ML605 evaluation kit. This can be
used for our group project, the data acquisition system. Here, we got a chance to learn about the PCIE IP
core and to configure the hardware and software setup for the PCIE core
generator IP. First, we had to research the details, starting from the hardware and
Operating System (OS) selection needed to make PCIE work, through the driver and software
installation, the configuration of the IP core’s internal registers, and the selection of the lane width,
to the final testing and verification of the PCIE core module. We are going to use this PCIE core
generator module at five different places in the main project for high speed data transfer.
This project is a group effort in which we can apply the concepts and logic design knowledge
that we learnt during our master’s course work to implement the system. The main project is a
data acquisition system capable of reading and controlling data from an array of 32 x
32 photo diodes. The photo diode outputs are given to ADCs, which generate 12-bit digital data for each
analog sample. We have to rescan the 32 x 32 frame 18 times in one second, so a very large
amount of data is generated by such a system and needs to be processed at high speed.
Also, between two bursts of 18 frames there should be a significant amount of time to transfer
and process the collected data from on board memory to a computer. The detailed specifications
are discussed in the next chapter.
1.3 Project Outline
This report is organized as follows. It begins with a general description of data
acquisition systems, the objective of this project and the outline of the report. Chapter
II discusses the main data acquisition system project, its top level and detailed design, and lists
some of its specifications and features. Chapter III covers the details of PCI and PCIE data
transfer. Chapter IV explains in detail the implementation of the PCIE interface on the ML605 Virtex-6
Xilinx FPGA. Chapter V focuses on testing and transferring data using PCIE. Chapter VI
summarizes the results of the proposed PCIE methods and provides some thoughts on future
work.
CHAPTER 2
DATA ACQUISITION SYSTEM FOR 32 X 32 PHOTO DETECTOR ARRAY
The project of the 32 x 32 photo detector array based data acquisition system started with the
research work of selecting the best architecture, devices, components, and protocols for the design.
The main goal is to collect data from the 32 x 32 photo detector array, store it in the levels of FPGAs
for future processing, and then transfer it at ~2 GBPS. This chapter gives an overview of our
main group project, the “32 x 32 data acquisition system”, by providing the top level design, the
detailed architecture, and some of the important features and specifications.
2.1 Top Level Architecture
Figure 2.1 shows the top level architecture of the data acquisition system. The complete data from the
32 x 32 photo detector array is called one frame. The frame is divided into four segments, called
segment-0, segment-1, segment-2, and segment-3. Each segment has 4 x 4 =
16 photo detector modules. Each module has 4 x 4 = 16 photo detectors.
Figure 2.1 Top Level Architecture of Data Acquisition System
As can be seen from Figure 2.1, each segment has its own FPGA for processing and controlling
data. Each photo detector’s output is given to its respective 12-bit analog to digital converter.
The outputs from the ADCs of the respective photo detector modules and segments are given to the
respective FPGA’s BRAMs, indicated in Figure 2.1 by the FPGA blocks 1, 0, 2 and 4. Then,
based on the 4 x 1 multiplexer logic of the level-2 FPGA, one FPGA from level-1 is selected to
transfer data for the selected row, and that data is stored in the second level FPGA. The data is then
read and transferred using PCIE at the rate of ~2.5 GBPS to the server for monitoring.
2.2 Detailed Design Implementation
Figure 2.2 Detailed Design Implementation of data Acquisition system
(Blocks shown: 32x32 photo detector array, ADS5281 ADC, SPI, CLK generators with 12 MHz and 144 MHz clocks, 12-bit digital data at 144 MHz, FPGA-0 controller with BRAM and EZDMA, 4x1 MUX, level-2 controller with BRAM, PCIE links at 2.5 GBPS, and a server to control data.)
The detailed design implementation of the project is illustrated in Figure 2.2. First, the data
is scanned from the photo detectors and sent via the SPI bus to the ADS5281. Second, the EZDMA
control component is configured to use a TLP payload size of 256 bytes at a frequency of 250 MHz. It
also supports four outstanding requests, one DMA channel, a local address width of
16 bits, a local memory read latency of one clock cycle, and communication link control to the PCIE
bus. The third task is to configure the PCIE IP core to perform FPGA to FPGA data transfer at the
rate of ~2.5 GTPS with the Virtex-6 based ML605 in PCIE x8 Gen1 mode. The same procedure is repeated
for all four FPGAs (segments of the photo detector array).
On the second-level FPGA, the main function is to receive the data via the PCIE bus from all four
FPGAs, namely FPGA-0, FPGA-1, FPGA-2, and FPGA-3, based on the 4 x 1 multiplexer
selection logic, so that accurate line information is reproduced. This data is stored in
BRAM. It can then be transferred at the required clock speed via PCIE for final reproduction on
screen.
2.3 Specifications and Features
We have a 32 x 32 array of photo detectors spaced at 5 mm. A pulse comes in to initiate sampling at
about 1 kHz. We take 18 samples of each frame. Each frame contains data from the 32 x 32
photo detectors, which are captured at 3 MHz. Data must be sampled, digitized, read and
sent out before the next data set begins. Data will be mixed into a single data stream by a data
recorder at about ~2.5 GBPS. It takes 1.5 us to collect the 18 sample frames, as shown in Figure
2.3.
Figure 2.3 Pulse Data Transfer (annotation: 1.5 µs to take 18 sample frames)
Each module has 4 x 4 = 16 photo detectors, so each segment carries the information of 16 x 16 = 256
photo detectors. With four such segments, one frame carries the information of 256 x 4 = 1024 photo
detectors. Each frame is scanned 18 times, which means a total of 1024 x 18 =
18432 samples are given to the analog to digital converter IC ADS5281. The
ADS5281 is a 12-bit, octal-channel ADC at 65 MSPS. It converts each analog input sample to a
12-bit digital output, which makes 18432 x 12 = 221184 bits of total data for each sample pulse. The
data must be sampled, digitized and read out before the next data set begins. Data should be
transmitted at ~2.5 GBPS on a single stream.
In total, each of the 32 x 32 photo detectors generates a 12-bit digital value to complete one frame
scan. This is repeated 18 times, as each frame is scanned 18 times. Thus, a total of
221184 bits (27648 bytes) is generated per sample pulse. This data is transmitted at 2 GBPS =
2,147,483,648 bytes per second.
It takes 27648 / 2,147,483,648 = 12.875 microseconds to transmit 18 sample frames of data as per the
above calculation. From this, we can calculate the time each FPGA has for processing data. We have
1 msec per sample pulse, it takes 1.5 usec to collect 18 frames, and it takes 12.9 usec to transmit this
data on the serial interface at 2 Giga Bytes per second (GBPS). The data acquisition system therefore
has about 0.9856 msec to process the data generated during each sample pulse.

  0.0010000  (1 msec per sample pulse)
- 0.0000015  (1.5 usec to collect 18 frames)
- 0.0000129  (12.9 usec for serial transfer at 2 GBPS)
-----------
  0.0009856  seconds to process collected data

Lastly, the number of bits to be processed by each FPGA during this time is calculated as follows.
Data per FPGA segment is 55,296 bits per sample pulse.

     18  samples
x    12  bits per sample
x   256  photo detectors per FPGA
-------
 55,296  bits per FPGA per sample pulse
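The timing budget above can be checked with a short script. This is a sketch using only the figures quoted in this chapter; the constant names are illustrative, and the link rate is assumed to be 2,147,483,648 bytes per second, which is the interpretation that reproduces the quoted 12.875 microsecond transfer time.

```python
# Timing budget for one sample pulse, from the figures quoted in the text.
DETECTORS = 32 * 32              # photo detectors per frame
FRAMES_PER_PULSE = 18            # each frame is scanned 18 times
BITS_PER_SAMPLE = 12             # ADC resolution
FPGAS = 4                        # one level-1 FPGA per segment
LINK_BYTES_PER_S = 2 * 2**30     # assumption: 2 GBPS = 2,147,483,648 bytes/s
PULSE_PERIOD_S = 1e-3            # 1 kHz sample pulse
COLLECT_TIME_S = 1.5e-6          # time to collect 18 frames

bits_per_pulse = DETECTORS * FRAMES_PER_PULSE * BITS_PER_SAMPLE
bytes_per_pulse = bits_per_pulse // 8
bits_per_fpga = bits_per_pulse // FPGAS
transmit_time_s = bytes_per_pulse / LINK_BYTES_PER_S
processing_time_s = PULSE_PERIOD_S - COLLECT_TIME_S - transmit_time_s

if __name__ == "__main__":
    print(f"bits per sample pulse : {bits_per_pulse}")
    print(f"bits per FPGA         : {bits_per_fpga}")
    print(f"transmit time         : {transmit_time_s * 1e6:.3f} us")
    print(f"time left to process  : {processing_time_s * 1e6:.1f} us")
```

Changing `LINK_BYTES_PER_S` or `FRAMES_PER_PULSE` shows how quickly the processing window shrinks or grows for other link rates and burst lengths.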
CHAPTER 3
INTRODUCTION TO PCIE
In today’s era of communication, high speed data transfer systems are a must. For
high speed data transfer, the key features are the data transfer rate, the method of communication
and the configuration protocol. Of these, the data transfer rate depends mainly on the data
transfer protocol and the method of communication. These days, all high speed data
communication is digital. Digital data communication is more secure and less
affected by noise. The important thing is to select the protocol used for communication.
A protocol is essentially a “set of rules”; here, it refers to the set of rules for data transfer.
There are many protocols, such as SPI, I2C, PCI, PCIE, and USB. The data transfer rate also
depends on the configuration of the particular protocol used. For example, PCIE Gen1 can transfer at
a rate of 2.5 GTPS and Gen2 at a rate of 5.0 GTPS when configured for one lane (x1). If we
increase the number of lanes, the bandwidth increases accordingly.
Our project requires serial data transfer at a high speed of approximately 2.5 Giga Bytes
per Second (GBPS). Which protocol to use can be decided after gaining an understanding
of the different computer bus systems and the data transfer rates of different protocols. In this chapter,
we study the basics of computer bus systems, why PCIE was selected, some basic concepts
of PCIE such as its working principle, differential signaling, lanes and transfer rates, the LogiCORE
IP details and operation, and the protocol layered architecture and protocol overhead.
3.1 Background of Computer Bus Systems
Twenty to thirty years ago, the bus and the processor ran at the same speed and were therefore able to
synchronize. Synchronization was also the reason older computers had only one bus.
Today, processor speed increases exponentially with every new generation of computer systems.
To communicate with a high speed processor, most of today’s computer systems
have two or more buses, each handling a specific kind of traffic. Figure 3.1 shows how various
buses are connected to the CPU.
Figure 3.1 Various buses connected to the CPU
A classic PC has two main types of buses. The first is called the “system bus” or “local bus”,
which connects the system memory and the microprocessor (central processing unit). This is
the fastest bus and is sometimes called the backend bus, as it is close to the CPU and system memory.
The other type of bus is slower and communicates with hard disks, sound cards, etc. It
is also called the front end bus, as it is near the interfacing devices. These slower buses connect to the
system bus through a bridge, which transfers and integrates data from the other buses onto the
system bus. The PCI bus standard is one of these front end bus systems.
There are other bus standards as well. For example, the Universal Serial Bus (USB) is the simplest and
cheapest way to connect things like cameras, scanners
and printers to a computer. It uses a thin shared cable configuration, so that many devices can be
connected simultaneously. Another example is the FireWire bus, which is mostly used today to connect
video cameras and external hard drives.
3.2 Why to choose PCIE
The data acquisition system project needs high speed data transfer at all levels and between
FPGAs at an approximate rate of ~2 GBPS. We therefore researched the best available bus standards,
protocols and data transfer rates. Figure 3.2 shows a tabular comparison of technologies and data
transfer rates.
Figure 3.2 Comparison of Technology and Data Transfer Rate
From the above table, it is seen that PCI Express 1.0 (x1 link) can fulfill the project need.
3.3 PCIE Basics
Peripheral Component Interconnect Express (PCIE) is a general purpose I/O interconnect standard.
It is an enhanced version of the PCI bus standard and is more economical
than PCI-X. As the name implies, it is a peripheral
device interconnect bus standard. PCIE replaces the parallel bus architecture of older standards, such
as PCI and PCI-X, with a new scalable serial point-to-point interface with packet based
transmission.
PCI Express is a high-speed serial connection that operates more like a network than a bus.
PCIE has a switch that controls several point-to-point serial connections,
which fan out from the switch and lead straight to the devices where the data needs
to go. Each device has its own dedicated connection, so PCIE does not share bandwidth the way a
normal bus does.
3.3.1 Working Principle
When the computer starts up, PCIE determines which devices are connected to the mother
board and establishes the links between them. It directs the flow of traffic and negotiates the width of
each link. The identification of devices and connections is carried out by the drivers for the PCIE
devices. Figure 3.3.1 shows the PCIE socket and motherboard connection for PCIE x1 and PCIE
x16 add-in card connectors.
Figure 3.3.1 PCIE and motherboard socket connector
3.3.2 Differential Signaling
PCIE uses differential signaling, a technique which uses two transmission lines to send
one signal. The two lines carry the positive and negative versions of the signal, respectively. The
information is transmitted on both lines, and at the receiver side the two are
subtracted to recover the original signal. Noise that couples equally onto both lines is removed by
the subtraction, which makes the technique highly effective for noise cancellation. Figure
3.3.2 A shows the differential signaling technique and Figure 3.3.2 B shows how noise is
subtracted by the differential technique. Each such two-wire differential pair carries data in one
direction of a PCIE lane.
Figure 3.3.2 A Differential signal pair and subtractor
Figure 3.3.2 B Signal pulse and noise subtraction
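The noise cancellation described above can be sketched numerically. This is an idealized model, not PCIE electrical behavior: the signal levels of +0.5 and -0.5, the uniform noise model and the function name are assumptions chosen for illustration.

```python
import random


def transmit_differential(bits, noise_amplitude=0.4, seed=0):
    """Send bits over an idealized differential pair and recover them.

    The same (common-mode) noise sample is added to both wires, so it
    cancels when the receiver subtracts one wire from the other.
    """
    rng = random.Random(seed)
    recovered = []
    for bit in bits:
        level = 1.0 if bit else -1.0
        noise = rng.uniform(-noise_amplitude, noise_amplitude)
        d_plus = level / 2 + noise       # positive wire
        d_minus = -level / 2 + noise     # negative wire, same noise
        difference = d_plus - d_minus    # noise term cancels here
        recovered.append(1 if difference > 0 else 0)
    return recovered


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    print(transmit_differential(data, noise_amplitude=5.0) == data)
```

Even when the noise amplitude is much larger than the signal swing, the recovered bits match the transmitted bits, because only noise that differs between the two wires survives the subtraction.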
3.3.3 Lanes
A PCIE lane has two pairs of wires: one pair to send and the other to receive packets.
Packets of data move across the lane at a rate of one bit per cycle. The smallest
PCIE connection has one lane, made up of four wires, as shown in Figure 3.3.3 A.
Figure 3.3.3 A PCIE 4 wire lane configuration for x1
PCIE is a serial data transfer protocol, but it provides separate paths for individual signals, which
are called lanes. Depending on the motherboard and the PCIE socket, the number of lanes is limited to
x1, x2, x4, x8, x16, or x32. Figure 3.3.3 B shows the socket types for x1, x4,
x8 and x16 lanes, respectively, with their speed grades.
Figure 3.3.3 B PCIE Lane connectors with speed grade
3.3.4 Transmission rates
A first generation PCIE lane has a raw data transmission rate of 2.5 Giga transfers per second (GTPS)
in each direction. The aggregate raw bandwidth of a link is 2.5 GT/s multiplied by
the number of lanes. Second generation PCI Express devices (version 2.0 or higher) may
optionally transmit at 5.0 GT/s per lane, but are backwards compatible with the first generation
transmission rate.
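The lane arithmetic can be sketched as follows. The raw rates are the ones given above; the 8b/10b line code (8 data bits carried in every 10 transferred bits) is the standard encoding for first and second generation PCIE and is added here as background, not taken from this report. The function name is illustrative, and the result excludes packet header overhead.

```python
RAW_RATE_GT_S = {1: 2.5, 2: 5.0}   # raw rate per lane, per direction
ENCODING_EFFICIENCY = 8 / 10       # 8b/10b line code used by Gen1 and Gen2


def link_bandwidth(lanes, generation=1):
    """Return (raw GT/s, data Gbit/s, data GByte/s) for one direction."""
    raw_gt_s = RAW_RATE_GT_S[generation] * lanes
    data_gbit_s = raw_gt_s * ENCODING_EFFICIENCY
    data_gbyte_s = data_gbit_s / 8
    return raw_gt_s, data_gbit_s, data_gbyte_s


if __name__ == "__main__":
    for lanes in (1, 4, 8, 16):
        raw, gbit, gbyte = link_bandwidth(lanes)
        print(f"Gen1 x{lanes}: {raw:.1f} GT/s raw, "
              f"{gbit:.1f} Gbit/s, {gbyte:.2f} GByte/s")
```

Under these assumptions an x8 Gen1 link carries about 2 GByte/s of encoded data per direction before packet overhead, which is the order of magnitude the project targets.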
3.4 LogiCore PCIE Block
The Xilinx Virtex-6 has an Integrated Block for PCI Express. Before implementing it using the IP core
generator in the next chapter, which is the main part of the project, we need to understand the detailed
architecture of PCIE and how it functions. Figure 3.4 shows the top level functional block
diagram and interfaces of the LogiCORE IP Virtex-6 FPGA Integrated Block for PCI Express core.
The LogiCORE IP v6 core for PCIE internally instantiates the integrated block for PCIE (PCIE_2_0),
which consists of the Physical, Data Link and Transaction layers based on the PCIE base specification
layering model. The PCIE block, when invoked, generates five interfaces, namely:
1) System (SYS) interface
2) PCI Express (PCI_EXP) interface
3) Configuration (CFG) interface
4) Transaction (TRN) interface
5) Physical layer (PL) control and status interface
The system interface consists of the reset and the system clock. The reset signal is asynchronous
and active low. The system clock should be 100 MHz, 125 MHz, or 250 MHz. The PCI Express
(PCI_EXP) interface consists of differential signal pairs for transmitting and receiving data on
multiple lanes. The Transaction (TRN) interface generates and processes the transactions for each
lane. The Physical Layer (PL) interface checks the status of the link and provides control of link
transfers.
Data is transferred between the modules inside the core using information packets. These packets
are generated at the Transaction and Data Link layers to convey the necessary information from
the transmitter to the receiver. Each of these layers adds its own header, length, and parity
information for reliable communication between the transmitting and receiving components. At the
receiver side, each layer that receives a packet processes it, strips its own header, and passes
the remainder to the next layer. As a result, the physical-level signal is converted into a Data
Link layer packet and then into a Transaction layer packet.
Data transfer is performed using requests and completions. PCIE transactions have four basic
types of requests: message, configuration, I/O, and memory. The interfacing device sends I/O
read/write, memory read/write, and configuration read/write requests, and the PCIE device
responds by sending a completion. A PCIE device can use its Base Address Registers (BARs) to
reserve a memory block in the host system’s memory map. When the OS assigns an address to the
block, the BARs are programmed with that address. For this design, we used BAR0 with a size of
1 megabyte.
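To make the BAR mechanism concrete, here is a minimal Python sketch of how a host could infer
the size of a 32-bit memory BAR. It assumes the standard PCI sizing procedure (write all ones,
read the value back, mask the low four flag bits). The function names are hypothetical, and the
device is simulated rather than read from real hardware.

```python
def bar_readback_after_sizing(size_bytes, flags=0x0):
    """Simulate what a 32-bit memory BAR returns after all ones are written to it.

    The device hard-wires the address bits below its size to zero, so only the
    upper bits keep the ones. The low 4 bits hold read-only type flags.
    """
    assert size_bytes >= 16 and size_bytes & (size_bytes - 1) == 0, "power of two"
    return (~(size_bytes - 1) & 0xFFFFFFF0) | (flags & 0xF)


def bar_size_from_readback(readback):
    """Recover the BAR size in bytes from the sizing read-back value."""
    mask = readback & 0xFFFFFFF0          # drop the flag bits
    return (~mask + 1) & 0xFFFFFFFF       # two's complement gives the size


readback = bar_readback_after_sizing(1 << 20)   # 1 MB, as chosen for BAR0
print(hex(readback), bar_size_from_readback(readback))
```

The same masking explains why a BAR size must be a power of two: the device can only express its
size through the number of low address bits it ties to zero.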
3.5 Protocol layers
The basic data flow through the PCIE layered structure is shown in Figure 3.5. As the figure
shows, PCIE is a bidirectional data transfer protocol. When the user initiates a data transfer,
the transaction layer reads the data from the specified memory location and adds error-checking
or parity bits and a header that points to the destination. The resulting data packet is passed
to the data link layer, which adds sequence information and its own error-checking bits and
passes the packet to the physical layer. The physical layer transmits the packet serially as
bits, according to the number of lanes selected and the speed grade. At the receiver side, the
reverse process takes place.
Figure 3.5 Protocol Layers
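The layering described above can be sketched as nested framing. This is an illustrative Python
model only. The field sizes follow the usual Gen 1 framing (a 2-byte sequence number, a 4-byte
link CRC, and 1-byte start and end symbols), but the header contents, the byte values used for
the framing symbols, and the use of `zlib.crc32` in place of the real link CRC algorithm are
assumptions made for the example.

```python
import struct
import zlib

START, END = b"\xfb", b"\xfd"   # placeholder bytes standing in for framing symbols
HEADER_BYTES = 12               # assumed header length for this sketch


def transaction_layer(payload, header=b"\x00" * HEADER_BYTES):
    """Prepend a placeholder transaction layer header to the payload."""
    return header + payload


def data_link_layer(tlp, sequence):
    """Add a 2-byte sequence number in front and a 4-byte CRC behind the packet."""
    body = struct.pack(">H", sequence & 0x0FFF) + tlp
    crc = zlib.crc32(body) & 0xFFFFFFFF      # stand-in for the real link CRC
    return body + struct.pack(">I", crc)


def physical_layer(link_frame):
    """Frame the packet with start and end markers before serialization."""
    return START + link_frame + END


def receive(frame):
    """Reverse the process: strip framing, check the CRC, strip the header."""
    body, crc = frame[1:-5], struct.unpack(">I", frame[-5:-1])[0]
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return body[2 + HEADER_BYTES:]           # drop sequence number and header


payload = bytes(range(64))
frame = physical_layer(data_link_layer(transaction_layer(payload), sequence=1))
print(len(payload), len(frame), receive(frame) == payload)
```

In this model a 64-byte payload travels in an 84-byte frame, which is one reason the payload
rate on a link stays below its raw rate.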
CHAPTER 4
IMPLEMENTATION OF PCIE
This chapter discusses the hardware and software implementation of the PCIE core setup, along
with the issues encountered and their solutions, to obtain a data transfer rate of 2.5
gigatransfers per second. The implementation is divided into three parts: hardware requirements
and setup, driver installation, and software installation together with IP core generation.
4.1 Hardware
First, we need a server or computer system with an x8 PCIE slot. The project requirement is a
data transmission rate of approximately 2.5 gigabytes per second. To achieve this, we need to use
the PCIE x8 configuration, which requires an x8 slot for the physical connection. Figure 4.1
shows the lab server’s PCIE x8 slot connected to the ML-605 board.
Figure 4.1 Lab server’s PCIE x8 slot connected to ML-605
The other hardware requirements are the Xilinx ML-605 evaluation board, a server with Windows XP
SP3 installed, updated, and working, and a programming cable to program the ML-605 board.
Connect a USB Type-A to Mini-B 5-pin cable from the PC to connector J21 on the ML-605 board.
4.2 Driver
For the Xilinx ML-605 board to be detected by the server, the proper driver must be installed.
Install the latest CP210x VCP Win2K/XP/2K3 drivers for the server from www.silabs.com, and make
sure that the proper version and operating system are selected. In our case, we initially had the
wrong drivers, and it took a few days to find the right download.
Follow the driver installation steps in the ML-605 user guide, ug533.pdf. Once the driver is
installed, the Silicon Labs CP210x USB to UART Bridge (COM3) appears in Device Manager.
Right-click it, go to Properties > Advanced Settings, and make the COM3 port open for
communication.
Figure 4.2 Driver installation and com-port Opening
4.3 Software
Once the hardware is set up and the driver is installed, the device can be configured for PCIE
use. Before that, make sure that the Xilinx ML-605 is connected and working properly. Use any
terminal program to connect to the COM3 serial port of the system. We used Tera Term.
Figure 4.3 A shows the Tera Term connection and port baud rate settings.
Figure 4.3 A Tera-term connections
Set DIP switch S1 on the ML-605 to 1000 (position 4 to position 1) so that the CompactFlash
demo designs work, as shown in Figure 4.3 B.
Figure 4.3 B DIP switch S1 setting (1000)
Insert the CompactFlash card into the ML-605 as shown in Figure 4.3 C, and press switch SW3, the
ACE system reset push button on the ML-605.
Figure 4.3 C Compact flash card insertion
If everything is correct, the built-in system test designs stored on the CompactFlash card are
loaded. These designs can be used to verify the functionality of the board. A menu screen appears
from which you can run the demo programs on the ML-605 to check the functionality of different
modules. This screen looks like Figure 4.3 D.
Figure 4.3 D Built In System Test menu (BIST)
4.4 Logic Core Generations
To make logic cores easy to implement, Xilinx provides the CORE Generator tool, which generates
the code that instantiates a core. I used CORE Generator to instantiate the modules of the PCIE
core, which include the BRAM core and the Clock Generator core. This section describes the
instantiation of these cores and the configuration points that need attention for the complete
setup to work.
4.4.1 Xilinx PCIE core
To generate the PCIE core, follow the steps given in xtp044.pdf. Here, I focus only on the main
steps. Open Xilinx CORE Generator from the Start menu, create a new project folder, and save
everything in this folder. Figure 4.4.1 A shows the new project screen of CORE Generator.
Figure 4.4.1 A New Project screen Core Generator
Next, make sure that you select the right device and speed grade. We are using the Xilinx ML-605
evaluation kit. It has a Virtex-6 xc6vlx240t device in the ff1156 package with speed grade -1.
If any of these parameters is mismatched, the core will not work. Figure 4.4.1 B shows the
device parameter selection for the core.
Figure 4.4.1 B Device Parameter Select
These configuration settings are highly important for the PCIE core to work properly. From here
we can configure the speed grade, the number of lanes, the memory starting location, and the
Base Address Register initialization settings.
Figure 4.4.1 C PCIE core customizing
Select the Virtex-6 Integrated Block for PCI Express, version 1.3, as indicated in
Figure 4.4.1 C. Customize the core for 8 lanes. The data transfer rate is 2.5 GT/s for a
generation-1 PCIE device core and 5.0 GT/s for a generation-2 PCIE device core. Set the
interface clock frequency to 250 MHz. Figure 4.4.1 D shows the parameter settings.
Figure 4.4.1 D Parameter Setting
Then, select Base Address Register BAR0 for the BRAM starting location, and set its width and
size as shown in Figure 4.4.1 F. Here, we used a 32-bit width and a size of 1 megabyte. Uncheck
all other BARs, such as BAR1 and BAR2. BARs serve two purposes. Initially, they serve as a
mechanism for the device to request blocks of address space in the system memory map. After the
BIOS determines what addresses to assign to the device, the BARs are programmed with those
addresses, and the device uses this information to perform address decoding.
Figure 4.4.1 F Base Address Register setting
The next step is to set Vendor ID 10EE, Device ID 6018, Revision ID 00, Subsystem Vendor ID
10EE, and Subsystem ID 0007. Figure 4.4.1 G shows all the settings needed for CORE Generator to
generate the PCIE v1.3 core. Leave all other parameters at their default settings and move to
page 9.
Figure 4.4.1 G Vendor ID
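As a sketch of how these identifiers appear to the host, the following Python snippet packs them
into configuration-space doublewords using the standard PCI layout (Vendor ID in the low half
and Device ID in the high half at offset 0x00; Subsystem Vendor ID and Subsystem ID likewise at
offset 0x2C). The function name is hypothetical, and the values are simply the ones entered
above.

```python
def pack_ids(low_half, high_half):
    """Combine two 16-bit IDs into one 32-bit configuration doubleword."""
    return ((high_half & 0xFFFF) << 16) | (low_half & 0xFFFF)


VENDOR_ID, DEVICE_ID = 0x10EE, 0x6018
SUBSYS_VENDOR_ID, SUBSYS_ID = 0x10EE, 0x0007

dword_00 = pack_ids(VENDOR_ID, DEVICE_ID)            # offset 0x00
dword_2c = pack_ids(SUBSYS_VENDOR_ID, SUBSYS_ID)     # offset 0x2C
print(f"0x00: {dword_00:08X}  0x2C: {dword_2c:08X}")
```

These are the values a host-side tool matches against when it searches for the card by VID and
DID.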
From here, you can select the hardware board on which the IP core will be loaded. In the product
selection menu, select ML-605 as shown in the screenshot in Figure 4.4.1 H.
Figure 4.4.1 H Xilinx development Board selection
The last step is to select the reference clock frequency. The Integrated Block for PCIE allows
you to select a reference clock frequency of 62.5 MHz, 125 MHz, or 250 MHz. Since we had
selected a clock frequency of 250 MHz for the BRAM data, we select the same frequency for the
reference clock here, as indicated in Figure 4.4.1 I.
Figure 4.4.1 I Reference Clock Frequency
Finally, generate the core. Once the core is generated, the screen in Figure 4.4.1 J is displayed
to indicate that CORE Generator has generated the core.
Figure 4.4.1 J PCIE Core generate
You can see the generated Virtex-6 PCIE v1.3 core in the Project IP menu, as shown in Figure
4.4.1 K, and instantiate the component for further use.
Figure 4.4.1 K Project IP
Run the following two scripts from the command prompt to implement the core and create the
routed.bit file, which must be programmed onto the ML-605 board to use the PCIE IP core
component.
Figure 4.4.1 L Script to Generate Core
The script in Figure 4.4.1 L synthesizes and implements the Virtex-6 PCIE v1.3 IP core design.
The simulation result files are generated in a new folder named results.
Figure 4.4.1 M Script to make routed.bit
The script in Figure 4.4.1 M creates routed.bit, a bitstream file used to load the IP core on
the ML-605.
4.5 Programming ML-605
Once the routed.bit file is generated, it needs to be loaded onto the ML-605 board. Follow these
steps to program the ML-605 with the routed.bit file.
Connect a USB Type-A to Mini-B cable to the USB JTAG connector on the ML-605 board.
Set switch S1 to 0xxx (x = don’t care, position 4 to position 1). This disables the
CompactFlash.
Also, set S2 to 011001 (1 = on, position 6 to position 1). This selects Slave SelectMAP
(positions 5, 4, and 3), platform flash (2), and EXT CCLK (1, for PCIE compliance).
Figure 4.5 A shows the S1 and S2 switch settings.
Figure 4.5 A S1-S2 switch settings
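The switch strings above are written from the highest position down to position 1. A small Python
sketch, with a hypothetical helper name, makes that reading order explicit:

```python
def decode_dip(setting):
    """Map a switch string such as '011001' to {position: state}.

    The leftmost character is the highest position; 'x' means don't care.
    """
    states = {"1": "on", "0": "off", "x": "don't care"}
    count = len(setting)
    return {count - i: states[ch] for i, ch in enumerate(setting.lower())}


print(decode_dip("0xxx"))      # S1 setting used while programming
print(decode_dip("011001"))    # S2 setting
```

Reading S2 this way, positions 5 and 4 are on and position 3 is off, which is the group the text
associates with the configuration mode.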
Now, run iMPACT from the results directory that was generated during implementation.
Figure 4.5 B shows iMPACT being launched from the command prompt.
Figure 4.5 B IMPACT
This launches iMPACT to program the ML-605 FPGA. Select the prepare PROM file option as shown in
Figure 4.5 C.
Figure 4.5 C PROM
Now select the PROM file parameters as shown in the screenshots in Figure 4.5 D, Figure 4.5 E,
and Figure 4.5 F, respectively.
Figure 4.5 D Setting PROM Parameters
Figure 4.5 E Setting PROM Parameters
Figure 4.5 F Setting PROM Parameters
Add the routed.bit file to the design as shown in Figure 4.5 G.
Figure 4.5 G loading routed.bit File
Keep the default values for the multi-boot BPI revision and data file assignment, as shown in
Figure 4.5 H.
Figure 4.5 H BPI settings
From the Operations menu, select generate file, as shown in Figure 4.5 I.
Figure 4.5 I Generate file
Select Boundary Scan from the iMPACT Flows menu, as shown in Figure 4.5 J.
Figure 4.5 J Boundary scans
From the File menu, select Initialize Chain, which prompts you to select the device on which to
load the .bit file. Figure 4.5 K shows Initialize Chain.
Figure 4.5 K Initialize Chain
Now select the device xc6vlx240t and right-click it to add an SPI/BPI flash, as shown in Figure
4.5 L.
Figure 4.5 L Select Device xc6vlx240t
Configure the SPI/BPI PROM settings as shown in Figure 4.5 M.
Figure 4.5 M SPI/BPI PROM setting
Finally, right-click the flash and program it. Make sure that the erase-before-programming
option is checked. Figure 4.5 N shows programming the flash.
Figure 4.5 N Programming flash
Finally, we wrote VHDL code to instantiate the PCIE core and use it in our application. The code
and all required software and generated files are provided on the CD attached to the project.
CHAPTER 5
TESTING AND VERIFICATION
The PCIE core component is tested and verified in two ways: functional testing and simulation.
The main goal is to check that the core provides a data transfer rate of 2.5 GT/s.
5.1 PCIE Functional Testing
Here, I used a third-party tool called PCITree to check that the PCIE core can be configured at
the register level and that data can be written to and read from memory using this core.
Download the tool from the website www.pcitree.com and run the .exe file. Then open PCITree from
the Start menu, as shown in Figure 5.1 A.
Figure 5.1 A PCI TREE
This tool shows all the devices connected to the system board through PCI, PCIE, and AGP
connectors. Figure 5.1 B shows that the tool detected the Intel (vendor ID 8086) PCIE bridge
connected to the system.
Figure 5.1 B Intel’s x8086 PCIE bridge device
Now search for the Xilinx PCIE card with the Vendor ID (VID) and Device ID (DID) numbers that we
entered when configuring the core. Figure 5.1 C shows the Xilinx PCIE device.
Figure 5.1 C Xilinx’s ML-605 device
Next, check that we can access the configuration registers and read from and write to the
memory. Using the radio button, set the number of configuration registers to 64 and click
refresh dump. This lists all the configuration registers in the adjacent window, as shown in
Figure 5.1 D.
Figure 5.1 D Configuration Registers
Select any available register to see its location, the value written in it, and the next
location it points to. Figure 5.1 E shows the register value being edited and checked. Here, we
have selected register 48. The tool shows the current configuration value in the edit
configuration block in hex format, and the value of any configuration register can be changed
there. The register also points to the next configuration register location: its value is 6005,
which means that the next configuration register available is at 60.
Figure 5.1 E Configuration registers editing
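The value 6005 can be read as a capability header. Assuming the standard PCI capability list
layout (capability ID in the low byte, pointer to the next capability in the following byte), a
short Python sketch decodes it. The function name is hypothetical, and the name table lists only
a few well-known capability IDs.

```python
CAPABILITY_NAMES = {0x01: "Power Management", 0x05: "MSI", 0x10: "PCI Express"}


def decode_capability_header(value):
    """Split the low 16 bits of a capability header into (ID, next pointer)."""
    cap_id = value & 0xFF
    next_ptr = (value >> 8) & 0xFF
    return cap_id, next_ptr


cap_id, next_ptr = decode_capability_header(0x6005)
name = CAPABILITY_NAMES.get(cap_id, "unknown")
print(f"ID 0x{cap_id:02X} ({name}), next capability at 0x{next_ptr:02X}")
```

Following the next pointer repeatedly until it reads zero walks the whole capability list.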
Several configuration registers provide valuable information about the core, such as the PCIE
generation and the maximum number of lanes configured and supported, from which we can calculate
the data transfer rate. Figure 5.1 F shows these configuration registers. Here, we can see the
Link Capabilities register at 6C, which has the value F481. The last two digits, 81, imply that
the core has an x8 lane, Gen 1 configuration. Based on this, the core has a raw data transfer
rate of 2.5 GT/s per lane, which corresponds to 2.5 GB/s across 8 lanes.
Figure 5.1 F Configuration Register Read
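This reading of F481 can be checked with a short Python sketch. It assumes the Link Capabilities
field layout from the PCI Express base specification (bits 3:0 for maximum link speed, bits 9:4
for maximum link width). The function name is hypothetical.

```python
SPEEDS_GT_S = {1: 2.5, 2: 5.0}   # speed codes for Gen 1 and Gen 2


def decode_link_capabilities(value):
    """Return (max speed in GT/s, max lane count) from the register value."""
    speed_code = value & 0xF
    width = (value >> 4) & 0x3F
    return SPEEDS_GT_S[speed_code], width


speed, lanes = decode_link_capabilities(0xF481)
print(f"x{lanes} at {speed} GT/s per lane, {speed * lanes / 8} GB/s raw")
```

The width field spans more than one hex digit, so masking the bits is safer than reading the
digits by eye when the value is not as convenient as 81.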
Next, check reads and writes to memory through the PCIE core. In the core implementation, we had
selected Base Address Register 0 with a width of 32 bits. This is shown in Figure 5.1 G. Click
Yes, and the tool shows all 1 kilobyte of memory content.
Figure 5.1 G Memory Read (BAR)
As no values have been loaded into the BRAM yet, it should show all zeros or blanks. Figure
5.1 H shows the memory read values, all blank.
Figure 5.1 H Memory Read (Content)
Figure 5.1 I Memory Write
We do not have data from an actual image file yet, so to write to the BRAM with the tool we use
count values, written to the memory by selecting all locations and selecting the count radio
button, as shown in Figure 5.1 I. If a data file is available, it can be loaded directly using
the load file option, as shown in Figure 5.1 J.
Figure 5.1 J Memory writes using file
To write count values into the memory locations, select all the memory locations, tick count,
and click the write memory button. Then refresh the view. Figure 5.1 K shows the resulting hex
values.
Figure 5.1 K Memory writes using count value
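The count-pattern test can be mimicked offline. The sketch below uses a Python bytearray as a
stand-in for the BRAM behind BAR0 (an assumption; no hardware is accessed), writes an
incrementing 32-bit count to each location, and reads it back for comparison. The 1 KB size
matches the window shown by the tool, and the little-endian word layout is assumed.

```python
import struct

WINDOW_BYTES = 1024                      # size of the memory window shown
memory = bytearray(WINDOW_BYTES)         # stand-in for BRAM, initially all zeros


def write_count_pattern(mem):
    """Write 0, 1, 2, ... as little-endian 32-bit words across the buffer."""
    for index in range(len(mem) // 4):
        struct.pack_into("<I", mem, index * 4, index)


def read_words(mem):
    """Read the buffer back as a list of 32-bit words."""
    return [word for (word,) in struct.iter_unpack("<I", bytes(mem))]


blank_before = not any(memory)           # blank before the write, as in the tool
write_count_pattern(memory)
words = read_words(memory)
print(blank_before, words[:4], words[-1], words == list(range(WINDOW_BYTES // 4)))
```

A count pattern is convenient for this kind of check because any dropped, repeated, or swapped
word shows up as a break in the sequence.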
I also checked the core by transferring an image over PCIE and reading it back, using the Xilinx
demo image filtering module. Figure 5.1 L shows this reference design, which we programmed onto
the ML-605. Run the image filter with the identity setting. Both images are identical because no
filtering was applied, which shows that an image can be transferred through the core and
reproduced.
Figure 5.1 L Image Transfer
You can also check the performance of the PCIE core by configuring the Xilinx XAPP1052 DMA
performance demo IP core logic. This application provides demo DMA read and write designs, which
can be loaded onto the board.
We used the xapp1052.zip DMA performance demo module from the Xilinx download site. After
downloading the design and following the xapp1052.pdf document, you can invoke the GUI shown in
Figure 5.1 M, called the DMA Performance Demo GUI. The user can set the data type, TLP size, and
read or write parameters to check.
Figure 5.1 M DMA Performance Demo GUI
We ran the DMA read and write tests to check the actual data transfer rate of the Xilinx Virtex-6
FPGA PCIE v1.7 core. To check read performance, select read and click start in the Figure 5.1 M
screen. As shown in Figure 5.1 N, the DMA performance is half of the number shown on the screen
in Mbps: 47610.61/2 = 23805.30, which is ~2.3 GTPS.
Figure 5.1 N Read Performances
In the same way, to check write performance, select write and click start in the Figure 5.1 M
screen. As shown in Figure 5.1 O, the DMA performance is half of the number shown on the screen
in Mbps: 56109.59/2 = 28054.79, which is ~2.8 GTPS.
Figure 5.1 O Write Performance
5.2 Simulation
Load the design in Xilinx ISE, compile the core, and run the simulation. Before simulating, check
the report for a detailed analysis. The code and all the simulation files are provided on the
attached CD. I used Verilog for component generation and simulation, and Xilinx ISE 13.2 to
simulate the design and check the contents of the configuration registers. Figure 5.2 A shows
the design compilation.
Figure 5.2 A Design compiling
The detailed analysis report is shown in Figure 5.2 B. It shows the number of LUTs, FIFOs, and
other logic components used to form the design.
Figure 5.2 B Detailed Analysis Report
The simulation waveform is generated, but it consists entirely of core-generated outputs, so it
is hard to read and interpret the meaning of every signal. Figure 5.2 C shows the configuration
registers initialized with default values.
Figure 5.2 C Configuration Register Read
Figure 5.2 D BAR Data Read
Because we did not generate high-speed data to be read from memory, it is hard to verify the
transfer rate using ISE simulation. We can, however, check that the registers are initialized
with default values. To see the actual data transfer through the PCIE IP core, we used the
ChipScope Pro tool, opened the project’s .cfg file from the results folder, and placed traces at
two points in the IP core.
Figure 5.2 E Data TLP
Figure 5.2 E shows a data TLP captured by ChipScope at the transaction level of the IP core. It
shows that trn_td carries a data value when trn_tsrc_rdy, which is active low, is asserted.
Similarly, we observed data reads with ChipScope at two points in the physical layer. Figure
5.2 F shows a physical layer memory read packet. The signal phy_rd shows the data read from the
memory.
Figure 5.2 F Physical Layer Memory Read Packet
The second observation point on the physical layer, captured with ChipScope and shown in Figure
5.2 G, has the transaction layer header added to the data in every phy_rd.
Figure 5.2 G Physical Layer Data TLP
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
Xilinx’s Virtex-6 FPGA IP core for PCI Express was generated and implemented for a data transfer
rate of 2.5 gigabits per second per lane. We learned about the detailed architecture and
configuration of the PCIE IP core. The design is implemented on the ML-605 evaluation kit.
The current configuration is running and gives a data transfer rate of ~2.5 GB/s for PCIE Gen 1
with 8 lanes. Using the LogiCORE CORE Generator tool and the Xilinx xapp-1052 and Xapp-045
application software for the PCIE core, we checked image transfer and memory read and write
transactions. For configuring and controlling the PCIE core’s registers, we used a third-party
tool called PCITREE.
Initially, we had issues selecting the right hardware for installing the ML-605 evaluation kit,
mainly finding a PCIE x8 port in the server system. Next came driver installation for Windows XP
SP3 and configuration of Xilinx ISE 13.2, as the latest release, ISE 14.2, had issues with PCIE
core v1.3 and v1.6.
As for future scope, the PCIE core module is the main building block for high-speed data
transfer in our data acquisition system. The ADC and PCIE modules are working. The next steps
are memory mapping and EZDMA coding. After that, the PCIE core generator module can be used at
five different levels in the design, and the project will be complete.
REFERENCES
[1] Xilinx FPGA design modules reference materials, Retrieved on 15 September 2012 from
http://www.xilinx.com/support/documentation/ip_documentation/
v6_PCIE_ug517.pdf /
v6_PCIE_ds715.pdf/
xtp044.pdf/
xtp025.pdf/
ug671.pdf/
[2] Xilinx Virtex 6 IP Block FPGA design modules materials, Retrieved on 25 September 2012
from http://www.xilinx.com/products/ipcenter/V6_PCI_Express_Block.html
[3] PCIE specification reference materials, Retrieved on 15 September 2012 from
http://www.pcisig.com/specifications/pciexpress/
[4] PCIE working and understanding materials, Retrieved on 17 September 2012 from
http://en.wikipedia.org/wiki/PCI_Express/
[5] PCIE working and understanding materials, Retrieved on 15 September 2012 from
http://computer.howstuffworks.com/pci-express.html/
[6] EBook: PCIE reference material, Retrieved on 15 September 2012 from
http://www.mindshare.com/files/ebooks/pci%20express%20system%20architecture.pdf/
[7] Xilinx Virtex 5 IP core PCIE Block design modules materials Retrieved on 20 October 2012
from http://www.em.avnet.com/en-us/design/drc/Pages/Xilinx-Virtex-5-LXTSXT-PCI-Express-
Development-Kit.aspx/
[8] UART and Bridge connectivity driver reference, Retrieved on 15 September 2012 from
http://www.exar.com/connectivity/uart-and-bridging-solutions
[9] Basic Knowledge of data acquisition system By National Instruments, Retrieved on 15
September 2012 from http://www.ni.com/white-paper/3536/en