

INSIDE

New EDK 8.1 Simplifies Embedded Design

Change Is Good

ESL Tools for FPGAs

Algorithmic Acceleration Through Automated Generation of FPGA Coprocessors

Bringing Floating-Point Math to the Masses

Issue 3, March 2006

Endless Possibilities

EMBEDDED SOLUTIONS FOR PROGRAMMABLE LOGIC DESIGNS

Embedded magazine

Enabling success from the center of technology™

1 800 332 8638

www.em.avnet.com

© Avnet, Inc. 2006. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Avnet Electronics Marketing offers a series of technical, hands-on SpeedWay Design Workshops™ that will dramatically accelerate your learning curve on new application solutions, products and technologies like the Philips-Xilinx PCI Express two-chip solution. Our FAE presenters use detailed laboratory exercises and Avnet developed design kits to reinforce the key topics presented during the workshops, ensuring that when you leave the class you will be able to apply newly learned concepts to your current design.

• Every workshop features an Avnet developed design kit

• Workshops are systems and solutions focused

• Design alternatives and trade-offs for targeted applications are discussed

For more information about upcoming Xilinx SpeedWay workshops, visit:

www.em.avnet.com/xilinxspeedway

Support Across The Board.™

Accelerate Your Learning Curve on New Application Solutions

Xilinx Spring 2006 SpeedWay Series

• Xilinx MicroBlaze™ Development Workshop

• Xilinx PowerPC® Development Workshop

• Xilinx Embedded Software Development Workshop

• Introduction to FPGA Design Workshop

• Creating a Low-Cost PCI Express Design Workshop

• Embedded Networking with Xilinx FPGAs Workshop

• Xilinx DSP Development Workshop

• Xilinx DSP for Video Workshop

• Improving Design Performance Workshop

Endless Possibilities

Welcome to our third edition of Xilinx Embedded Magazine. As we prepared this issue, the theme for this year’s Embedded Systems Conference – “Five Days, One Location, Endless Possibilities” – resonated with the array of potential articles. Simply stated, we seemed to have endless possibilities for our embedded solutions to choose from and to share with you.

To capitalize on this theme, our ease-of-use initiative continues with Xilinx® Platform Studio and the Embedded Development Kit (EDK), as we recently released our latest version, 8.1i. This comes on the heels of EDN’s recognition of our 32-bit MicroBlaze™ soft-processor core as one of the “Hot 100 Products of 2005.” Taken together, the MicroBlaze core, the industry-standard PowerPC™ core embedded in our Virtex™ family of FPGAs, and a growing list of IP and supported industry standards offer more options than ever to create, debug, and launch an embedded system for production.

The latest version of our kit serves one of the greatest appeals that the embedded solution holds for our FPGA customers – to create a “just-what-I-needed” processor subsystem that “just works.” In so doing, our customers can concentrate on the added value that differentiates their products in their marketplace. Here again, “endless possibilities” resonates with unlimited design flexibility.

In this issue of Embedded Magazine we offer a collection of diverse articles unlocking the endless possibilities with Xilinx platforms. We welcome industry icon Jim Turley and his clever insight regarding shifts in the embedded industry with his article “Change Is Good.” In addition, our partners Echolab, Impulse, PetaLogix, Poseidon, Teja, Avnet, and Nu Horizons highlight their latest innovations for our embedded platforms. Our own experts provide tutorials on the latest release of EDK 8.1, along with a background look at the newly launched Xilinx ESL Initiative.

Join us as we plumb the depths of these exciting new embedded solutions. I’m sure you’ll find our third edition of Embedded Magazine informative and inspiring as we endeavor to help you unlock the power of Xilinx programmability. The advantages are enormous, the possibilities … endless!

Mark Aaldering
Vice President
Embedded Processing & IP Divisions

Embedded magazine

PUBLISHER Forrest [email protected]

EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair

ADVERTISING SALES Dan Teie 1-800-493-5551

www.xilinx.com/xcell/embedded

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780

© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

YOU CAN DELIVER INNOVATION

Time to market, developer and programmer productivity, choice in fabrication facilities and EDA retooling costs for smaller and smaller geometries are all putting tremendous strain on system development groups around the globe.

Enter IBM. Whether your design priorities are low power or high performance, or both, IBM’s Power Architecture™ microprocessors and cores can help you accelerate innovation in your designs. Find out what the world's fastest supercomputer, Internet routers and switches, the Mars Rover, and the next generation game consoles all have in common. For more information visit ibm.com/power

IBM, the IBM logo and the On Demand Business logo are registered trademarks or trademarks of International Business Machines Corporation in the United States and/or other countries. Other company, product and service names may be trademarks or service marks of others. © 2006 IBM Corporation. All rights reserved.

CONTENTS

EMBEDDED MAGAZINE ISSUE 3, MARCH 2006

Welcome

Contents

ARTICLES

New EDK 8.1 Simplifies Embedded Design

Change Is Good

Implementing Floating-Point DSP

ESL Tools for FPGAs

Algorithmic Acceleration Through Automated Generation of FPGA Coprocessors

Generating Efficient Board Support Packages

Bringing Floating-Point Math to the Masses

Packet Subsystem on a Chip

Accelerating FFTs in Hardware Using a MicroBlaze Processor

Eliminating Data Corruption in Embedded Systems

Boost Your Processor-Based Performance with ESL Tools

CUSTOMER SUCCESS

Unveiling Nova

WEBCASTS

Implementing a Lightweight Web Server Using PowerPC and Tri-Mode Ethernet MAC in Virtex-4 FX FPGAs

BOARDS

Development Kits Accelerate Embedded Design

PRODUCTS

MicroBlaze – The Low-Cost and Flexible Processing Solution

New EDK 8.1 Simplifies Embedded Design

Platform Studio enhancements streamline processor system development.

by Jay Gould
Product Marketing Manager, Xilinx Embedded Solutions Marketing
Xilinx, [email protected]

After achieving an industry milestone, what’s next? In 2005, the Xilinx® Platform Studio tool suite (XPS) included in the Embedded Development Kit (EDK) won the IEC’s DesignVision Award for innovation in embedded design. The revolutionary approach of design wizards brought abstraction and automation to an otherwise manual and error-prone development process for embedded system creation.

The year 2006 brings a new version 8.1 update to the Platform Studio tool suite, with an emphasis on simplifying the development process and providing a more visible environment. The result is a shortened learning curve for new users and an even more complete and easier-to-use environment for existing designers.

Just getting a complex design started can take a significant amount of time out of a critical schedule, so Xilinx started with a premise that the first steps to a working core design should be automated. The Xilinx Base System Builder design wizard within the Platform Studio tool suite provides a step-by-step interface to walk you through the critical first stages of a design. Design wizards are a great innovation because they can provide a quick path to a working core design even if you have minimal expertise. The “smarter” the install wizard is, the fewer issues occur, and the less experience you need to have.

Pre-configured hardware/software development kits are also extremely valuable for getting a design “off the napkin” and into a quick but stable state. Xilinx hardware/software development kits provide working hardware boards, hardware-aware tools, and pre-verified reference designs. The benefit here is that you can power up hardware, download a working design to a board, and start investigating a “working” core system in a very short period of time, skipping past the delays and complexities of debugging new hardware, new firmware, and new software all at the same time.

A majority of the embedded design cycle, before full system verification, is spent iterating on the core design, incrementally introducing new features, adding individual capabilities, and repeatedly debugging after each step. Because this is excessively tedious and time consuming, this stage should be as easy and streamlined as possible. Version 8.1 has a focus on making common (and repetitive) tasks simple and intuitive, benefiting both new and existing users.

All Users Benefit from V8.1

Xilinx has updated the main user interface of Platform Studio to provide an intuitive feel for both hardware and software engineers, making multiple views and customization easy for all. The integrated development environment (IDE) in Figure 1 displays a wide array of information, but also allows you to filter views and customize the toolbars. The left-hand pane provides an industry-standard “tab” method of displaying or hiding information panels on the design “Project,” “Applications,” or “IP Catalog.” Just toggle on the tab of choice to display the contents of that pane.

Figure 1 – New 8.1 Platform Studio GUI

The “Project” tab contains a variety of helpful information about the design, including specific Xilinx device selection and settings (for example, a specific Virtex™-4 or Virtex-II Pro device with one or two PowerPC™ processor cores) and project file locations (hardware and software project descriptions as well as log and report files for steps like synthesis), as well as simulation setup details.

You can view software applications under the “Applications” tab, which provides access to all of the C source and header files that make up the embedded system design. This view also provides views of the compiler options and even the block RAM initialization process.

The “IP Catalog” tab contains in-depth information about the IP cores created, bought, or imported for the design. Xilinx provides several scores of processing IP cores in the Embedded Development Kit software bundle as well as some high-value cores for time-limited evaluations. You can research Xilinx processor IP at www.xilinx.com/ise/embedded/edk_ip.htm.

The middle panel is the “Connectivity” view, and the adjacent panel to the right of that is the associated “System Assembly” view. The connectivity view gives a clear visual of the design busing structure and also provides a dynamic tool for creating new or editing existing connections. The color-coded view quickly makes the specifics of the bus type, and how it might relate to IP, clear even to novice users. For example, in this view, peripherals connected to the PLB (processor local bus) are presented in orange; OPB (on-chip peripheral bus) connections are green; and point-to-point connections with a processor core, in this case the PowerPC 405, are in purple. The panel “filter” buttons allow you to customize or simplify the connection views so that you can focus on specific bus elements without the distraction of other elements.

Platform Studio reduces the errors that a designer might make by maintaining correct connections by construction – that is, XPS will only display connection options for compatible bus types. This avoids the debugging headaches that come with tools that allow incompatible connections.
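The idea of offering only compatible connections can be illustrated with a small sketch. The Python model below is hypothetical and says nothing about how XPS is actually implemented: the `BUS_INTERFACES` table, the `example_*` core names, and the `connection_options` function are assumptions made purely for illustration. Only `opb_uartlite` and the PLB/OPB bus names come from the article.

```python
# Hypothetical model of "correct by construction" bus connections.
# The core-to-bus assignments are illustrative assumptions, not EDK data.
BUS_INTERFACES = {
    "opb_uartlite": {"OPB"},
    "example_plb_core": {"PLB"},
    "example_dual_core": {"OPB", "PLB"},
}


def connection_options(core, buses_in_design):
    """Return only those buses in the design that the core can attach to."""
    supported = BUS_INTERFACES.get(core, set())
    return sorted(bus for bus in buses_in_design if bus in supported)


if __name__ == "__main__":
    design_buses = ["PLB", "OPB"]
    for core in BUS_INTERFACES:
        print(core, "->", connection_options(core, design_buses))
```

Because incompatible buses never appear in the returned list, an invalid connection cannot be selected in the first place, which is the property the text describes.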

The system assembly view (see Figure 2) more clearly displays an example of dynamic system construction using “drag-and-drop connectivity instantiation.” In the figure, the gray highlighted “opb_uartlite” IP core is selected on the left panel from the IP Catalog and has been dragged and dropped into the right assembly window, creating a new OPB bus connection option automatically; just mouse-click to connect. The views on the right also provide helpful information such as IP types for perusing and IP version numbers for project version control. Now, at a glance, you can distinguish the system structure without reading reams of documentation.

However, if design documentation is what your project and team require, Platform Studio 8.1 has the powerful capability to generate full design-reference material, including a full block diagram view of the system elements and their interconnections. This automatic generation of the docs saves valuable time (instead of creating the materials manually) and reduces errors by creating the materials directly from the design. This method keeps the docs and the design accurately in sync as well as displaying a clear high-level view of the entire project.

New Enhancements Help Existing Users

Current Platform Studio users will be pleased to see advances in the support of sophisticated software development, IP support, and the migration or upgrades of older designs. Figure 3 is an example of what the IP Catalog tab might look like for a design, including all IP cores categorically grouped on the left-hand side by logical names. The specific IP cores will display a version number for design control as well as a brief language description if the names are too brief for context. This view allows you to manage your old and current IP as well as future IP upgrades (more powerful versions of cores with more features but often faster and smaller in size).

Additional information is available as well, such as which processor cores the IP supports. Because Xilinx offers flexible support for both high-performance PowerPC hard and flexible MicroBlaze™ soft-processor cores, it is useful to know which IP cores are dedicated to one processor, the other, or both. In fact, a right mouse-click on the IP from the catalog yields quick access to the IP change history as well as complete PDF datasheets on the specifics. Software drivers for the peripherals have a similar platform settings view for clarity, including version control and embedded OS support.

When a new version of tools and IP becomes available, the upward design migration ought to be as painless as possible. Nobody wants to reinvest design, debugging, and test time to move an older design to a newer set of tools or IP. However, there are often great advantages in new IP/tools that make it advantageous to upgrade. Platform Studio 8.1 has a migration capability (Figure 4) that steps you through a wizard to automate and accelerate the process.

XPS 8.1 can browse existing design projects, flag out-of-date projects and IP cores, and then walk you through the process of confirming automated updates to the new IP and project files. The migration wizard updates the project description files and summarizes the migration changes in document form. Minimizing labor-intensive steps means that you can take advantage of new advancements without as much manual re-entering or porting of designs.

Savvy software developers working on more sophisticated code applications will be happy with the enhancements to the XPS Software Development Kit IDE, based on Eclipse. The XPS-SDK has an enhanced toolbar that more logically groups similar functions and buttons while still allowing user customization.


Figure 2 – System assembly view

Figure 3 – XPS IP catalog

Figure 4 – XPS design migration

Version 8.1 introduces a more powerful C/C++ editor supporting code folding of functions, methods, classes, structures, and macros, as well as new compiler advancements. This new support provides the ability to specify linker scripts and customized compiler options for PowerPC and MicroBlaze processor cores, plus a C++ class creation wizard. Combine this powerful software environment with the innovative performance profiling views and unique XPS capability of integrated hardware/software debuggers, and 8.1 users will be creating better, more powerful embedded systems in less time than ever before.

Conclusion

The award-winning Platform Studio has already streamlined embedded system design. Automated design wizards and pre-configured hardware/software development kits help kick-start designs while reducing errors and tail-chasing.

Now that we have an industry-proven success in ramping up the “getting started” process, it is time to improve the time-consuming and cyclical nature at the heart of the development process. Create – Debug – Edit – Repeat. Have you ever used a computer-aided tool where most of the steps were intuitive? Where you could guess what a button did before you read the manual or saw a screen in which the contents were all self-evident?

EDK/XPS version 8.1 focuses on ease-of-use improvements across the board, including enhancements to the main user interface, the software development environment (including editing and compiling), the upgrading of IP, the migrating of old projects, documenting designs, viewing and editing bus-based systems, and much more.

By making common tasks simple and intuitive, we can make designing a little bit easier for experienced embedded engineers as well as those brand-new to designing with processors in programmable FPGA platforms. Use the extra time saved during the development process to innovate your own embedded products.

For more information about EDK version 8.1 and all of our embedded processing solutions, visit www.xilinx.com/edk.


What’s New

To complement our flagship publication Xcell Journal, we’ve recently launched three new technology magazines:

• Embedded Magazine, focusing on the use of embedded processors in Xilinx® programmable logic devices.

• DSP Magazine, focusing on the high-performance capabilities of our FPGA-based reconfigurable DSPs.

• I/O Magazine, focusing on the wide range of serial and parallel connectivity options available in Xilinx devices.

In addition to these new magazines, we’ve created a family of Solution Guides, designed to provide useful information on a wide range of hot topics such as Broadcast Engineering, Power Management, and Signal Integrity. Others are planned throughout the year.

See all of the new publications on our website: www.xilinx.com/xcell

Change Is Good

Hardware designers and software designers can’t often agree, but there is a middle ground that both might enjoy.

by Jim Turley
Editor in Chief, Embedded Systems Design
CMP Media [email protected]

If you shout “microprocessor” in a crowded theatre, most people will think “Pentium.” Intel’s famous little chip has captured the public imagination to the point where many people think 90% of all the chips made come streaming out of Intel’s factories.

Nothing could be further from the truth. The fact is, Pentium accounts for less than 2% of all of the microprocessors made and sold throughout the world. The lion’s share – that other 98% – are processors embedded into everyday appliances, automobiles, cell phones, washers, dryers, DVD players, video games, and a million other “invisible computers” all around us. PCs are a statistically insignificant part of the larger world – and Pentium sales are a rounding error.

Take heart, embedded developers, for though you may toil in obscurity, your deeds are great, your creations mighty, and your number legion. With few exceptions, most engineers are embedded systems developers. We’re the rule, not the exception.

Processors As Simulators

What is exceptional is the number of different ways we approach a problem. All PCs look pretty much the same, but there’s no such thing as a typical embedded system. They’re all different. We don’t standardize on one operating system, one processor (or even processor family), or one power supply, package, or peripheral mix. Among 32-bit processors alone there are more than 100 different chips available from more than a dozen different suppliers, each one with happy customers designing systems around them. Hardly a homogenous group, are we?

There’s even a school of thought that microprocessors themselves are a mistake – a technical dead-end. The theory goes that microprocessors merely simulate physical functions (addition, subtraction, FFT analysis), rather than performing the function directly. Decoding and executing instructions, handling interrupts, and calculating branches is all just overhead. A close look at any modern processor chip would seem to bear this theory out: only about 15% of the chip’s transistors do any actual work. The rest are dedicated to cracking opcodes, handling flag bits, routing buses, managing caches, and other effluvia necessary to make the hardware do what the software tells it to do.

The only reason processors were ever invented in the first place (so the thinking goes) is because they were more malleable than “real” hardware. You could change your code over time – but you couldn’t change your hardware.

But that isn’t true any more.

Following this line of reasoning, the right approach is to do away with processors and software altogether and implement your functions directly in hardware. Forget that 85% of processor overhead logic and get right down to the nuts and bolts. Make every one of those little transistors work for a living. And hey, if you change your mind, you can change your hardware – if it’s programmable.

More to the point, it’s no longer an either/or decision. The two disciplines are not mutually exclusive; engineering is not a zero-sum game. We don’t have to come down firmly on the side of hardware or software; we can straddle the middle ground as it suits us. When your hardware is programmable, you can choose to “program” it or “design” it using traditional circuitry methods. Take your pick. Let whimsy or convention be your guide.

Engineers, like most craftsmen, place great stock in their tools. A recent survey revealed that most developers choose their tools (compiler, logic analyzer, IDE) first and the “platform” they work on second. For example, they let their choice of compiler determine their choice of processor, not vice versa. The hardware – a microprocessor, generally – is treated as a canvas or work piece on which they ply their trade. This comes as a bit of a blow to some of the more traditional microprocessor makers, who’d always assumed that the world revolved around their instruction set.

The takeaway from this part of the survey was that keeping developers in their comfort zone is paramount. Engineers don’t like to modify their skills or habits to accommodate someone else’s hardware. Instead, the hardware should adapt to them. In the best case, the hardware should even adapt to a code jockey one day and a circuit snob the next. Different tools for different approaches, but with one goal in mind: to create a great design within time and budget (and power, and heat, and pinout, and cost, and performance) constraints.

There hasn’t been anything to accommodate this flexibility until pretty recently. Hardware was hardware; code was code. But with “soft processors” in FPGAs living alongside seas of gates and coprocessors, we’ve got the ultimate canvas for creative developers. Whether it’s VHDL or C++, these new chips can be customized in whatever way suits you. They’re as flexible as any software program, and as fast and efficient as “real” hardware implementations. We may finally have achieved the best of both worlds.

The Malleable Engineer

So now we’re faced with the proverbial (and overused) paradigm shift. We can toss out everything we know about programming, operating systems, software, real-time code, compilers, boot loaders, and bit-twiddling and go straight to hard-wired hardware implementations.

Or not.

Maybe we like programming. There’s something about software design that appeals to the inner artist in us. It’s a whole different way of thinking compared to hardware design, at least for a lot of engineers. Software is like poetry; hardware is like architecture. There’s plenty of bad poetry because anyone can do it, but you don’t see people tossing up buildings just to see if they stand. Programming requires much less discipline and training than hardware engineering. That’s why there are so many programmers in the world.

This is a good thing. Really. The easier it is to enter the engineering profession, the more (and better) engineers we’re likely to have. And since hardware- and software-design mindsets are different, we get to draw from a bigger cross section of the populace. Variety is good.


by Jiří Kadlec
DSP Researcher
UTIA Prague, Czech Republic
[email protected]

Stephen P.G. Chappell
Director, Applications Engineering
Celoxica Ltd., Abington, [email protected]

For developers using FPGAs for the implementation of floating-point DSP functions, one key challenge is how to decompose the computation algorithm into sequences of parallel hardware processes while efficiently managing data flow through the parallel pipelines of these processes.

In this article, we’ll discuss our experiences exploring architectures with Xilinx® PicoBlaze™ controllers, and present a design strategy employing the ESL techniques of model-based and C-based design to demonstrate how you can rapidly integrate highly parameterizable DSP hardware primitives into power-efficient high-performance implementations in Spartan™ devices.

Hardware Acceleration and Reuse

High-performance implementations of floating-point DSP algorithms in FPGAs require single-cycle parallel memory accesses and effective use of pipelined arithmetic operators. Many common DSP vector and matrix operations can be split into batch calculations fulfilling these requirements. Our architectures comprise Xilinx PicoBlaze worker processors, each with a dedicated DSP hardware accelerator (Figure 1). Each worker can do preparatory tasks for the next batch in parallel with its hardware accelerator. Once the DSP hardware accelerator finishes the computation, it issues an interrupt to the worker. The worker’s job is to combine the accelerated parts of the computation into a complete DSP algorithm.
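To see why it helps to let a worker prepare the next batch while its accelerator is busy, a rough timing model is useful. The Python sketch below is an assumption-laden illustration, not a measurement: the function names and the cycle counts in the usage example are hypothetical, and the model ignores interrupt latency and any contention for the block RAMs.

```python
def sequential_cycles(n_batches, prep, accel):
    """Total cycles if the worker prepares a batch, then waits for the accelerator."""
    return n_batches * (prep + accel)


def overlapped_cycles(n_batches, prep, accel):
    """Total cycles if preparation of batch i+1 overlaps acceleration of batch i."""
    if n_batches == 0:
        return 0
    # The first preparation and the last acceleration cannot be hidden; in
    # between, each step costs whichever of the two activities is slower.
    return prep + (n_batches - 1) * max(prep, accel) + accel


if __name__ == "__main__":
    # Hypothetical figures: 100 cycles of preparation, 300 cycles per batch.
    print(sequential_cycles(4, 100, 300))
    print(overlapped_cycles(4, 100, 300))
```

In this simplified model the preparation time disappears from the total whenever it is shorter than the accelerator's batch time, apart from the very first batch.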

It is ideal if you limit implementations to the batch operations of each worker starting in a block RAM, performing a relatively simple sequence of pipelined operations at the maximum clock speed and returning the result(s) back to another block RAM. You can effectively map these primitives to hardware, including the complete autonomous data-flow control in hardware. You can also code the related dedicated generators of address counters and control signals in Handel-C, using several synchronized do-while loops. Simulink is effective for fast derivation of bit-exact models of the batch calculations in DSP hardware accelerators.

Floating-Point Processor on a Single FPGA

Let’s consider an architecture for the evaluation of a 1024 x 1024 vector product in 18m12 floating point. (In the format AmB, A is for the word length and B is for the number of bits in the mantissa, including the leading hidden bit representing 1.0.)
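The AmB notation can be unpacked into individual field widths with a little arithmetic. The sketch below assumes an IEEE 754-style layout with a single sign bit and a hidden leading bit that is not stored; the article does not spell out the exponent width, so the `amb_fields` helper and its result for 18m12 are an inference. The assumption is at least consistent with 32m24 and 64m53, which reproduce the standard single- and double-precision layouts.

```python
def amb_fields(word_length, mantissa_bits):
    """Split an AmB format into (sign, exponent, stored fraction) bit widths.

    Assumes one sign bit and that the hidden leading mantissa bit is not
    stored, as in IEEE 754 binary formats.
    """
    sign = 1
    fraction = mantissa_bits - 1          # hidden bit is implied, not stored
    exponent = word_length - sign - fraction
    if fraction < 0 or exponent < 1:
        raise ValueError("format leaves no room for an exponent field")
    return sign, exponent, fraction


if __name__ == "__main__":
    for name, (a, b) in {"18m12": (18, 12), "32m24": (32, 24), "64m53": (64, 53)}.items():
        print(name, amb_fields(a, b))
```

Under these assumptions an 18m12 word would hold 1 sign bit, 6 exponent bits, and 11 stored fraction bits, which fits the 18-bit-wide block RAM words described in the article.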

We implemented this architectureusing five PicoBlaze processors on a singleFPGA: one master and four simplifiedworkers (Figure 1). The master is connect-

Implementing Floating-Point DSP


Using PicoBlaze processors for high-performance, power-efficient, floating-point DSP.

ed to the workers by I/O-mapped dual-ported block RAMs organized in 2,048 8-bit words. The master maintains the real-time base with 1 µs resolution and provides RS232 user-interface functions. Each worker serves as a controller to a dedicated floating-point DSP hardware accelerator connected through three dual-ported block RAMs organized in 1,024 18-bit words. Two block RAMs hold source vectors and one holds results data. In this case, the workers perform one quarter of the computation each, namely a 256 x 256 vector product. The DSP hardware accelerators are implemented in hardware, from block RAM data source to block RAM data sink, using one 18m12 multiplier (FP MUL) and one 18m12 adder (FP ADD).

[Figure 1 diagram labels: PicoBlaze Master; PicoBlaze Workers; DSP Hardware Accelerators; Dual-Port Block RAMs]

Figure 1 – PicoBlaze-based architecture for floating-point DSP. The DSP hardware accelerators are modeled and implemented using Celoxica DK.

Scalable, Short-Latency Floating-Point Modules
We used a newly released version of a scalable, short-latency pipelined floating-point library from Celoxica to build our DSP hardware accelerators. Table 1 considers some of the parameterizations of this FPGA vendor-independent library to the formats 18m12, 32m24, and 64m53. The library includes IEEE754 rounding, including the round to even. It provides bit-exact results to the Xilinx LogiCORE™ floating-point operators (v2.0), with latency set to approximately one half. The resulting maximum system clock is compatible with PicoBlaze and MicroBlaze™ embedded processors.

Part             18m12                32m24                64m53
xc2v1000-4       125 MHz              110 MHz              100 MHz
xc3s1500L-4      84 MHz               84 MHz               72 MHz

Pipelined        FF    LUT   Pipe     FF    LUT   Pipe     FF    LUT   Pipe
ADD              834   793   5        1158  1290  5        1686  2007  5
MULT             639   488   3        967   626   4        2029  1256  5
F2FIXPT          581   637   4        649   744   4        808   1053  4
FIXPT2F          695   709   6        792   787   6        1008  946   6

Sequential       FF    LUT   Cycles   FF    LUT   Cycles   FF    LUT   Cycles
DIV              739   605   17       987   772   29       2119  1143  58
SQRT             766   604   16       1053  802   28       1729  1303  57

Table 1 – Used flip-flops, LUTs, pipeline/latency, and maximum clock for Celoxica floating-point modules in system implementations in the Celoxica RC200E (Virtex-II FPGA) and RC10 (Spartan-3L FPGA) boards. Modules are pipelined, with the exception of DIV and SQRT.

Simulink and the DK Design Suite
Our design flow is based on the bit-exact modeling of Handel-C floating-point units in a Simulink framework, where the Handel-C is developed in the DK Design Suite combined simulation and synthesis environment. This enabled us to decompose a floating-point algorithm into a sequence of simple operations with rapid development and testing of different combinations.

Step 1: Model in Simulink
First, we built a model of the DSP hardware accelerator in Simulink (Figure 2). The data sources and sinks in this model will be the block RAMs shared with the PicoBlaze worker in the final implementation.

Because the FPGA floating-point operations are written in cycle-accurate and bit-exact Handel-C, we benefited from a single source for both implementation and simulation. For modeling, we exported the Handel-C functions to S-functions using Celoxica’s DK Design Suite. We then incorporated these into a bit-exact Simulink model. In this fast functional simulation, we use delay blocks in Simulink to model pipeline stages (see the 5-stage pipeline of the FP ADD operator and related registers in Figure 2). We used separate Simulink subsystems to model the bit-exact operation of the final “pipeline flushing,” or “wind-up operation.” In this case, six partial sums have to be added by a single reused FP ADD module (Figure 3). The corresponding hardware computes the final sum of the partial sums by reconnecting the pipelined floating-point adder to different contexts for several final clock cycles.
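The wind-up step can be illustrated with a small software model. This is only a sketch under stated assumptions: it assumes the six partial sums arise from feeding products round-robin into six accumulation lanes, and the names `interleaved_partial_sums`, `wind_up`, and `LANES` are hypothetical, not taken from the Handel-C source. It uses ordinary double-precision arithmetic rather than the 18m12 format.

```python
from typing import List, Sequence

LANES = 6  # assumed number of partial sums in flight, as quoted in the text


def interleaved_partial_sums(x: Sequence[float], y: Sequence[float],
                             lanes: int = LANES) -> List[float]:
    """Accumulate products round-robin into `lanes` independent partial sums.

    Product i is added to lane i % lanes, so no addition has to wait for
    the result of the addition issued immediately before it.
    """
    partial = [0.0] * lanes
    for i, (a, b) in enumerate(zip(x, y)):
        partial[i % lanes] += a * b
    return partial


def wind_up(partial: Sequence[float]) -> float:
    """Fold the partial sums into one value, one addition at a time,
    as a single reused adder would."""
    total = 0.0
    for p in partial:
        total += p
    return total


def vector_product(x: Sequence[float], y: Sequence[float],
                   lanes: int = LANES) -> float:
    """Dot product computed as interleaved accumulation plus wind-up."""
    return wind_up(interleaved_partial_sums(x, y, lanes))
```

Because floating-point addition is not associative, the interleaved order can round differently from a strictly sequential sum, which is one reason a bit-exact model of the wind-up is useful.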

Step 2: Cycle-Accurate Verification
Our next stage was to create test vectors using Simulink and feed these into a bit-exact and cycle-accurate simulation of the DSP hardware accelerator in the DK Design Suite’s debugger. Once we confirmed identical results for both the DK and Simulink models, we compiled the Handel-C code to an EDIF netlist.

Step 3: Hardware Test
We took advantage of a layered design approach by using a single communication API for data I/O functions that applies to both simulation and implementation. This allowed us to verify the DSP hardware accelerator design on real FPGA hardware by “linking” with an appropriate board support library for implementation. We can optionally insert this hardware test back into the Simulink model for hardware-in-the-loop simulations. The test on FPGA hardware provides reliable area and clock figures.

Step 4: Create Reusable Module and Connect to Worker
Finally, we treated the verified block RAM-to-block RAM DSP hardware accelerator as a new module and integrated it into our main design by compiling the Handel-C to EDIF or RTL using the DK Design Suite. This reusable module is connected to the PicoBlaze network by wiring the ports of the block RAMs and the appropriate enable and controller interrupt signals. At this stage we tested the function of the DSP hardware accelerator under worker control using memory dump user support from the master.

Step 5: Develop Complete DSP Design
We next assembled the complete design of workers and master, moving to assembly programming of individual PicoBlaze workers and their interactions.

Performance Results
Test results using the Celoxica RC200E (Virtex™-II FPGA) and RC10 (Spartan™-3L FPGA) boards are shown in Table 2. It is interesting to compare the power consumption of the PicoBlaze network architecture on Virtex-II devices (RC200E) with the identical design on the low-power 90 nm Spartan-3L device (RC10). The latter part gives a highly favorable floating-point performance-to-power ratio.
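The performance-to-power comparison can be worked out directly from the Table 2 figures. The short sketch below does only that arithmetic; the function and variable names are illustrative, and MFLOPS per watt is chosen here as the ratio because the text does not define one.

```python
# Figures copied from Table 2: (clock in MHz, MFLOPS, power in mW).
TABLE_2 = {
    "xc2v1000-4": (100, 700, 1360),
    "xc3s1500L-4": (84, 588, 263),
}


def mflops_per_watt(mflops: float, milliwatts: float) -> float:
    """Convert a throughput/power pair into MFLOPS per watt."""
    return mflops / (milliwatts / 1000.0)


def efficiency_table() -> dict:
    """MFLOPS per watt for every part listed in TABLE_2."""
    return {part: mflops_per_watt(mflops, mw)
            for part, (_mhz, mflops, mw) in TABLE_2.items()}


def efficiency_ratio() -> float:
    """How many times more MFLOPS per watt the Spartan-3L part delivers."""
    eff = efficiency_table()
    return eff["xc3s1500L-4"] / eff["xc2v1000-4"]


if __name__ == "__main__":
    for part, value in efficiency_table().items():
        print(f"{part}: {value:.0f} MFLOPS/W")
    print(f"ratio: {efficiency_ratio():.2f}x")
```

On these figures the Virtex-II part delivers roughly 515 MFLOPS/W and the Spartan-3L part roughly 2,236 MFLOPS/W, a ratio of about 4.3 in favor of the Spartan-3L device.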

Conclusion
With minimal overhead, PicoBlaze workers add flexibility to floating-point DSP hardware accelerators by their ability to call and reuse software functions (even if in assembly language only). Our proposed architecture enables more flexible and generic floating-point algorithms without the additional increase of hardware complexity associated with hardware-only implementations due to irregularities and complex multiplexing of pipelined structures. PicoBlaze cores are compact, simple, and therefore manageable, without designers needing to combine too many new skills.

The use of floating-point designs developed using the DK Design Suite in combination with a Simulink framework provides an effective design path that is relatively easy to debug and scalable to more complex designs.

Spartan-3L technology considerably reduces power consumption compared to Virtex-II devices. Considering the benefits in terms of performance/power/price, Spartan-3L FPGA implementations of floating-point DSP pipelines using networks of PicoBlaze processors are an interesting option.

You can find complete information on the design and technology discussed in this article at www.celoxica.com/xilinx.


Part           MHz   MFLOPs   mW
xc2v1000-4     100   700      1360
xc3s1500L-4    84    588      263

Table 2 – Results for 1024 x 1024 vector product in 18m12 floating point on the Celoxica RC200E (Virtex-II FPGA) and RC10 (Spartan-3L FPGA) boards

Figure 3 – Simulink subsystem based on Handel-C bit-exact models, including delay model of calculation wind-up at the end of the vector product batch

Figure 2 – Simulink test bench for floating-point 18m12 vector product based on Handel-C bit-exact models

by Milan Saini
Technical Marketing Manager
Xilinx, [email protected]

A fundamental change is taking place in the world of logic design. A new generation of design tools is empowering software developers to take their algorithmic expressions straight into hardware without having to learn traditional hardware design techniques.

These tools and associated design methodologies are classified collectively as electronic system level (ESL) design, broadly referring to system design and verification methodologies that begin at a higher level of abstraction than the current mainstream register transfer level (RTL). ESL design languages are closer in syntax and semantics to the popular ANSI C than to hardware languages like Verilog and VHDL.

How is ESL Relevant to FPGAs?
ESL tools have been around for a while, and many perceive that these tools are predominantly focused on ASIC design flows. The reality, however, is that an increasing number of ESL tool providers are focusing on programmable logic; currently, several tools in the market support a system design flow specifically optimized for Xilinx® FPGAs. ESL flows are a natural evolution for FPGA design tools, allowing the flexibility of programmable hardware to be more easily accessed by a wider and more software-centric user base.

Consider a couple of scenarios in which ESL and FPGAs make a great combination:

1. Together, ESL tools and programmable hardware enable a desktop-based hardware development environment that fits into a software developer’s workflow model. Tools can provide optimized support for specific FPGA-based reference boards, which software developers can use to start a project evaluation or a prototype. The availability of these boards and the corresponding reference applications written in higher level languages makes creating customized, hardware-accelerated systems much faster and easier. In fact, software programmers are now able to use FPGA-based reference boards and tools in much the same way as microprocessor reference boards and tools.

2. With high-performance embedded processors now very common in FPGAs, software and hardware design components can fit into a single device. Starting from a software description of a system, you can implement individual design blocks in hardware or software depending on the applications’ performance requirements. ESL tools add value by enabling intelligent partitioning and automated export of software functions into equivalent hardware functions.

ESL promotes the concept of “explorative design and optimization.” Using ESL methodologies in combination with programmable hardware, it becomes possible to try a much larger number of possible application implementations, as well as rapidly experiment with dramatically different software/hardware partitioning strategies. This ability to experiment – to try new approaches and quickly analyze performance and size trade-offs – makes it possible for ESL/FPGA users to achieve higher overall performance in less time than it would take using traditional RTL methods.

Additionally, by working at a more abstract level, you can express your intent using fewer keystrokes and writing fewer lines of code. This typically means a much faster time to design completion, and less chance of making errors that require tedious, low-level debugging.

ESL’s Target Audience
The main benefits of ESL flows for prospective FPGA users are their productivity and ease-of-use. By abstracting the implementation details involved in generating a hardware circuit, the tools are marketing their appeal to a software-centric user base (Figure 1). Working at a higher level of abstraction allows designers with skills in traditional software programming languages like C to more quickly explore their ideas in hardware. In most instances, you can implement an entire design in hardware without the assistance of an

ESL Tools for FPGAs

Empowering software developers to design with programmable hardware.

experienced hardware designer. Software-centric application and algorithm developers who have successfully applied the benefits of this methodology to FPGAs include systems engineers, scientists, mathematicians, and embedded and firmware developers.

The profile of applications suitable for ESL methodologies includes computationally intensive algorithms with extensive inner-loop constructs. These applications can realize tremendous acceleration through the concurrent parallel execution possible in hardware. ESL tools have helped with successful project deployments in application domains such as audio/video/image processing, encryption, signal and packet processing, gene sequencing, bioinformatics, geophysics, and astrophysics.

ESL Design Flows
ESL tools that are relevant to FPGAs cover two main design flows:

1. High-level language (HLL) synthesis. HLL synthesis covers algorithmic or behavioral synthesis, which can produce hardware circuits from C or C-like software languages. Various partner solutions take different paths to converting a high-level design description into an FPGA implementation. How this is done goes to the root of the differences between the various ESL offerings.

You can use HLL synthesis for a variety of use cases, including:

• Module generation. In this mode of use, the HLL compiler can convert a functional block expressed in C (for example, as a C subroutine) into a corresponding hardware block. The generated hardware block is then assimilated in the overall hardware/software design. In this way, the HLL compiler generates a sub-module of the overall design.

Module generation allows software engineers to participate in the overall system design by quickly generating, then integrating, algorithmic hardware components. Hardware engineers seeking a fast way to prototype new, computation-oriented hardware blocks can also use module generation.

• Processor acceleration. In this mode of use, the HLL compiler allows time-critical or bottleneck functions running on a processor to be accelerated by enabling the creation of a custom accelerator block in the programmable fabric of the FPGA. In addition to creating the accelerator, the tools can also automatically infer memories and generate the required hardware-software interface circuitry, as well as the software device drivers that enable communication between the processor and the hardware accelerator block (Figure 2). When compared to code running on a CPU, FPGA-accelerated code can run orders of magnitude faster while consuming significantly less power.

2. System modeling. System simulations using traditional RTL models can be very slow for large designs, or when processors are part of the complete design. A popular emerging ESL approach uses high-speed transaction-level models, typically written in C++, to significantly speed up system simulations. ESL tools provide you with a virtual platform-based verification environment where you can analyze and tune the functional and performance attributes of your design. This means much earlier access to a virtual representation of the system, enabling greater design exploration and what-if analysis. You can evaluate and refine performance issues such as latency, throughput, and bandwidth, as well as alternative software/hardware partitioning strategies. Once the design meets its performance objectives, it can be committed to implementation in silicon.


[Figure 1 diagram labels: Empowering The Software Developer; Capture Design in "HLL"; HLL Synthesis; FPGA; CPU; External CPU; No need to learn HDL; No prior FPGA experience needed; Create hardware modules from software code; Accelerate "slow" CPU code in hardware]

Figure 1 – Most of the ESL tools for FPGAs are targeted at a software-centric user base.

[Figure 2 diagram labels: ESL Tools Can ...; CPU (Internal or External to FPGA); Memory (PCB, FPGA); Accelerator (FPGA); LCD (PCB); APU, FSL, PLB, OPB; Infer Memories; Create Coprocessor Interface; Synthesize C into FPGA Gates; Allow Direct Access from C to PCB Components]

Figure 2 – ESL tools abstract the details associated with accelerating processor applications in the FPGA.

The Challenges Faced by ESL Tool Providers
In relative terms, ESL tools for FPGAs are new to the market; customer adoption remains a key challenge. One of the biggest challenges faced by ESL tool providers is overcoming a general lack of awareness as to what is possible with ESL and FPGAs, what solutions and capabilities already exist, and the practical uses and benefits of the technology. Other challenges include user apprehension and concerns over the quality of results and learning curve associated with ESL adoption.

Although paradigm shifts such as those introduced by ESL will take time to become fully accepted within the existing FPGA user community, there is a need to tackle some of the key issues that currently prohibit adoption. This is particularly important because today’s ESL technologies are ready to deliver substantial practical value to a potentially large target audience.

Xilinx ESL Initiative
Xilinx believes that ESL tools have the promise and potential to radically change the way hardware and software designers create, optimize, and verify complex electronic systems. To bring the full range of benefits of this emerging technology to its customers and to establish a common platform for ESL technologies that target FPGAs in particular, Xilinx has proactively formed a collaborative joint ESL Initiative with its ecosystem partners (Table 1).

The overall theme of the initiative is to accelerate the pace of ESL innovation for FPGAs and to bring the technology closer to the needs of the software-centric user base. As part of the initiative, there are two main areas of emphasis:

1. Engineering collaboration. Xilinx will work closely with its partners to further increase the value of ESL product offerings. This will include working to improve the compiler quality of results and enhance tool interoperability and overall ease-of-use.

2. ESL awareness and evangelization. Xilinx will evangelize the value and benefits of ESL flows for FPGAs to current and prospective new customers. The program will seek to inform and educate users on the types of ESL solutions that currently exist and how the various offerings can provide better approaches to solving existing problems. The aim is to empower users to make informed decisions on the suitability and fit of various partner ESL offerings to meet their specific application needs. Greater awareness will lead to increased customer adoption, which in turn will contribute to a sustainable partner ESL for FPGAs ecosystem.

Getting Started With ESL
As a first step to building greater awareness on the various ESL for FPGA efforts, Xilinx has put together a comprehensive ESL website. The content covers the specific and unique aspects of each of the currently available partner ESL solutions and is designed to help you decide which, if any, of the available solutions are a good fit for your applications. To get started with your ESL orientation, visit www.xilinx.com/esl.

Additionally, Xilinx has started a new ESL for FPGAs discussion forum at http://toolbox.xilinx.com/cgi-bin/forum. Here, you can participate in a variety of discussions on topics related to ESL design for FPGAs.

Conclusion
ESL tools for FPGAs give you the power to explore your ideas with programmable hardware without needing to learn low-level details associated with hardware design. Today, you have the opportunity to select from a wide spectrum of innovative and productivity-enhancing solutions that have been specifically optimized for Xilinx FPGAs. With the formal launching of the ESL Initiative, Xilinx is thoroughly committed to working with its third-party ecosystem in bringing the best-in-class ESL tools to its current and potential future customers. Stay tuned for continuing updates and new developments.


Partner          Description
Celoxica         Handel-C, SystemC to gates
Impulse          Impulse C to gates
Poseidon         HW/SW partitioning, acceleration
Critical Blue    Co-processor synthesis
Teja             C to multi-core processing
Mitrion          Adaptable parallel processor in FPGA
System Crafter   SystemC to gates
Bluespec         SystemVerilog-based synthesis to RTL
Nallatech        High-performance computing

[Remaining table columns: FPGA Synthesis; Xilinx CPU Support; FPGA Computing Solution. The per-partner entries in these columns are not recoverable.]


Table 1 – Xilinx ESL partners take different approaches from high-level languages to FPGA implementation.

by Glenn Steiner
Sr. Engineering Manager, Advanced Products Division
Xilinx, Inc.
[email protected]

Kunal Shenoy
Design Engineer, Advanced Products Division
Xilinx, [email protected]

Dan Isaacs
Director of Embedded Processing, Advanced Products Division
Xilinx, [email protected]

David Pellerin
Chief Technology Officer
Impulse Accelerated [email protected]

Today’s designers are constrained by space, power, and cost, and they simply cannot afford to implement embedded designs with gigahertz-class computers. Fortunately, in embedded systems, the greatest computational requirements are frequently determined by a relatively small number of algorithms. These algorithms, identified through profiling techniques, can be rapidly converted into hardware coprocessors using design automation tools. The coprocessors can then be efficiently interfaced to the offloaded processor, yielding “gigahertz-class” performance.

In this article, we’ll explore code acceleration and techniques for code conversion to hardware coprocessors. We will also demonstrate the process for making trade-off decisions with benchmark data through an actual image-rendering case study involving an auxiliary processor unit (APU)-based technique. The design uses an immersed PowerPC™ implemented in a platform FPGA.

The Value of a Coprocessor
A coprocessor is a processing element that is used alongside a primary processing unit to offload computations normally performed by the primary processing unit. Typically, the coprocessor function implemented in hardware replaces several software instructions. Code acceleration is thus achieved both by reducing multiple code instructions to a single instruction and by implementing that instruction directly in hardware.

The most frequently used coprocessor is the floating-point unit (FPU), the only common coprocessor that is tightly coupled to the CPU. There are no general-purpose libraries of coprocessors. Even if there were, it would still be difficult to readily couple a coprocessor to a CPU such as a Pentium 4.

As shown in Figure 1, the Xilinx® Virtex™-4 FX FPGA has one or two PowerPCs, each with an APU interface. By embedding a processor within an FPGA, you now have the opportunity to implement complete processing systems of your own design within a single chip.

The integrated PowerPC with APU interface enables a tightly coupled coprocessor that can be implemented within the FPGA. Frequency requirements and pin number limits make an external coprocessor less capable. Thus, you can now create application-specific coprocessors attached directly to the PowerPC, providing significant software acceleration. Because FPGAs are reprogrammable, you can rapidly develop and test CPU-attached coprocessor solutions.

Algorithmic Acceleration Through Automated Generation of FPGA Coprocessors


C-to-FPGA design methods allow rapid creation of hardware-accelerated embedded systems.

Coprocessor Connection Models
Coprocessors are available in three basic forms: CPU bus connected, I/O connected, and instruction-pipeline connected. Mixed variants also exist.

CPU Bus Connection
Processor bus-connected accelerators require the CPU to move data and send commands through a bus. Typically, a single data transaction can require many processor cycles. Data transactions can be hindered by bus arbitration and the necessity for the bus to be clocked at a fraction of the processor clock speed. A bus-connected accelerator can include a direct memory access (DMA) engine. At the cost of additional logic, the DMA engine allows a coprocessor to operate on blocks of data located on bus-connected memory, independent of the CPU.

I/O Connection
I/O-connected accelerators are attached directly to a dedicated I/O port. Data and control are typically provided through GET or PUT functions. With no arbitration, reduced control complexity, and fewer attached devices, these interfaces are typically clocked faster than a processor bus. A good example of such an interface is the Xilinx Fast Simplex Link (FSL). The FSL is a simple FIFO interface that can be attached to either the Xilinx MicroBlaze™ soft-core processor or a Virtex-4 FX PowerPC. Data movement through the FSL has lower latency and a higher data rate than data movement through a processor bus interface.

Instruction Pipeline Connection
Instruction-pipeline connected accelerators attach directly to the computing core of a CPU. Being coupled to the instruction pipeline, instructions not recognized by the CPU can be executed by the coprocessor. Operands, results, and status are passed directly to and from the data execution pipeline. A single operation can result in two operands being processed, with both a result and status being returned.

As a directly connected interface, the instruction-pipeline connected accelerators can be clocked faster than a processor bus. The Xilinx implementation for this type of coprocessor connection model through the APU interface demonstrates a 10x clock cycle reduction in the control and movement of data for a typical double-operand instruction. The APU controller is also connected to the data-cache controller and can perform data load/store operations through it. Thus, the APU interface is capable of moving hundreds of millions of bytes per second, approaching DMA speeds.

Either I/O-connected accelerators or instruction-pipeline-connected accelerators can be combined with bus-connected accelerators. At the cost of additional logic, you can create an accelerator that receives commands and returns status through a fast, low-latency interface while operating on blocks of data located in bus-connected memory.

The C-to-HDL tool set described in this article is capable of implementing bus-connected and I/O-connected accelerators. It is also capable of implementing an accelerator connected to the APU interface of the PowerPC. Although the APU connection is instruction-pipeline-based, the C-to-HDL tool set implements an I/O pipeline interface with a resulting behavior more typical of an I/O-connected accelerator.

[Figure 1 diagram labels: EMAC; EMAC; PowerPC 405 Core; APU Control; Reset; Debug; FCM]

Figure 1 – Virtex-4 FX processor with APU interface and EMAC blocks

FPGA/PowerPC/APU Interface
FPGAs allow hardware designers to implement a complete computing system with processor, decode logic, peripherals, and coprocessors all on one chip. An FPGA can contain a few thousand to hundreds of thousands of logic cells. A processor can be implemented from the logic cells, as in the Xilinx PicoBlaze™ or MicroBlaze processors, or it can be one or more hard logic elements, as in the Virtex-4 FX PowerPC. The high number of logic cells enables you to implement data-processing elements that work with the processor system and are controlled or monitored by the processor.

FPGAs, being reprogrammable elements, allow you to program parts and test them at any stage during the design process. If you find a design flaw, you can immediately reprogram a part. FPGAs also allow you to implement hardware computing functions that were previously cost-prohibitive. The tight coupling of a CPU pipeline to FPGA logic, as in the Virtex-4 FX PowerPC, enables you to create high-performance software accelerators.

Figure 2 is a block diagram showing the PowerPC, integrated APU controller, and an attached coprocessor. Instructions from cache or memory are simultaneously presented to the CPU decoder and the APU controller. If the CPU recognizes the instruction, it is executed. If not, the APU controller or the user-created coprocessor has the opportunity to acknowledge the instruction and execute it. Optionally, one or two operands can be passed to the coprocessor and a result or status can be returned. The APU interface also supports the ability to transfer a data element with a single instruction. The data element ranges in size from one byte to four 32-bit words.

One or more coprocessors can be attached to the APU interface through a fabric coprocessor bus (FCB). Coprocessors attached to the bus range from off-the-shelf cores, such as an FPU, to user-created coprocessors. A coprocessor can connect to the FCB for control and status operations and to a processor bus, enabling direct access to memory data blocks and DMA data passing. A simplified connection scheme, such as the FSL, can also be used between the FCB and coprocessor, enabling FIFO data and control communication at the cost of some performance.

To demonstrate the performance advantage of an instruction-pipeline-connected accelerator, we first implemented a design with a processor bus-connected FPU and then with an APU/FCB-connected FPU. Table 1 summarizes the performance for a finite impulse response (FIR) filter for each case.

As noted in the table, an FPU connected to an instruction pipeline accelerates software floating-point operations by 30x, and the APU interface provides a nearly 4x improvement over a bus-connected FPU.
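The quoted speedups follow directly from the Table 1 rates. The sketch below pairs a plain direct-form FIR filter, standing in for the benchmark kernel, with the arithmetic behind the 30x and nearly 4x figures. The benchmark's actual tap count and data are not given in the article, so the filter code, the flop-count estimate, and all names here are illustrative assumptions.

```python
from typing import List, Sequence

# Sustained rates from Table 1, in MFLOPS.
RATES = {"software": 2.0, "bus_fpu": 16.0, "apu_fpu": 60.0}


def fir(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form FIR: out[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out


def fir_flops(num_samples: int, num_taps: int) -> int:
    """Upper bound: one multiply and one add per tap per output sample."""
    return 2 * num_samples * num_taps


def runtime_seconds(flops: int, mflops: float) -> float:
    """Estimated run time at a sustained rate given in MFLOPS."""
    return flops / (mflops * 1e6)


def speedup(fast: str, slow: str) -> float:
    """Ratio of two sustained rates from RATES."""
    return RATES[fast] / RATES[slow]
```

With these rates, `speedup("apu_fpu", "software")` is 30 and `speedup("apu_fpu", "bus_fpu")` is 3.75, which matches the "30x" and "nearly 4x" figures in the text.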

Converting C Code to HDL
Converting C code to an HDL accelerator with a C-to-HDL tool is an efficient method for creating hardware coprocessors. Figure 3 and the steps below summarize the C-to-HDL conversion process:

1. Implement the application or algorithm using standard C tools. Develop a software test bench for baseline performance and correctness (host or desktop simulations). Use a profiler (such as gprof) to begin identifying critical functions.

2. Determine if floating-to-fixed point conversion is appropriate. Use libraries or macros to aid in this conversion. Use a baseline test bench to analyze performance and accuracy. Use the profiler to reevaluate critical functions.

3. Using a C-to-HDL tool, such as Impulse C, iterate on each of the critical functions to:

• Partition the algorithm into parallel processes

• Create hardware/software process interfaces (streams, shared memories, signals)

• Automatically optimize and parallelize the critical code sections (such as inner code loops)

• Test and verify the resulting parallel algorithm using desktop simulation, cycle-accurate C simulation, and actual in-system testing.

4. Using the C-to-HDL tool, convert the critical code segment to an HDL coprocessor.

5. Attach the coprocessor to the APU interface for final testing.
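The floating-to-fixed point conversion in step 2 can be prototyped on the desktop before any hardware is generated. The sketch below is an assumption-laden illustration rather than the article's method: it uses a hypothetical Q16.16 format (16 fractional bits), and the helper names are invented for this example.

```python
FRAC_BITS = 16  # assumed Q16.16 format; the article does not name one


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Quantize a float to a signed fixed-point integer."""
    return int(round(x * (1 << frac_bits)))


def to_float(q: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int = FRAC_BITS) -> int:
    """Multiply two fixed-point values and rescale the product.

    The right shift truncates toward negative infinity, which is a
    common low-cost choice when the product is rescaled in hardware.
    """
    return (a * b) >> frac_bits


def conversion_error(x: float, y: float) -> float:
    """Absolute error of a fixed-point multiply against the float result."""
    return abs(to_float(fixed_mul(to_fixed(x), to_fixed(y))) - x * y)
```

Running `conversion_error` over the baseline test vectors gives a quick accuracy check of the kind step 2 calls for, before committing to a word length.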


Implementation                              Performance
Software Implementation                     2 MFLOPS
FPU Connected to Processor Bus              16 MFLOPS
FPU Connected to APU Interface via FCB      60 MFLOPS

Table 1 – Non-accelerated vs. accelerated floating-point performance

[Figure 2 diagram labels: Processor Block; Instructions from Cache or Memory; Fetch Stage; Decode Stage; EXE Stage; Write Back (WB) Stage; Load WB Stage; Decode; Register File; Execution Units; PPC405 APU Controller; Decode Control; Pipeline Control; Buffers and Synchronization; Decode Registers; APU Decode; Fabric Coprocessor Module (FCM); Optional Decode; Execution Units; Instruction; Reset; Load Data; Operands; Control]

Figure 2 – PowerPC, integrated APU controller, and coprocessor

Figure 3 – C-to-HDL design flow

Impulse: C-to-HDL Tool
Impulse C, shown in Figure 4, enables embedded system designers to create highly parallel, FPGA-accelerated applications by using C-compatible library functions in combination with the Impulse CoDeveloper C-to-hardware compiler. Impulse C simplifies the design of mixed hardware/software applications through the use of well-defined data communication, message passing, and synchronization mechanisms. Impulse C provides automated optimization of C code (such as loop pipelining, unrolling, and operator scheduling) and interactive tools, allowing you to analyze cycle-by-cycle hardware behavior.

Impulse C is designed for dataflow-oriented applications, but it is also flexible enough to support alternate programming models, including the use of shared memory. This is important because different FPGA-based applications have different performance and data requirements. In some applications, it makes more sense to move data between the embedded processor and the FPGA through block memory reads and writes; in other cases, a streaming communication channel might provide higher performance. The ability to quickly model, compile, and evaluate alternate algorithm approaches is an important part of achieving the best possible results for a given application.

To this end, the Impulse C library comprises minimal extensions to the C language in the form of new data types and predefined function calls. Using Impulse C function calls, you can define multiple, parallel program segments (called processes) and describe their interconnections using streams, signals, and other mechanisms. The Impulse C compiler translates and optimizes these C-language processes into either:

• Lower-level HDL that can be synthesized to FPGAs, or

• Standard C (with associated library calls) that can be compiled onto supported microprocessors through the use of widely available C cross-compilers
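The process-and-stream programming model described above can be mimicked on the desktop with threads and bounded queues. The sketch below only illustrates the dataflow idea; it does not use or reproduce the Impulse C API, and every name in it (`producer`, `scale_process`, `consumer`, `run_pipeline`, `END`) is hypothetical.

```python
import queue
import threading
from typing import Iterable, List

END = object()  # end-of-stream marker (a convention assumed for this sketch)


def producer(values: Iterable[int], out_stream: queue.Queue) -> None:
    """Software-side process: write every value to the stream, then close it."""
    for v in values:
        out_stream.put(v)
    out_stream.put(END)


def scale_process(in_stream: queue.Queue, out_stream: queue.Queue,
                  gain: int) -> None:
    """Stand-in for a hardware process: read, compute, write, repeat."""
    while True:
        item = in_stream.get()
        if item is END:
            out_stream.put(END)
            return
        out_stream.put(item * gain)


def consumer(in_stream: queue.Queue, results: List[int]) -> None:
    """Software-side process: collect results until the stream closes."""
    while True:
        item = in_stream.get()
        if item is END:
            return
        results.append(item)


def run_pipeline(values: Iterable[int], gain: int = 2,
                 depth: int = 4) -> List[int]:
    """Connect three processes with two bounded FIFO streams and run them."""
    to_hw: queue.Queue = queue.Queue(maxsize=depth)
    from_hw: queue.Queue = queue.Queue(maxsize=depth)
    results: List[int] = []
    threads = [
        threading.Thread(target=producer, args=(list(values), to_hw)),
        threading.Thread(target=scale_process, args=(to_hw, from_hw, gain)),
        threading.Thread(target=consumer, args=(from_hw, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queue depth plays the role of a FIFO of fixed size: a process that runs ahead simply blocks until its neighbor catches up, which is the behavior a streaming interface between processor and fabric relies on.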

The complete CoDeveloper development environment includes desktop simulation libraries compatible with standard C compilers and debuggers, including Microsoft Visual Studio and GCC/GDB. Using these libraries, Impulse C programmers are able to compile and execute their applications for algorithm verification and debugging purposes. C programmers are also able to examine parallel processes, analyze data movement, and resolve process-to-process communication problems using the CoDeveloper Application Monitor.

The output of an Impulse C application, when compiled, is a set of hardware and software source files that are ready for importing into FPGA synthesis tools. These files include:

• Automatically generated HDL files representing the compiled hardware process.

• Automatically generated HDL files representing the stream, signal, and memory components needed to connect hardware processes to a system bus.

• Automatically generated software components (including a run-time library) establishing the software side of any hardware/software stream connections.

• Additional files, including script files, for importing the generated application into the target FPGA place and route environment.

The result of this compilation process is a complete application, including the required hardware/software interfaces, ready for implementation on an FPGA-based programmable platform.



Figure 4 – Impulse C

Design Example
The Mandelbrot image shown in Figure 5, a classic example of fractal geometry, is widely used in the scientific and engineering communities to simulate chaotic events such as weather. Fractals are also used to generate textures and imaging in video-rendering applications. Mandelbrot images are described as self-similar; on magnifying a portion of the image, another image similar to the whole is obtained.

The Mandelbrot image is an ideal can-didate for hardware/software co-designbecause it has a single computation-intensive function. Making this criticalfunction faster by moving it to the hard-ware domain significantly increases thespeed of the whole system. TheMandelbrot application also lends itselfnicely to clear divisions between hardwareand software processes, making it easy toimplement using C-to-HDL tools.

We used the CoDeveloper tool set as the C-to-HDL tool set for this design example. We modified a software-only Mandelbrot C program to make it compatible with the C-to-HDL tools. Our changes included division of the software project into distinct processes (independent units of sequential execution); conversion of function interfaces (hardware to software) into streams; and the addition of compiler directives to optimize the generated hardware. We subsequently used the CoDeveloper tool set to create the Pcore coprocessor that was imported into Xilinx Platform Studio (XPS). Using XPS, we attached the Pcore to the PowerPC APU controller interface and tested the system.

Xilinx application note XAPP901 (www.xilinx.com/bvdocs/appnotes/xapp901.pdf) provides a full description of the design along with design files for downloading. User Guide UG096 (www.xilinx.com/bvdocs/userguides/ug096.pdf) provides a step-by-step tutorial in implementing the design example.

Performance Improvement Examples

We measured performance improvements for the Mandelbrot image texturing problem, an image filtering application, and triple DES encryption. Table 2 documents the performance improvements, demonstrating acceleration ranging from 11x to 34x that of software.

Conclusion

Constrained by power, space, and cost, you might need to make a non-ideal processor choice. Frequently, it is a choice where the processor is of lower performance than desired. When the software code does not run fast enough, a coprocessor code accelerator becomes an attractive solution. You can hand-craft an accelerator in HDL or use a C-to-HDL tool to automatically convert the C code to HDL.

Using a C-to-HDL tool such as Impulse C enables quick and easy accelerator generation. Virtex-4 FX FPGAs, with one or two embedded PowerPCs, enable tight coupling of the processor instruction pipeline to software accelerators. As demonstrated in this article, critical software routines can be accelerated from 10x to more than 30x, enabling a 300 MHz PowerPC to provide performance equaling or exceeding that of a high-performance multi-gigahertz processor. The above examples were generated in just a few days each, demonstrating the rapid design, implementation, and testing possible with a C-to-HDL flow.

Application                            PowerPC Only   PowerPC + Coprocessor   Acceleration
                                       (300 MHz)      (300/50 MHz)

Image Texturing (Mandelbrot/Fractal)   21 sec         1.2 sec                 17x
Image Filter (Edge Detection)          0.14 sec       0.012 sec               11x
Encryption (Triple DES)                2.3 sec        0.067 sec               34x


Figure 5 – Mandelbrot image and code acceleration

Table 2 – Algorithm acceleration through coprocessor accelerators

Generating Efficient Board Support Packages

The Xilinx Platform Studio toolset enables quick and easy BSP generation for Virtex FPGAs with immersed PowerPC processors.

by Rick Moleres
Manager of Software IP
Xilinx, [email protected]

Milan Saini
Technical Marketing Manager
Xilinx, [email protected]

Platform FPGAs with embedded processors offer you unprecedented levels of flexibility, integration, and performance. It is now possible to develop extremely sophisticated and highly customized embedded systems inside a single programmable logic device.

With silicon capabilities advancing, the challenge centers on keeping design methods efficient and productive. In embedded systems development, one of the key activities is the development of the board support package (BSP). The BSP allows an embedded software application to successfully initialize and communicate with the hardware resources connected to the processor. Typical BSP components include boot code, device driver code, and initialization code.

Creating a BSP can be a lengthy and tedious process that must be incurred every time the microprocessor complex (processor plus associated peripherals) changes. With FPGAs, fast design iterations combined with the inherent flexibility of the platform can make the task of managing the BSP even more challenging (Figure 1). This situation clearly underscores the need for and importance of providing an efficient process for managing BSPs.

In this article, we’ll describe an innovative solution from Xilinx that simplifies the creation and management of RTOS BSPs. We chose the Wind River VxWorks flow to illustrate the concept; however, the underlying technology is generic and equally applicable to all other OS solutions that support Xilinx® processors.

Xilinx Design Flow and Software BSP Generation

Designing for a Xilinx processor involves a hardware platform assembly flow and an embedded software development flow. Both of these flows are managed within the Xilinx Platform Studio (XPS) tool, which is part of the Xilinx Embedded Development Kit (EDK).

You would typically begin a design by assembling and configuring the processor and its connected components in XPS. Once the hardware platform has been defined, you can then configure the software parameters of the system. A key feature of Platform Studio is its ability to produce a BSP that is customized based on your selection and configuration of processor, peripherals, and embedded OS. As the system evolves through iterative changes to the hardware design, the BSP evolves with the platform.

An automatically generated BSP enables embedded system designers to:

• Automatically create a BSP that completely matches the hardware design

• Eliminate BSP design bugs by using pre-certified components

• Increase designer productivity by jump-starting application software development

Creating BSPs for Wind River VxWorks

Platform Studio can generate a customized Tornado 2.0.x (VxWorks 5.4) or Tornado 2.2.x (VxWorks 5.5) BSP for the PowerPC™ 405 processor and its peripherals in Xilinx Virtex™-II Pro and Virtex-4 FPGAs. The generated BSP contains all of the necessary support software for a system, including boot code, device drivers, and VxWorks initialization.

Once a hardware system with the PowerPC 405 processor is defined in Platform Studio, you need only follow these three steps to generate a BSP for VxWorks:

1. Use the Software Settings dialog box (see Figure 2) to select the OS you plan to use for the system. Platform Studio users can select vxworks5_4 or vxworks5_5 as their target operating system.

2. Once you have selected the OS, you can go to the Library/OS Parameters tab, as shown in Figure 3, to tailor the Tornado BSP to the custom hardware. You have the option of selecting any UART device in the system as the standard I/O device (stdin and stdout). This results in the device being used as the VxWorks console device.

You can also choose which peripherals are connected peripherals, that is, which devices will be tightly integrated into the VxWorks OS. For example, the Xilinx 10/100 Ethernet MAC can be integrated into the VxWorks Enhanced Network Driver (END) interface. Alternatively, the Ethernet device need not be connected to the END interface and can instead be accessed directly from the VxWorks application.

3. Generate the Tornado BSP by selecting the Tools > Generate Libraries and BSP menu option. The resulting BSP resembles a traditional Tornado BSP and is located in the Platform Studio project directory under ppc405_0/bsp_ppc405_0 (see Figure 4).

Note that ppc405_0 refers to the instance name of the PowerPC 405 processor in the hardware design. Platform Studio users can specify a different instance name, in which case the subdirectory names for the BSP will match the processor instance name.


[Figure 1 diagram: a traditional embedded platform (fixed peripherals, fixed address maps, fixed BSP) contrasted with a platform FPGA (design-dependent peripherals; each board is unique and custom; efficient custom BSP creation is needed). Blocks shown: PLB arbiter, OPB arbiter, PLB-OPB bridge, high-speed peripherals, low-speed peripherals, memory controller, custom peripherals.]

Figure 1 – Platform FPGA flexibility requires the software BSP generation process to be efficient.

Figure 2 – Setting the embedded OS selection

Figure 3 – Configuring the OS-specific parameters

Figure 4 – Generated BSP directory structure

The Tornado BSP is completely self-contained and transportable to other directory locations, such as the standard Tornado installation directory for BSPs at target/config.

Customized BSP Details

The XPS-generated BSP for VxWorks resembles most other Tornado BSPs except for the placement of Xilinx device driver code. Off-the-shelf device driver code distributed with Tornado typically resides in the target/src/drv directory in the Tornado distribution directory. Device driver code for a BSP that is automatically generated by Platform Studio resides in the BSP directory itself.

This minor deviation is due to the dynamic nature of FPGA-based embedded systems. Because an FPGA-based embedded system can be reprogrammed with new or changed IP, the device driver configuration can change, calling for a more dynamic placement of device driver source files. The directory tree for the automatically generated BSP is shown in Figure 4. The Xilinx device drivers are placed in the ppc405_0_drv_csp/xsrc subdirectory of the BSP.

Xilinx device drivers are implemented in C and are distributed among several source files, unlike traditional VxWorks drivers, which typically consist of single C header and implementation files. In addition, there is an OS-independent implementation and an optional OS-dependent implementation for device drivers.

The OS-independent part of the driver is designed for use with any OS or any processor. It provides an application program interface (API) that abstracts the functionality of the underlying hardware. The OS-dependent part of the driver adapts the driver for use with an OS such as VxWorks. Examples are Serial IO drivers for serial ports and END drivers for Ethernet controllers. Only drivers that can be tightly integrated into a standard OS interface require an OS-dependent driver.

Xilinx driver source files are included in the build of a VxWorks image in the same way that other BSP files are included in the build. For every driver, a file exists named ppc405_0_drv_<driver_version>.c in the BSP directory. This file includes the driver source files (*.c) for the given device and is automatically compiled by the BSP makefile.

This process is analogous to how VxWorks’ sysLib.c includes source for Wind River-supplied drivers. Xilinx driver files are not simply included in sysLib.c like the rest of the drivers because of namespace conflicts and maintainability issues. If all Xilinx driver files are part of a single compilation unit, static functions and data are no longer private. This places restrictions on the device drivers and would negate their OS independence.

Integration with the Tornado IDE

The automatically generated BSP is integrated into the Tornado IDE (Project Facility). The BSP is compilable from the command line using the Tornado make tools or from the Tornado Project. Once the BSP is generated, you can simply type make vxWorks from the command line to compile a bootable RAM image. This assumes that the Tornado environment has been previously set up, which you can do through the command line using the host/x86-win32/bin/torVars.bat script (on a Windows platform). If you are using the Tornado Project facility, you can create a project based on the newly generated BSP, then use the build environment provided through the IDE to compile the BSP.

In Tornado 2.2.x, the diab compiler is supported in addition to the gnu compiler. The Tornado BSP created by Platform Studio has a makefile that you can modify at the command line if you would rather use the diab compiler instead of the gnu compiler. Look for the make variable named TOOLS and set the value to “diab” instead of “gnu.” If using the Tornado Project facility, you can select the desired compiler when the project is first created.

The file 50ppc405_0.cdf resides in the BSP directory and is tailored during creation of the BSP. This file integrates the device drivers into the Tornado IDE menu system. The drivers are hooked into the BSP at the Hardware > Peripherals subfolder. Below this are individual device driver folders. Figure 5 shows a menu with Xilinx device drivers.

Figure 5 – Tornado 2.x Project: VxWorks tab

Figure 6 – Tornado 2.x Project: Files tab

The Files tab of the Tornado Project Facility will also show the number of files used to integrate the Xilinx device drivers into the Tornado build process. These files are automatically created by Platform Studio and you need only be aware that the files exist. Figure 6 shows an example of the driver build files.

Some of the commonly used devices are tightly integrated with the OS, while other devices are accessible from the application by directly using the device drivers. The device drivers that have been tightly integrated into VxWorks include:

• 10/100 Ethernet MAC

• 10/100 Ethernet Lite MAC

• 1 Gigabit Ethernet MAC

• 16550/16450 UART

• UART Lite

• Interrupt Controller

• System ACE™ technology

• PCI

All other devices and associated device drivers are not tightly integrated into a VxWorks interface; instead, they are loosely integrated. Access to these devices is available by directly accessing the associated device drivers from the user application.

Conclusion

With the popularity and usage of embedded processor-based FPGAs continuing to grow, tool solutions that effectively synchronize and tie the hardware and software flows together are key to helping designer productivity keep pace with advances in silicon.

Xilinx users have been very positive about Platform Studio and its integration with VxWorks 5.4 and 5.5. Xilinx fully intends to continue its development support for the Wind River flow, which will soon include support for VxWorks 6.0 and Workbench IDE.

Microprocessor Library Definition (MLD)

The technology that enables dynamic and custom BSP generation is based on a Xilinx proprietary format known as Microprocessor Library Definition (MLD). This format provides third-party vendors with a plug-in interface to Xilinx Platform Studio to enable custom library and OS-specific BSP generation (see Figure 7). The MLD interface is typically written by third-party companies for their specific flows. It enables the following add-on functionality:

• Enables custom design rule checks

• Provides the ability to customize device drivers for the target OS environment

• Provides the ability to custom-produce the BSP in a format and folder structure tailored to the OS tool chain

• Provides the ability to customize an OS/kernel based on the hardware system under consideration

The MLD interface is an ASCII-based open and published standard. Each RTOS flow will have its own set of unique MLD files. An MLD file set comprises the following two files:

• A data definition (.mld) file. This file defines the library or operating system through a set of parameters set by the Platform Studio. The values of these parameters are stored in an internal Platform Studio database and intended for use by the script file during the output generation.

• A .tcl script file. This is the file that is called by XPS to create the custom BSP. The file contains a set of procedures that have access to the complete design database and hence can write a custom output format based on the requirements of the flow.

The MLD syntax is described in detail in the EDK documentation (see “Platform Specification Format Reference Manual” at www.xilinx.com/ise/embedded/psf_rm.pdf). You can also find MLD examples in the EDK installation directory under sw/lib/bsp.

Once MLD files for a specific RTOS flow have been created, they need to be installed in a specific path for Xilinx Platform Studio to pick up on its next invocation. The specific RTOS menu selection now becomes active in the XPS dialog box (Project > SW Platform Settings > Software Platform > OS).

Currently, the following partners’ MLD files are available for use within XPS:

• Wind River (VxWorks 5.4, 5.5) (included in Xilinx Platform Studio)

• MontaVista (Linux) (included in Xilinx Platform Studio)

• Mentor Accelerated Technologies (Nucleus) (download from www.xilinx.com/ise/embedded/mld/)

• Green Hills Software (Integrity) (download from www.xilinx.com/ise/embedded/mld/)

• Micrium (µC/OS-II) (download from www.xilinx.com/ise/embedded/mld/)

• µcLinux (download from www.xilinx.com/ise/embedded/mld/)

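[Figure 7 diagram: within XPS, the hardware design and the OS selection are combined with the MLD files (.mld and .tcl) to produce a hardware netlist, which goes to ISE, and an RTOS BSP, which goes to the RTOS IDE.]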

Figure 7 – Structure of an MLD flow


Bringing Floating-Point Math to the Masses

Xilinx makes high-performance floating-point processing available to a wider range of applications.

by Geir Kjosavik
Senior Staff Product Marketing Engineer, Embedded Processing Division
Xilinx, [email protected]

Inside microprocessors, numbers are represented as integers – one or several bytes strung together. A four-byte value comprising 32 bits can hold a relatively large range of numbers: $2^{32}$, to be specific. The 32 bits can represent the numbers 0 to 4,294,967,295 or, alternatively, -2,147,483,648 to +2,147,483,647. A 32-bit processor is architected such that basic arithmetic operations on 32-bit integer numbers can be completed in just a few clock cycles, and with some performance overhead a 32-bit CPU can also support operations on 64-bit numbers. The largest value that can be represented by 64 bits is really astronomical: 18,446,744,073,709,551,615. In fact, if a Pentium processor could count 64-bit values at a frequency of 2.4 GHz, it would take it 243 years to count from zero to the maximum 64-bit integer.

Dynamic Range and Rounding Error Problems

Considering this, you would think that integers work fine, but that is not always the case. The problems with integers are limited dynamic range and rounding errors.

The quantization introduced through a finite resolution in the number format distorts the representation of the signal. However, as long as a signal is utilizing the range of numbers that can be represented by integer numbers, also known as the dynamic range, this distortion may be negligible.

Figure 1 shows what a quantized signal looks like for large and small dynamic swings, respectively. Clearly, with the smaller amplitude, each quantization step is bigger relative to the signal swing and introduces higher distortion or inaccuracy.

The following example illustrates how integer math can mess things up.

A Calculation Gone Bad

An electronic motor control measures the velocity of a spinning motor, which typically ranges from 0 to 10,000 RPM. The value is measured using a 32-bit counter. To allow some overflow margin, let’s assume that the measurement is scaled so that 15,000 RPM represents the maximum 32-bit value, 4,294,967,295. If the motor is spinning at 105 RPM, this value corresponds to the number 30,064,771 within 0.0000033%, which you would think is accurate enough for most practical purposes.

Assume that the motor control is instructed to increase motor velocity by 0.15% of the current value. Because we are operating with integers, multiplying by 1.0015 is out of the question – as is multiplying by 10,015 and dividing by 10,000 – because the intermediate result will cause overflow.

The only option is to divide by integer 10,000 and multiply by integer 10,015. If you do that, you end up with 30,105,090; but the correct answer is 30,109,868.

Because of the truncation that happens when you divide by 10,000, the resulting velocity increase is 10.6% smaller than what you asked for. Now, an error of 10.6% of 0.15% may not sound like anything to worry about, but as you continue to perform similar adjustments to the motor speed, these errors will almost certainly accumulate to a point where they become a problem.

What you need to overcome this problem is a numeric computer representation that represents small and large numbers with equal relative precision. That is exactly what floating-point arithmetic does.

Floating Point to the Rescue

As you have probably guessed, floating-point arithmetic is important in industrial applications like motor control, but also in a variety of other applications. An increasing number of applications that traditionally have used integer math are turning to floating-point representation. I’ll discuss this once we have looked at how floating-point math is performed inside a computer.

IEEE 754 at a Glance

A floating-point number representation on a computer uses something similar to a scientific notation with a base and an exponent. A scientific representation of 30,064,771 is $3.0064771 \times 10^{7}$, whereas 1.001 can be written as $1.001 \times 10^{0}$. In the first example, 3.0064771 is called the mantissa, 10 the exponent base, and 7 the exponent.

IEEE standard 754 specifies a common format for representing floating-point numbers in a computer. Two grades of precision are defined: single precision and double precision. The representations use 32 and 64 bits, respectively. This is shown in Figure 2.

In IEEE 754 floating-point representation, each number comprises three basic components: the sign, the exponent, and the mantissa. To maximize the range of possible numbers, the mantissa is divided into a fraction and leading digit. As I’ll explain, the latter is implicit and left out of the representation.

The sign bit simply defines the polarity of the number. A value of zero means that the number is positive, whereas a 1 denotes a negative number.

The exponent represents a range of numbers, positive and negative; thus a bias value must be subtracted from the stored exponent to yield the actual exponent. The single precision bias is 127, and the double precision bias is 1,023. This means that a stored value of 100 indicates a single-precision exponent of -27. The exponent base is always 2, and this implicit value is not stored.


Single precision (32 bits): S [31] | Exponent [30-23] | Fraction [22-0]

Double precision (64 bits): S [63] | Exponent [62-52] | Fraction [51-0]

Figure 1 – Signal quantization and dynamic range

Figure 2 – IEEE floating-point formats

For both representations, exponent representations of all 0s and all 1s are reserved and indicate special numbers:

• Zero: all digits set to 0, sign bit can be either 0 or 1

• ±∞: exponent all 1s, fraction all 0s

• Not a Number (NaN): exponent all 1s, non-zero fraction. Two versions of NaN are used to signal the result of invalid operations such as dividing zero by zero, and indeterminate results such as operations with non-initialized operand(s).

The mantissa represents the number to be multiplied by 2 raised to the power of the exponent. Numbers are always normalized; that is, represented with one non-zero leading digit in front of the radix point. In binary math, there is only one non-zero number, 1. Thus the leading digit is always 1, allowing us to leave it out and use all the mantissa bits to represent the fraction (the decimals).

Following the previous number examples, here is what the single precision representation of the decimal value 30,064,771 will look like:

The binary integer representation of 30,064,771 is 1 1100 1010 1100 0000 1000 0011. This can be written as $1.110010101100000010000011 \times 2^{24}$. The leading digit is omitted, and the fraction – the string of digits following the radix point – is 1100 1010 1100 0000 1000 0011. The sign is positive and the exponent is 24 decimal. Adding the bias of 127 and converting to binary yields an IEEE 754 exponent of 1001 0111.

Putting all of the pieces together, the single-precision representation for 30,064,771 is shown in Figure 3.

Gain Some, Lose Some

Notice that you lose the least significant bit (LSB) of value 1 from the 32-bit integer representation – this is because of the limited precision for this format.

The range of numbers that can be represented with single precision IEEE 754 representation is $\pm(2 - 2^{-23}) \times 2^{127}$, or approximately $\pm 10^{38.53}$. This range is astronomical compared to the maximum range of 32-bit integer numbers, which by comparison is limited to around $\pm 2.15 \times 10^{9}$. Also, whereas the integer representation cannot represent values between 0 and 1, single-precision floating-point can represent values down to $\pm 2^{-149}$, or approximately $\pm 10^{-44.85}$. And we are still using only 32 bits – so this has to be a much more convenient way to represent numbers, right?

The answer depends on the requirements.

• Yes, because in our example of multiplying 30,064,771 by 1.0015, we can simply multiply the two numbers and the result will be extremely accurate.

• No, because as in the preceding example the number 30,064,771 is not represented with full precision. In fact, 30,064,771 shares its 32-bit bit pattern with a neighboring integer (30,064,770 if the least significant bit is simply truncated), meaning that a software algorithm will treat the two numbers as identical. Worse yet, if you add 1 to such a number a billion times, the result stalls at the nearest representable value instead of growing by a billion. By using 64 bits and representing the numbers in double precision format, that particular example could be made to work, but even double-precision representation will face the same limitations once the numbers get big – or small – enough.

• No, because the ALUs (arithmetic logic units) of most embedded processor cores support only integer operations, which leaves floating-point operations to be emulated in software. This severely affects processor performance. A 32-bit CPU can add two 32-bit integers with one machine code instruction; however, a library routine including bit manipulations and multiple arithmetic operations is needed to add two IEEE single-precision floating-point values. With multiplication and division, the performance gap just increases; thus for many applications, software floating-point emulation is not practical.

Floating Point Co-Processor Units

If you remember PCs based on the Intel 8086 or 8088 processor, you may recall that they came with the option of adding a floating-point coprocessor unit (FPU), the 8087. Through a compiler switch, you could tell the compiler that an 8087 was present in the system. Whenever the 8086 encountered a floating-point operation, the 8087 would take over, do the operation in hardware, and present the result on the bus.

Hardware FPUs are complex logic circuits, and in the 1980s the cost of the additional circuitry was significant; thus Intel decided that only those who needed floating-point performance would have to pay for it. The FPU was kept as an optional discrete solution until the introduction of the 80486, which came in two versions, one with and one without an FPU. With the Pentium family, the FPU was offered as a standard feature.

Floating Point is Gaining Ground

These days, applications using 32-bit embedded processors with far less processing power than a Pentium also require floating-point math. Our initial example of motor control is one of many – other applications that benefit from FPUs are industrial process control, automotive control, navigation, image processing, CAD tools, and 3D computer graphics, including games.

As floating-point capability becomes more affordable and widespread, applications that traditionally have used integer math turn to floating-point representation. Examples of the latter include high-end audio and image processing. The latest version of Adobe Photoshop, for example, supports image formats where each color channel is represented by a floating-point number rather than the usual integer representation. The increased dynamic range fixes some problems inherent in integer-based digital imaging.


Figure 3 – 30,064,771 represented in IEEE 754 single-precision format

If you have ever taken a picture of aperson against a bright blue sky, you knowthat without a powerful flash you are leftwith two choices; a silhouette of the per-son against a blue sky or a detailed faceagainst a washed-out white sky. A floating-point image format partly solves this prob-lem, as it makes it possible to representsubtle nuances in a picture with a widerange in brightness.

Compared to software emulation, FPUscan speed up floating-point math opera-tions by a factor of 20 to 100 (dependingon type of operation) but the availability ofembedded processors with on-chip FPUs islimited. Although this feature is becomingincreasingly more common at the higherend of the performance spectrum, thesederivatives often come with an extensiveselection of advanced peripherals and veryhigh-performance processor cores – fea-tures and performance that you have to payfor even if you only need the floating-pointmath capability.

FPUs on Embedded Processors

With the MicroBlaze™ 4.00 processor, Xilinx makes an optional single-precision FPU available. You now have the choice whether to spend some extra logic to achieve real floating-point performance or to do traditional software emulation and free up some logic (20-30% of a typical processor system) for other functions.

Why Integrated FPU is the Way to Go

A soft processor without hardware support for floating-point math can be connected to an external FPU implemented on an FPGA. Similarly, any microcontroller can be connected to an external FPU. However, unless you make special provisions on the compiler side, you cannot expect seamless cooperation between the two.

C compilers for CPU architecture families that have no floating-point capability will always emulate floating-point operations in software by linking in the necessary library routines. If you were to connect an FPU to the processor bus, FPU access would occur through specifically designed driver routines such as this one:

void user_fmul(float *op1, float *op2, float *res)
{
    FPU_operand1 = *op1;             /* write operand a to FPU */
    FPU_operand2 = *op2;             /* write operand b to FPU */
    FPU_operation = MUL;             /* tell FPU to multiply */
    while (!(FPU_stat & FPUready));  /* wait for FPU to finish */
    *res = FPU_result;               /* return result */
}

To do the operation z = x*y in the main program, you would have to call the above driver function as:

float x, y, z;

user_fmul (&x, &y, &z);

For small and simple operations, this may work reasonably well, but for complex operations involving multiple additions, subtractions, divisions, and multiplications, such as a proportional integral derivative (PID) algorithm, this approach has three major drawbacks:

• The code will be hard to write, maintain, and debug

• The overhead in function calls will severely decrease performance

• Each operation involves at least five bus transactions; as the bus is likely to be shared with other resources, this not only affects performance, but the time needed to perform an operation will depend on the bus load at that moment

The MicroBlaze Way

The MicroBlaze soft processor with the optional FPU is a fully integrated solution that offers high performance, deterministic timing, and ease of use. The FPU operation is completely transparent to the user.

When you build a system with an FPU, the development tools automatically equip the CPU core with a set of floating-point assembly instructions known to the compiler.

To perform z = x*y, you would simply write:

float x, y, z;

z = x * y;

and the compiler will use those special instructions to invoke the FPU and perform the operation.

Not only is this simpler, but a hardware-connected FPU guarantees a constant number of CPU cycles for each floating-point operation. Finally, the FPU provides an extreme performance boost. Every basic floating-point operation is accelerated by a factor of 25 to 150, as shown in Table 1.

Conclusion

Floating-point arithmetic is necessary to meet precision and performance requirements for an increasing number of applications.

Today, most 32-bit embedded processors that offer this functionality are derivatives at the higher end of the price range.

The MicroBlaze soft processor with FPU can be a cost-effective alternative to ASSP products, and results show that with the correct implementation you can benefit not only from ease of use but from vast improvements in performance as well.

For more information on the MicroBlaze FPU, visit www.xilinx.com/ipcenter/processor_central/microblaze/microblaze_fpu.htm.


Operation        CPU Cycles without FPU   CPU Cycles with FPU   Acceleration
Addition         400                      6                     67x
Subtraction      400                      6                     67x
Division         750                      30                    25x
Multiplication   400                      6                     67x
Comparison       450                      3                     150x

Table 1 – MicroBlaze floating-point acceleration

by Bryon Moyer
VP, Product Marketing
Teja Technologies, Inc.
[email protected]

As the world gets connected, more and more systems rely on access to the network as a standard part of product configuration. Traditional circuit-based communications systems like the telephone infrastructure are gradually moving towards packet-based technology. Even technologies like Asynchronous Transfer Mode (ATM) are starting to yield to the Internet Protocol (IP) in places. All of this has dramatically increased the need for packet-processing technology.

The content being passed over this infrastructure has increased the demands on available bandwidth. Core routers target 10 Gbps; edge and access equipment work in the 1-5 Gbps range. Even some end-user equipment is starting to break the 100 Mbps range. The question is how to design systems to accommodate these speeds.

These systems implement a wide variety of network protocols. Because the protocols start out as software, it’s easiest for network designers if as much of the functionality as possible can remain in software. So the further software programmability can be pushed up the speed range, the better. Although FPGAs can handle network speeds as high as 10 Gbps, RTL has typically been required for 1 Gbps and higher.


Packet Subsystem on a Chip

Teja’s packet-engine technology integrates all of the key aspects of a flexible packet processor.

Teja Technologies specializes in packet-processing technologies implemented in high-level software on multi-core environments. Teja has adapted its technology to Xilinx® Virtex™-4 FPGAs, allowing high-level software programmability of a packet-processing engine built out of multiple MicroBlaze™ soft-processor cores. This combination of high-level packet technology and Xilinx silicon and core technology – using Virtex-4 devices with on-board MACs, PHYs, PowerPC™ hard-core processors, and ample memory – provides a complete packet-processing subsystem that can process more than 1 Gbps in network traffic.

The Typical Packet Subsystem

The network “stack” shown in Figure 1 is typically divided between the “control plane” and the “data plane.” All of the packets are handled in the data plane; the control plane makes decisions on how the packets should be processed. The lowest layer sees every packet; higher layers will see fewer packets.

The control plane comprises a huge amount of sophisticated software. The data-plane software is simpler, but must operate at very high speed at the lowest layers because it has such a high volume of packets. Packet-processing acceleration usually focuses on layers one to three of the network stack, and sometimes layer four.

Most traffic that goes through the system looks alike, and processors can be optimized for that kind of traffic. For this reason, data-plane systems are often divided into the “fast path,” which handles average traffic, and the “slow path,” which handles exceptions. Although the slow path can be managed by a standard RISC processor like a PowerPC, the fast path usually uses a dedicated structure like a network processor or an ASIC. The focus of the fast path is typically IP, ATM, VLAN, and similar protocols in layers two and three. Layer four protocols like TCP and UDP are also often accelerated.

Of course, to process packets, there must be a way to deliver the packets to and from the fast-path processor. Coming off an Ethernet port, the packets must first traverse the physical layer logic (layer one of the stack, often a dedicated chip) and then the MAC (part of layer two, also often its own dedicated chip).

One of the most critical elements in getting performance is the memory. Memory is required for packet storage, table storage, and for program and data storage for both the fast and slow paths. Memory latency has a dramatic impact on speed, so careful construction of the memory architecture is paramount.

Finally, there must be a way for the control plane to access the subsystem. This is important for initialization, making table changes, diagnostics, and other control functions. Such access is typically accomplished through a combination of serial connections and dedicated Ethernet connections, each requiring logic to implement.

A diagram of this subsystem is shown in Figure 2; all of the pieces of this subsystem are critical to achieving the highest performance.

Figure 1 – The network protocol stack
(Diagram labels: Data Plane; Control Plane; 7 Application Layer; 4 Transport Layer (TCP, UDP, ...); 3 Network Layer (IP); 2 Data Link Layer; 1 Physical Layer (PHY); Ethernet)

Figure 2 – Typical packet-processing subsystem
(Diagram labels: Port Access; Fast-Path Processor; Memory; Slow-Path, Control Plane RISC Processor; Control Access)

The Teja Packet Pipeline

One effective way to accelerate processing is to use a multi-core pipeline. This allows you to divide the functionality into stages and add parallel elements as needed to hit performance. If you were to try to assemble such a structure manually, you would immediately encounter the kinds of challenges faced by experienced multi-core designers: how to structure communication between stages, scheduling, and shared resource access.

Teja has developed a pipeline structure by creating its own blocks that implement the necessary functions for efficient processing and inter-communication. By taking advantage of this existing infrastructure, you can assemble pipelines easily in a scalable fashion.

The pipeline comprises processing engines connected by communication blocks and accessed through packet access blocks. Figure 3 illustrates this arrangement.

The engine consists primarily of a MicroBlaze processor and some private block RAM on the FPGA. In addition, if a stage has a particularly compute-intensive function like a checksum, or a longer-lead function like an external memory read or write, an offload can be included to accelerate that function. Because the offload can be made asynchronous if desired, the MicroBlaze processor is free to work on something else while the offload is operating.

The communication blocks manage the transition from stage to stage. As packet information moves forward, the communication block can perform load balancing or route a packet to a particular engine. Although the direction of progress is usually “forward” (left to right, as shown in Figure 3), there are times when a packet must move backwards. An example of this is with IPv4/v6 forwarding, when an IPv6 packet is tunneled on IPv4. Once the IPv4 packet is decapsulated, an internal IPv6 packet is found, and it must go back for IPv6 decapsulation.

Access to the pipeline is provided by a block that takes each packet and delivers the critical parts to the pipeline. Because this block is in the critical path for every packet, it must be very fast, and has been designed by Teja for very high performance.

The result of this structure is that each MicroBlaze processor and offload can be working on a different packet at any given time. High performance is achieved because many in-flight packets are being handled at once.

The key to this structure is its scalability. Anytime additional performance is needed, you can add more parallel processing, or create another pipeline stage. The reverse is also true: if a given pipeline provides more performance than the target system requires, you can remove engines, making the subsystem more economical.

The Rest of the Subsystem

What is so powerful about the combination of Teja’s data-plane engine and the Virtex-4 FX devices is that most of the rest of the subsystem can be moved on-chip. Much of the external memory can now be moved into internal block RAM. Some external memory will still be required, but high-speed DRAM can be directly accessed by the Virtex-4 family, so no intervening glue is required. The chips have built-in Ethernet MACs which, combined with the available PHY IP and RocketIO™ technology, allow direct access from Ethernet ports onto the chip.

The integrated PowerPC cores (as many as two) allow you to implement the slow path and even the entire control plane on the same chip over an embedded operating system such as Linux. You can also provide control access through serial and Ethernet ports using existing IP.

As a result, the entire subsystem shown in Figure 2 (with the exception of some external memory) can be implemented on a single chip, as illustrated in Figure 4.

Figure 3 – Teja packet-processing pipeline
(Diagram labels: Port Access (RX); Processing Engine; Communication IP; Port Access (TX))

Figure 4 – Single-chip packet-processing subsystem
(Diagram labels: Virtex-4 FX; MAC; PHY; MGT; UART; Control Access; PowerPCs (Slow Path, Control Plane); Fast Path Processor; Fast-Path Pipeline; Fast-Path Offloads; Memory (Block RAM); Memory Controller; External DRAM. Legend: Dedicated Silicon; Software-Programmable; RTL Built Out of Logic Fabric)

Flexibility: Customizing, Resizing, Upgrading

Teja’s packet-processing infrastructure provides access to our company’s real strength: providing data-plane applications that you can customize. We deliver applications such as packet forwarding, TCP, secure gateways, and others with source code. The reason for delivering source code is that if you need to customize the operation of the application, you can alter the delivered application using straight ANSI C. Even though you are using an FPGA, it is still software-programmable, and you can design using standard software methods.

An application as delivered by Teja is guaranteed to operate at a given line rate. When you modify that application, however, the performance may change. Teja’s scalable infrastructure allows you to tailor the processor architecture to accommodate the performance requirements in light of changed functionality.

In a non-FPGA implementation, if you cannot meet performance, then you typically have to go to a much larger device, which will most likely be under-utilized (but cost full price). The beauty of FPGA implementation is that the pipeline can be tweaked to be just the right configuration, and only the amount of hardware required is used. The rest is available for other functions.

One of the most important aspects of software programmability is field upgrades. With a software upgrade, you can change your code – as long as you stay within the amount of code store available. As the Teja FPGA packet engine is software-programmable, you can perform software upgrades. But because it uses an FPGA, you can also upgrade the underlying hardware in the field. For example, if a software upgrade requires more code store than is available, you can make a hardware change to make more code store available, and then the software upgrade will be successful. Only an FPGA provides this flexibility.

Because a structure like this is typically designed by high-level system designers and architects, it is important that ANSI C is the primary language. At the lowest level, the hardware infrastructure, the mappings between software and hardware, and the software programs themselves are expressed in C. Teja has created an extensive set of APIs that allow both compile-time and real-time access from the software to the various hardware resources. Additional tools will simplify the task of implementing programs on the pipeline.

IPv4 Forwarding Provides Proof

Teja provides IPv4 and IPv6 forwarding as a complete data-plane application. IPv4 is a relatively simple application that can illustrate the power of this packet engine. It is the workhorse application under most of the Internet today. IPv6 is gradually gaining some ground, with its promise of plenty of IP addresses for the future, but for now IPv4 still dominates. At its most basic, IPv4 comprises the following functions:

• Filtering

• Decapsulation

• Classification

• Validation

• Broadcast check

• Lookup/Next Hop calculation

• Encapsulation

Teja has implemented these in a two-stage pipeline, as shown in Figure 5. Offloads are used for the following functions:

• Checksum calculation

• Hash lookup

• Longest-prefix match

• Memory access

This arrangement provides full gigabit line-rate processing of a continuous stream of 64-byte packets, which is the most stringent Ethernet load.

Conclusion

Teja Technologies has adapted its packet-processing technology to the Virtex-4 FX family, creating an infrastructure of IP blocks and APIs that take advantage of Virtex-4 FX features. The high-level customizable applications that Teja offers can be implemented using software methodologies on a MicroBlaze multi-core fabric while achieving speeds higher than a gigabit per second. Software programmability adds to the flexibility and ease of design already inherent in the Virtex family.

The flexibility of the high-level source code algorithms is bolstered by the fact that the underlying hardware utilization can be specifically tuned to the performance requirements of the system. And once deployed, both software and hardware upgrades are possible, dramatically extending the potential life of the system in the field.

Teja Technologies, the Virtex-4 FX family, and the MicroBlaze core provide a single-chip customizable, resizable, and upgradable packet-processing solution.

Figure 5 – IPv4 forwarding engine
(Diagram labels: Port Access (RX); Stage 1: Filtering, Decapsulation, Classification, Validation, Broadcast Check; Stage 2: Next Hop Lookup, Encapsulation; Communication IP; Port Access (TX))

Accelerating FFTs in Hardware Using a MicroBlaze Processor

A simple FFT, generated as hardware from C language, illustrates how quickly a software concept can be taken to hardware and how little you need to know about FPGAs to use them for application acceleration.

by John Williams, Ph.D.
CEO
PetaLogix
[email protected]

Scott Thibault, Ph.D.
President
Green Mountain Computing Systems, [email protected]

David Pellerin
CTO
Impulse Accelerated Technologies, [email protected]

FPGAs are compelling platforms for hardware acceleration of embedded systems. These devices, by virtue of their massively parallel structures, provide embedded systems designers with new alternatives for creating high-performance applications.

There are challenges to using FPGAs as software platforms, however. Historically, low-level hardware descriptions had to be written in VHDL or Verilog, languages that are not generally part of a software programmer’s expertise. Other challenges have included deciding how and when to partition complex applications between hardware and software and how to structure an application to take maximum advantage of hardware parallelism.

Tools providing C compilation and optimization for FPGAs can help solve these problems by providing a new level of programming abstraction. When FPGAs first appeared two decades ago, the primary method of design for these devices was the venerable schematic. FPGA application developers used schematics to assemble low-level components (registers, logic gates, and larger blocks such as counters and adders/subtractors) to create FPGA-based systems. As FPGA devices became more complex and applications targeting them grew larger, schematics were gradually replaced by higher level methods involving hardware description languages like VHDL and Verilog. Now, with ever-higher FPGA gate densities and the proliferation of FPGA embedded processors, there is strong demand for even higher levels of abstraction. C represents that next generation of abstraction, allowing you to access the resources of FPGAs for application acceleration.

For applications that involve embedded processors, a C-to-hardware tool such as Impulse C (Figure 1) can abstract away many of the details of hardware-to-software communication, allowing you to focus on application partitioning without having to worry about the low-level details of the hardware. This also allows you to experiment with alternative software/hardware implementations.

Figure 1 – Impulse C custom hardware accelerators run in the FPGA fabric to accelerate µClinux processor-based applications.

Although such tools can dramatically improve your ability to create FPGA-based applications, for the highest performance you still need to understand certain aspects of the underlying hardware. In particular, you must understand how partitioning decisions and C coding styles will impact performance, size, and power usage. For example, the acceleration of critical computations and inner-code loops must be balanced against the expense of moving data between hardware and software. Fortunately, modern tools for FPGA compilation provide various types of analysis tools that can help you more clearly understand and respond to these issues.

Practically speaking, the initial results of software-to-hardware compilation from C-language descriptions will not equal the performance of hand-coded VHDL, but the turnaround time to get those first results working may be an order of magnitude better. Performance improvements occur iteratively, through an analysis of how the application is being compiled to the hardware and through the experimentation that C-language programming allows.

Graphical tools (see Figure 2) can help to provide initial estimates of algorithm throughput such as loop latencies and pipeline effective rates. Using such tools, you can interactively change optimization options or iteratively modify and recompile C code to obtain higher performance. Such design iterations may take only a matter of minutes when using C, whereas the same iterations may require hours or even days when using VHDL or Verilog.

Figure 2 – A dataflow graph allows C programmers to analyze the generated hardware and perform explorative optimizations to balance tradeoffs between size and speed. Illustrated in this graph is the final stage of a six-stage pipelined loop. This graph also helps C programmers understand how sequential C statements are parallelized and optimized.

Case Study: Accelerating an FFT

The Fast Fourier Transform (FFT) is an example of a DSP function that must accept sample data on its inputs and generate the resulting filtered values on its outputs. Using C-to-hardware tools, you can combine traditional C programming methods with hardware/software partitioning to create an accelerated DSP application. The FFT developed for this example is compatible with any Xilinx® FPGA target, and demonstrates that you can achieve results similar to hand-coded HDL without resorting to low-level programming methods.

Our FFT, illustrated in Figure 3, utilizes a 32-bit stream input, a 32-bit stream output, and two clocks, allowing the FFT to be clocked at a different rate than the embedded processor with which it communicates. The algorithm itself is described using relatively straightforward, hardware-independent C code, with some minor C-level optimizations for increased parallelism and performance.

Figure 3 – The FFT includes a 32-bit stream input, a 32-bit stream output, and two clocks, allowing the FFT to be clocked at a different rate than the embedded processor.

The FFT is a divide-and-conquer algorithm that is most easily expressed recursively. Of course, recursion is not possible on the FPGA, so the algorithm must be implemented using iteration instead. In fact, almost all software implementations are written iteratively (using a loop) for efficiency. Once the algorithm has been implemented as a loop, we are able to enable the automatic pipelining capabilities of the Impulse compiler.

Pipelining introduces a potentially high degree of parallelism in the generated logic, allowing us to achieve the best possible throughput. Our radix-4 FFT algorithm on 256 samples requires approximately 3,000 multiplications and 6,000 additions. Nonetheless, using the pipelining feature of Impulse C, we were able to generate hardware to compute the FFT in just 263 clock cycles.

We then integrated the resulting FFT hardware processing core into an embedded Linux (µClinux) application running on the Xilinx MicroBlaze™ soft-processor core. MicroBlaze µClinux is a free Linux-variant operating system ported at the University of Queensland and commercially supported by PetaLogix.

The software side of the application, running under the control of the operating system, interacts with the FFT through data streams to send and receive data, and to initialize the hardware process. The streams themselves are defined using abstract communication methods provided in the Impulse C libraries. These stream communication functions include functions for opening and closing data streams and reading and writing those streams. Other functions allow the size (width and depth) of the streams to be defined.

By using these functions on both the software and hardware sides of the application, it is easy to create applications in which hardware/software communication is abstracted through a software API. The Impulse compiler generates appropriate FIFO buffers and Fast Simplex Link (FSL) interconnections for the target platform, thereby saving you from the low-level hardware design that would otherwise be needed.

Embedded Linux Integration

The default Impulse C tool flow targets a standalone MicroBlaze software system. In some applications, however, a fully featured operating system like µClinux is required. Advantages of embedded Linux include a familiar development environment (applications may be prototyped on desktop Linux machines), a feature-rich set of networking and file storage capabilities, a tremendous array of existing software, and no per-unit distribution royalties.

The µClinux (pronounced “you-see-Linux”) operating system is a port of the open-source Linux version 2.4. The µClinux kernel is a compact operating system appropriate for a wide variety of 32-bit processor cores that have no memory management unit (MMU). µClinux supports a huge range of microprocessor architectures, including the Xilinx MicroBlaze processor, and is deployed in millions of consumer and industrial embedded systems worldwide.

Integrating an Impulse C hardware core into µClinux is straightforward; the Impulse tools include support for µClinux and can generate the required hardware/software interfaces automatically, as well as generate a makefile and associated software libraries to implement the streaming and other functions mentioned previously. Using the Xilinx FSL hardware interface, combined with a freely available generic FSL device driver in the MicroBlaze µClinux kernel, makes the process of connecting the software application to the Impulse C hardware accelerator relatively easy.

/* example 1 - simple use of ImpulseC-generated HW coprocessor and
 * Linux FSL driver */

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUFSIZE 1024

/* Application-dependent I/O routines, defined elsewhere */
void get_input_data(unsigned int *buf);
void send_output_data(unsigned int *buf);

int main(void)
{
    unsigned int buffer[BUFSIZE];

    /* Open the FSL device (Impulse HW coprocessor) */
    int fd = open("/dev/fslfifo0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/fslfifo0");
        return 1;
    }

    while (1) {
        /* Get incoming data - application dependent */
        get_input_data(buffer);

        /* Send data to ImpulseC HW processor on FSL port */
        write(fd, buffer, BUFSIZE * sizeof(buffer[0]));

        /* Read the processed data back from the HW coprocessor */
        read(fd, buffer, BUFSIZE * sizeof(buffer[0]));

        /* Do something with the data - application dependent */
        send_output_data(buffer);
    }
}

Figure 4 – Simple communication between µClinux applications and ImpulseC hardware using the generic FSL FIFO device driver

The generic FSL device driver maps the FSL ports onto regular Linux device nodes, named /dev/fslfifo0 through /dev/fslfifo7, with the numbers corresponding to the physical FSL channel ID.

The FIFO semantics of the FSL channels map naturally onto the standard Linux software FIFO model, and to the streaming programming model of Impulse C. An FSL port may be opened, read, or written to, just like a normal file. Here is a simple example that shows how easily a software application can interface to a hardware co-processing core through the FSL interconnect (Figure 4).

You can easily modify this basic structure to further exploit the parallelism available. One easy performance improvement is to overlap I/O and computation, using a double-buffering approach (Figure 5).

From these basic building blocks, you are ready to tune and optimize your application. For example, it becomes a simple matter to instantiate a second FFT core in the system, connect it to the MicroBlaze processor, and integrate it into an embedded Linux application.

An interesting benefit of the embedded Linux integration approach is that it allows developers to take advantage of all that Linux has to offer. For example, with the FFT core mapped onto FSL channel 0, we can use MicroBlaze Linux shell commands to drive and test the core:

$ cat input.dat > /dev/fslfifo0 & cat /dev/fslfifo0 > output.dat

Linux symbolic links permit us to alias the device names onto something more user-friendly:

$ ln -s /dev/fslfifo0 fft_core

$ cat input.dat > fft_core & cat fft_core > output.dat

Conclusion

Although our example demonstrates how you can accelerate a single embedded application using one FSL-attached accelerator, Xilinx Platform Studio tools also permit multiple MicroBlaze CPUs to be instantiated in the same system, on the same FPGA. By connecting these CPUs with FSL channels and employing the generic FSL device driver architecture, it becomes possible to create a small-scale, single-chip multiprocessor system with fast inter-processor communication. In such a system, each CPU may have one or more hardware acceleration modules (generated using Impulse C), providing a balanced and scalable multi-processor hybrid architecture. The result is, in essence, a single-chip, hardware-accelerated cluster computer.

To discover what reconfigurable cluster-on-chip technology combined with C-to-hardware compilation can do for your application, visit www.petalogix.com and www.impulsec.com.


/* example 2 - Overlapping communication and computation to exploit
 * parallelism */

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUFSIZE 1024

/* Application-dependent I/O routines, defined elsewhere */
void get_input_data(unsigned int *buf);
void send_output_data(unsigned int *buf);

int main(void)
{
    unsigned int buffer1[BUFSIZE], buffer2[BUFSIZE];
    unsigned int *buf1 = buffer1;
    unsigned int *buf2 = buffer2;
    unsigned int *tmp;

    /* Open the FSL device (Impulse HW coprocessor) */
    int fd = open("/dev/fslfifo0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/fslfifo0");
        return 1;
    }

    /* Get incoming data - application dependent */
    get_input_data(buf1);

    while (1) {
        /* Send data to ImpulseC HW processor on FSL port */
        write(fd, buf1, BUFSIZE * sizeof(buf1[0]));

        /* Read more data while HW coprocessor is working */
        get_input_data(buf2);

        /* Read the processed data back from the HW processor */
        read(fd, buf1, BUFSIZE * sizeof(buf1[0]));

        /* Do something with the data - application dependent */
        send_output_data(buf1);

        /* Swap buffers */
        tmp = buf1;
        buf1 = buf2;
        buf2 = tmp;
    }
}

Figure 5 – Overlapping communication and computation for greater system throughput

by Gary Drossel
Director of Product Marketing
SiliconSystems, [email protected]

Embedded systems often operate in less than ideal power conditions. Power disturbances ranging from spikes to brown-outs can cause a significant amount of data and storage system corruption, causing field failures and potential loss of revenue from equipment returns. You must consider how your storage solution will operate in environments with varying power input stability. If the host system loses power in the middle of a write operation, critical data may be overwritten or sector errors may result, causing the system to fail.

Data Corruption

The host system reads and writes data in minimum 512-byte increments called sectors. Data corruption can occur when the system loses power during the sector write operation, either because the system did not have time to finish or because the data was not written to the proper location.

In the first scenario, the data in the sector does not match the sector’s error-checking information. A read sector error will occur the next time the host system attempts to read that sector. Many applications that encounter such an error will automatically produce a system-level error that will result in system downtime until the error is corrected.

Eliminating Data Corruption in Embedded SystemsEliminating Data Corruption in Embedded Systems


SiliconSystems’ patented PowerArmor technology eliminates unscheduled system downtime caused by power disturbances.SiliconSystems’ patented PowerArmor technology eliminates unscheduled system downtime caused by power disturbances.
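The first corruption scenario, where power is lost partway through a sector write, can be illustrated with a small software model. This is only a sketch under stated assumptions: the `sector_t` layout, the function names, and the use of a CRC-16 as the error-checking information are all hypothetical stand-ins, since the article does not describe the drive's actual ECC scheme.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical on-media layout: 512-byte payload plus stored check value. */
typedef struct {
    uint8_t  data[SECTOR_SIZE];
    uint16_t check;
} sector_t;

/* CRC-16 (polynomial 0x1021, initial value 0xFFFF), used here only as a
 * stand-in for the drive's real error-checking information. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < n; i++) {
        crc ^= (uint16_t)((uint16_t)p[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Write a sector, "losing power" after bytes_before_cut bytes.
 * Only a complete write updates the check value. */
static void sector_write(sector_t *s, const uint8_t *src, size_t bytes_before_cut)
{
    size_t n = bytes_before_cut < SECTOR_SIZE ? bytes_before_cut : SECTOR_SIZE;
    memcpy(s->data, src, n);
    if (n == SECTOR_SIZE)
        s->check = crc16(s->data, SECTOR_SIZE);
}

/* Returns 1 if the data matches its check value, 0 for a read sector error. */
static int sector_read_ok(const sector_t *s)
{
    return crc16(s->data, SECTOR_SIZE) == s->check;
}
```

In this model, a write that is cut off after 256 bytes leaves half new data and half old data under the old check value, so the next read reports an error even though the storage itself is undamaged. Rewriting the full sector clears the condition.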

In addition to losing the data, the solid-state drive may thereafter interpret the sector that returned the read error as defective, and may unnecessarily replace it with a spare. In general, about 1% of the storage capacity is reserved for spares. Once the factory-defined spares are depleted, the drive will fail and must be replaced.

In another potential problem, a brown-out or low-power condition could cause the address lines of the storage device to become unstable. If this happens – but there is still enough power to program the non-volatile storage component – data could be written to the wrong address. This again could result in a critical failure, requiring drive replacement.

These types of errors can be frustrating for you to find. At the point the drive fails, it is sent back to the factory. The factory then validates the error during the failure analysis process. However, for the vendor to determine whether or not it is a component failure, they must re-initialize the drive back to factory settings. After re-initialization, the drive operates normally and is sent back. This is the typical return scenario that flags a power issue.

The critical point to understand is that the drive is not physically damaged, nor is it “worn out.” In both cases, the data will be lost and downtime will occur, but the drive can be used again.

Solid-State Drives versus Flash Cards

Not all solid-state storage products are created equal. Even though they physically look the same, solid-state drives in a CompactFlash (CF) form factor differ greatly from CompactFlash cards designed for the consumer electronics market. The inherent technology differences are outlined in Table 1.

The term “standard CompactFlash card” can be a bit misleading. Flash cards must pass the test suite defined by the CompactFlash Association (CFA). These tests allow for a relatively wide range in some timing parameters, which can vary with both hardware and firmware changes. Most consumer-oriented products have enough other system overhead to render these changes insignificant.

Embedded systems, on the other hand, usually have very strict timing requirements. Even the slightest changes to the host interface can cause significant issues.

The most industry-recognized differentiator between solid-state storage and commercial flash cards is in write/erase cycle endurance. Solid-state drives have superior error-correction capabilities and wear leveling algorithms that can significantly extend the life of standard storage components. The power issues I’ve described are significantly less well-known but are much more prevalent in embedded applications. Approximately 75% of field failures are the result of power-related corruption.

Solid-state drives specifically designed for embedded systems can integrate various techniques for mitigating these power-related issues. These techniques are not economically viable for consumer-based applications, but are essential to eliminate unscheduled downtime in embedded systems.

Parameter                   SiliconSystems SiliconDrive CF   CompactFlash Card
Write/Erase Endurance       >2 M Cycles per Block            <100 K Cycles per Card
Error Correction (ECC)      6 bit                            1-2 bit
Wear Leveling Algorithm     Over the Entire SiliconDrive     Over Free Space Only
Power-Down Protection       Yes                              No
R/W Speed, Single Sector    1-2 MBps                         30-40 KBps
R/W Speed, Large File       6-8 MBps                         3-5 MBps
Re-Qualification Cycle      3-5 Years                        1 Year
Maximum Capacity            8 GB                             4 GB

Table 1 – Key performance metrics distinguishing solid-state drives designed for embedded systems from flash cards used in consumer electronic devices

SiliconSystems patented its PowerArmor technology to eliminate drive corruption caused by power disturbances. Figure 1 shows how PowerArmor integrates voltage-detection circuitry to provide an early warning of a possible power anomaly. Once a voltage threshold has been reached, the SiliconDrive sends a busy signal to the host so that no more commands are received until the power level stabilizes.

[Figure 1 diagram labels: Host System; R/B; PWR; VIN; Address/Data/Control; SiliconDrive; PowerArmor (Voltage Detection and Regulation, Ready/Busy Logic); 512-Byte Buffer; Solid-State Storage Array]

Figure 1 – SiliconDrive with integrated PowerArmor technology

Next, address lines are latched (as shown in Figure 2) to ensure that data is written to the proper location. In contrast, most microcontroller-based flash cards’ address lines can float to undetermined states if input power drops below the specified minimum operating level.
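The sequence described above (voltage threshold reached, busy signal raised, address lines latched) can be sketched as a small behavioral model. This is an illustration only, not SiliconSystems’ implementation: the threshold value, the `power_guard_t` type, and both function names are assumptions, since the article gives no voltage figures or circuit details.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold in millivolts; the article does not give a value. */
#define VCC_THRESHOLD_MV 3000

typedef struct {
    bool     busy;          /* ready/busy line presented to the host */
    bool     addr_latched;  /* true once the address has been frozen */
    uint32_t latched_addr;  /* address in effect when the threshold tripped */
} power_guard_t;

/* Call on every voltage sample; live_addr is what the address lines read now. */
static void power_guard_update(power_guard_t *g, int vcc_mv, uint32_t live_addr)
{
    if (vcc_mv < VCC_THRESHOLD_MV) {
        g->busy = true;               /* hold off further host commands */
        if (!g->addr_latched) {       /* latch once, at the first low sample */
            g->latched_addr = live_addr;
            g->addr_latched = true;
        }
    } else {
        g->busy = false;              /* power is stable again */
        g->addr_latched = false;
    }
}

/* Address the storage array should use for the write in flight. */
static uint32_t power_guard_addr(const power_guard_t *g, uint32_t live_addr)
{
    return g->addr_latched ? g->latched_addr : live_addr;
}
```

While the supply is below the threshold, the model ignores whatever the live address lines read and keeps using the latched value, which is the property that prevents a write to the wrong location.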

SiliconDrive technology integrates a RISC-based DSP that allows for a 512-byte buffer size. This is 1/4 to 1/8 the size of the buffer in a microcontroller-based flash card. The smaller buffer size limits the time that data is in volatile memory, thereby reducing the power and time required to clear the buffer, update the control bytes, and complete the data transaction.

Larger buffer sizes require additional capacitance to hold the power up longer. This extra capacitance is physically very large and will generally not scale to form factors smaller than 2.5- or 3.5-inch drives. SiliconDrive technology does not require this added capacitance and can therefore scale to virtually any industry-standard form factor.

Testing for Power Issues

Many companies have included power-down testing in the qualification process for any new storage media they are considering. The ability to handle power problems has an impact on overall reliability, customer goodwill, and total cost of ownership.

You must consider many factors when developing a test system that characterizes the effects of power disturbances on the storage device. First, the power-down ramp rate on power and all I/O pins must obviously be fast enough to emulate the loss of system power. In the case of implementing a CF or PC card form factor, the ramp rate must also be able to emulate the steeper ramp associated with pulling out the product during a write cycle.

A robust, repeatable power-down test requires power disturbances at various intervals of the write cycle. The ability to strictly control and repeat this variable will expose any weaknesses in the write cycle and accelerate any failures. It is also extremely important to cut power to I/O and Vcc at the same time. Many test platforms may only cut power to Vcc and leave the I/O powered. This will result in invalid test data and potential damage to the device under test.
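One way to make the power-cut timing repeatable is to build the test plan up front. The sketch below assumes a tester that accepts a list of cut events; the `cut_event_t` type, its fields, and the function names are hypothetical and are not taken from the SiliconSystems tester. It spaces the cuts evenly across the write cycle and always cuts Vcc and I/O together.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One power-cut event in the test plan (field names are illustrative). */
typedef struct {
    uint32_t delay_us;  /* time from start of the write to the power cut */
    bool     cut_vcc;   /* remove Vcc */
    bool     cut_io;    /* remove power from the I/O pins */
} cut_event_t;

/* Fill plan[] with up to `steps` cuts spread evenly over the write cycle.
 * Returns the number of events written. */
static size_t build_cut_plan(uint32_t write_cycle_us, size_t steps,
                             cut_event_t *plan, size_t capacity)
{
    size_t n = steps < capacity ? steps : capacity;
    for (size_t i = 0; i < n; i++) {
        plan[i].delay_us = (uint32_t)(((uint64_t)write_cycle_us * i) / steps);
        plan[i].cut_vcc = true;
        plan[i].cut_io = true;
    }
    return n;
}

/* A cut is only meaningful if Vcc and I/O drop at the same time. */
static bool cut_event_valid(const cut_event_t *e)
{
    return e->cut_vcc && e->cut_io;
}
```

For a 1,000 µs write cycle and four steps, the plan contains cuts at 0, 250, 500, and 750 µs. Replaying the same plan on each candidate device keeps the comparison between storage media fair.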

The amount of verification performed will directly impact the test time. Testing after every cycle could be prohibitive, and read sector errors may occur at any point in time – yet the storage device may continue to operate. The worst-case scenario is that the first read sector error occurs in a system file or critical data area. You may choose to run the test until it encounters the first read sector error or until the drive becomes inoperable.

SiliconSystems has designed a general-purpose power-down tester based on the Xilinx® ML401 development platform. This system, along with available source code to facilitate multi-threaded operation, allows you to quickly qualify and validate various storage media based on individual application needs.

Conclusion

Solid-state storage will continue to replace rotating hard disk drives in embedded systems as the cost per usable gigabyte (that is, the cost for the capacity required by the application) continues its rapid decline. The immense success of Apple’s iPod nano, which uses solid-state storage instead of hard drives, is currently the most visible example of things to come.

This trend will, however, come with a price. The engineering tradeoffs between storage component reliability – lower write/erase cycle endurance and higher bit-error rates – and cost will accelerate the need for PowerArmor, enhanced error correction, and wear leveling capabilities. In addition, new technologies like SiliconSystems’ SiSMART, which monitors and predicts the SiliconDrive’s remaining usable life span, will be required for embedded systems with multi-year deployments. This will drive an even larger differentiation between solid-state drives designed specifically for embedded systems and flash cards designed for consumer electronic applications.

For more information, visit www.siliconsystems.com.


[Figure 2 diagram labels: Vcc Power Supply; Threshold; Address Lines; Latched Address Lines; Data Lines]

Figure 2 – Latching the address lines when Vcc reaches its lower threshold will ensure data is written to the proper location.

A “little” Mistake That Could Cost You $100,000 Per Year

SiliconDrive is certified for use with the Xilinx®

System ACE™ Development Environment.

To learn more about SiliconDrive and its many advantages, visit us at www.siliconsystems.com/xilinx

A “little” mistake is in thinking that all solid-state storage is engineered equal.

Not even close.

SiliconSystems SiliconDrive is advanced storage technology engineered to overcome the problems associated with hard drives and flash cards.

What does this mean to you?

For starters, thousands of dollars in costs savings from eliminating unscheduled downtime resulting from field failures, product wear-out or forced product re-qualifications.

SiliconSystems™, SiliconDrive™, PowerArmor™ and SiSMART™ and the SiliconSystems logo are trademarks or registered trademarks of SiliconSystems and may be used publicly only with permission from SiliconSystems and require proper acknowledgement. Other listed names and brands and trademarks are trademarks or registered trademarks of their respective owners.

Boost Your Processor-Based Performance with ESL Tools

Shorten design cycles and lower costs while increasing performance for Xilinx FPGAs using Poseidon’s breakthrough ESL technology.

by Farzad Zarrinfar
VP of Worldwide Sales & Marketing
Poseidon Design Systems
[email protected]

Bill Salefski
VP of Technology
Poseidon Design [email protected]

Stephen Simon
Sr. Director of Marketing and Business Development
Poseidon Design [email protected]

Rapid increases in FPGA platform capability are fueling a corresponding increase in complex SOC designs. This added complexity is pushing designers to rethink their traditional design flow. Designers are looking to system-level tools to aid them in making critical architectural decisions; they are also using pre-defined cores to simplify their design tasks and reduce the overall design cycle. The addition of embedded processors like the PowerPC™ and Xilinx® MicroBlaze™ soft-core processor has greatly expanded the capability of these platforms and has likewise added to the overall complexity.

The embedded processor provides real advantages over discrete implementations when developing complex systems: reduced silicon area, higher flexibility, and a shorter design cycle. These advantages are confirmed with the current rapid growth of embedded FPGA designs, but they do have some limitations. The major challenges of these designs are:

• Insufficient processor performance to support ever-increasing functional demands

• Complex memory hierarchy and system architecture bottlenecks

• Hardware/software partitioning

• Implementing closely coupled interfaces such as APU or FSL

• Productivity limitations and dealing with the complexity of RTL design

Processor solutions run out of processing power rapidly when implementing math-intensive operations. It is very costly to simply increase the clock rate or move to the newest technology to enable the design to be scalable. These designs must be able to scale the performance of the processor architecture to match the ever-increasing demand placed on the system. It is costly to move to new processor architectures or utilize higher performance FPGAs.

The second critical issue is the need to support complex memory hierarchies and system peripherals. To reach the performance potential of modern-day processors, designers have turned to memories and architectures that rely on special burst modes and caching to keep up with processor data requirements. It is critical that the memories and architectures operate in these modes to have a high-efficiency system design.

Effective hardware and software partitioning can be a confusing task. It is difficult to determine the proper mix of hardware and software to meet system requirements without the costly and time-consuming task of moving large portions of an algorithm to hardware.

Special interfaces bring expanded capability to the system architecture. These features add an efficient high-speed interface to the processor core and are ideal for partitioning certain processing-intensive functions to hardware. But this capability does come at a price. These interfaces are complex and take a substantial amount of time and expertise to effectively utilize when interfacing to custom logic.

Designing an efficient processor-based system architecture that delivers overall system performance and is optimized for a specific application is not trivial; accomplishing this feat requires breakthrough tools and methodologies to augment your architectural design methodology. It is also difficult to perform an effective partition of the hardware/software early in the design, which is critical to an effective development of the architecture.

To maximize the benefits of processor-based designs, you must verify architectures early in the design cycle. You cannot wait until RTL development to discover that your architecture does not support your system requirements. This “redesign loop” quickly erodes the time-to-market advantages that make FPGA-based systems the preferred choice.

With Poseidon tools, you can overcome these design challenges.

Triton Tool Suite

Poseidon’s Triton Tool suite is a system design and acceleration environment that enables you to quickly develop, analyze, and optimize system architectures. With these ESL tools, the abstraction level of the design is above RTL – you can quickly address system issues without having to solve the detailed issues surrounding RTL implementation. Thus, you can quickly perform architectural “what-if” analysis and accelerate time to market. The tool suite was created specifically for processor-based systems requiring efficient, robust architectures with the need to optimize performance, power, and cost.

Triton comprises two main tools:

• Triton Tuner – a system and software analysis tool

• Triton Builder – a hardware/software partitioning tool

Triton Tuner is a simulation and analysis environment based on SystemC. Simulation is performed at the transaction level from models of both the processor and surrounding buses and peripherals. Using Tuner, you co-simulate the hardware with the application software.

During simulation, the tool collects both hardware and software performance data. The environment then provides tools to visualize and analyze the data, reducing the effort required to identify inefficiencies in the design. Figure 1 shows a bus activity graph – a visualization tool to help identify bus congestion and bottlenecks in the system design.

Figure 1 – Bus activity graph

Triton Tuner also links hardware and software events, giving you a key tool in determining causal relationships between hardware and software systems. This feature greatly simplifies the task of evaluating and optimizing system performance. Triton Tuner also provides tools to optimize the memory structure. This can greatly reduce the need for costly caches and other high-speed memories.

Triton Builder is an automated partitioning tool that simplifies the task of accelerating the performance of processor-based algorithms. With Triton Builder, you can easily offload compute-intensive loops and functions from software to hardware. The tool automatically creates a direct memory access (DMA)-based hardware accelerator to perform the offloaded task. All hardware and source-code modifications are also created to greatly simplify the re-partitioning process.

With the Builder tool, you can control a number of key parameters with which to optimize the solution to your system requirements. Together, these tools provide a system design and optimization environment that unobtrusively plugs into a Xilinx EDK design flow, enabling you to quickly make dramatic improvements in the performance and power consumption of your application.

Triton Builder generates the RTL for the complete accelerator, as well as the driver required to invoke the accelerator, and inserts the driver into the proper place in the original application. The RTL can be generated in either Verilog or VHDL. This integral generation process ensures that necessary hardware and software components are exactly matched for trouble-free design. The tool also generates an RTL test bench and SystemC model for verifying the new hardware.

FPGA-Based Accelerator Design Architecture

The Builder tools create a design architecture ideally matched to FPGA design. Using Triton Builder, you can quickly create a peripheral using the C code that runs on the processor. The tool moves the computationally intensive portions of the code into a hardware accelerator. This acceleration hardware is connected directly onto the processor bus or the tightly coupled interface (APU or FSL) and is implemented on the FPGA fabric. A block diagram of a typical accelerated architecture is shown in Figure 2.

If, while executing the application code, the processor hits a section of code that has been moved to hardware, the inserted driver passes control to the accelerator, which then performs the accelerated function. The accelerator runs independently from the processor, freeing it to perform other tasks. When the accelerated task is completed, the results are passed back to the application program. You can implement multiple independent accelerators within the same system.
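The hand-off described above can be modeled in plain C to show the control flow. This is a software simulation under stated assumptions, not code generated by Triton Builder: the `accel_t` type and every function name are hypothetical, the offloaded loop (squaring each element) is an arbitrary example, and `accel_tick` stands in for hardware that would really advance on its own clock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Software stand-in for one accelerator peripheral (all names hypothetical). */
typedef struct {
    const int *src;
    int       *dst;
    size_t     len;
    size_t     pos;   /* elements processed so far */
    bool       busy;
} accel_t;

/* Driver entry point: hand the offloaded loop to the accelerator and return. */
static void accel_start(accel_t *a, const int *src, int *dst, size_t len)
{
    a->src = src;
    a->dst = dst;
    a->len = len;
    a->pos = 0;
    a->busy = (len > 0);
}

/* Simulation only: advance the "hardware" by one element. */
static void accel_tick(accel_t *a)
{
    if (!a->busy)
        return;
    a->dst[a->pos] = a->src[a->pos] * a->src[a->pos];  /* offloaded computation */
    if (++a->pos == a->len)
        a->busy = false;
}

static bool accel_done(const accel_t *a)
{
    return !a->busy;
}

/* Application code after partitioning: start the accelerator, keep working,
 * and return once results are ready. Returns the units of other work done. */
static unsigned run_with_accelerator(accel_t *a, const int *in, int *out, size_t n)
{
    unsigned other_work = 0;
    accel_start(a, in, out, n);
    while (!accel_done(a)) {
        other_work++;   /* the processor is free for other tasks here */
        accel_tick(a);  /* in real hardware this happens in parallel */
    }
    return other_work;
}
```

The point of the sketch is the shape of the driver call: a non-blocking start, useful work on the processor in the meantime, and a completion check before the results are consumed.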

To create a complete accelerator peripheral, you need more than just a C synthesis tool. Poseidon Builder includes not just a C-to-RTL synthesizer that creates the compute core; it also generates the communication and control hardware. The Poseidon tool creates a complete accelerator peripheral, the block diagram of which is shown in Figure 3. In addition to the compute core, the accelerator includes a multi-channel DMA controller, bus interface, local memories, and other logic to create a plug-and-play peripheral.

[Figure 2 diagram labels: Applications in ANSI C (Thread 1, Thread 2, Thread 3); Poseidon-Generated Accelerator and Driver; MicroBlaze or PowerPC; Driver; Cache; OPB or PLB Bus; Accelerator 1; Accelerator 2; Memory Controller; Off-Chip Memory; I/O]

Figure 2 – Accelerated system architecture

[Figure 3 diagram labels: OPB or PLB Bus; Bus Interface; DMA Controller; Local Memory; Compute Core]

Figure 3 – Accelerator block diagram

Interfacing to Xilinx Tools

Triton Tools are extremely flexible, allowing you to use Triton Tuner and Triton Builder independently or together as an integrated suite. These tools were developed to enhance the productivity in developing processor-based designs and vastly increase their capability.

Triton Tools link seamlessly to EDK tools. A typical system design flow is usually an iterative process where you would analyze system performance, determine inefficiencies, modify the system, and check the resulting performance. When the architecture performs to the desired level, the system is then transferred back into the Xilinx tool chain. Triton Tools accelerate the process of identifying problem areas and establish an integrated flow that allows you to move between the tools to develop the optimal architecture.

A typical flow (shown in Figure 4) comprises these steps:

• The designer develops the target architecture using selected Xilinx tools

• The architecture description is read from the microprocessor hardware specification file (.mhs) from EDK, and the ANSI C application source code is read into the Triton tools

• Triton Tuner profiles the ANSI C code, reveals bottlenecks in the code or architecture, and eliminates inefficiencies

• The designer selects which software will be partitioned to hardware

• Triton Builder partitions compute-intensive algorithms in ANSI C into hardware and generates a hardware accelerator

• Triton Tuner verifies that the new system performs to the desired level

• RTL, test bench, modified C code, driver, and architecture are exported back into the Xilinx environment

Conclusion

Poseidon’s Triton Tool suite enables you to rapidly and predictably analyze, optimize, and accelerate processor-based architectures. With Triton Tuner’s SystemC simulation environment, you can develop robust, efficient processor architectures. With Triton Builder, you can add sophisticated hardware acceleration to your processor-based systems and generate RTL from C.

The Triton tool suite enables design architects to achieve higher system throughput, reduced power consumption and cost, and shorter design cycles for FPGAs implementing DSP-intensive applications using embedded Xilinx PowerPC and MicroBlaze processors.

For more information, visit our website at www.poseidon-systems.com or contact a sales representative at (925) 292-1670. To be qualified for a free tool evaluation of the Poseidon Builder and Tuner and a free white paper, e-mail [email protected].


[Figure 4 diagram labels: Poseidon Triton Tool Suite (Hardware/Software Partitioning and Acceleration; System Simulation and Performance Analysis); ANSI C Application Code and Architecture Description; Accelerator RTL, Modified C Code, Accelerator Model, RTL Test Bench; Xilinx flow blocks (XPS Design Entry, ISE Implementation, HW IP, SW IP, RTOS, BSP, Compiler, Debugger, Instruction Set Simulator, BFM, Co-Design: Hardware/Software Partitioning, Embedded Hardware, Software Design, Embedded Processor Subsystem Design, Co-Verification, On-Chip Debug); processor system blocks (PowerPC, MicroBlaze, APU Controller, Custom Auxiliary Processor, Memory Controller, Custom Logic, GPIO, UART, Bridge, CoreConnect PLB, CoreConnect OPB, FSL (optional))]

Figure 4 – Integrated Poseidon and Xilinx tool flow

A series of compelling, highly technical product demonstrations, presented

by Xilinx experts, is now available on-line. These comprehensive videos

provide excellent, step-by-step tutorials and quick refreshers on a wide array

of key topics. The videos are segmented into short chapters to respect your

time and make for easy viewing.

Ready for viewing, anytime you are

Offering live demonstrations of powerful tools, the videos enable you to

achieve complex design requirements and save time. A complete on-line

archive is easily accessible at your fingertips. Also, a free DVD containing all

the video demos is available at www.xilinx.com/dod. Order yours today!

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

FREE on-line training Demos On Demand


Unveiling Nova

Using a single Virtex-II Pro FPGA to create a reprogrammable video switching system.

By Roger Smith
Chief Engineer
Echolab LLC, [email protected]

Television mixers have historically been built with dedicated hardware to achieve a specific fixed functionality. As mixers have evolved from devices built with discrete transistors to more modern mixers with advanced large-scale integration (LSI) integrated circuits, a common limitation has been that these devices were built with fixed signal and data paths that pre-define the topology of the mixer. Thus, a mixer targeted for two mix/effects (M/Es) and two keyers per M/E is built specifically for that function, with limited or no future adaptability.

The Echolab Nova series completely breaks with tradition in this regard, moving to a completely reconfigurable platform based on a system-on-chip architecture.

We used a single Xilinx® Virtex™-II Pro FPGA to create a completely reprogrammable video switching system. By absorbing the interconnects of the mixer into a single FPGA, the limitations of a fixed signal path architecture have been removed, and the topology of the mixer can be redefined again and again throughout the life of the product.

Nova Family

In 2004, Echolab launched the first member of the Nova series, the Nova 1716. This product is a full program/preset mixer with 16 inputs and 16 outputs, two downstream keyers, and a full M/E upstream complete with two effects keyers. In 2005, we introduced several new members of the family. The Nova 1932 is a 32-input program/preset mixer, with two upstream M/Es and a full complement of six keyers. The identity4 is a 1 M/E 16-input look-ahead preview mixer. The identity4 brings tremendous advances in video layering, with four upstream and two downstream keyers and five internal pattern generators, all in a flight-pack-size panel and frame.

As shown in Figure 1, what is unique about these mixers is that they share a common frame and electronics, which are simply re-programmed to fit the target mixer design.

I/O

Major advances in FPGA design were necessary to undertake such a dramatic shift in mixer architectures. One of the first challenges that we had to overcome was how to get the video bandwidth into and out of the FPGA.

Even the smallest Nova family member (shown in Figure 2) has 16 SDI inputs and 16 SDI outputs. At 270 Mbps, this is an aggregate video data bandwidth of more than 8.5 Gbps. The Nova 1732 and 1932 family members have 32 inputs and 16 outputs, for an aggregate of approximately 13 Gbps.
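The bandwidth figures above follow directly from the port counts and the 270 Mbps SDI rate. As a quick check, the arithmetic can be written out as follows (the function name is ours, introduced only for illustration):

```c
#include <stdint.h>

#define SDI_RATE_MBPS 270u  /* SMPTE 259M serial rate quoted in the article */

/* Aggregate video bandwidth, in Mbps, for a mixer with the given port counts. */
static uint32_t aggregate_mbps(uint32_t inputs, uint32_t outputs)
{
    return (inputs + outputs) * SDI_RATE_MBPS;
}
```

With 16 inputs and 16 outputs this gives 32 × 270 = 8,640 Mbps, or a little over 8.5 Gbps. With 32 inputs and 16 outputs it gives 48 × 270 = 12,960 Mbps, or roughly 13 Gbps.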

Figure 1 – The Nova switcher family

Figure 2 – Nova SDI I/O

One of the key aspects of the Virtex-II Pro device is its ability to support I/O bandwidths in excess of 400 MHz on all FPGA I/O pins. Another key feature of Xilinx parts is support for internal termination of high-speed differential inputs. Because of the high-speed nature of this video I/O, the wires must be treated as transmission lines, with great care paid to electrical termination. Historically, these termination resistors would be placed directly outside the chip, as close as possible to the end of the line. The need for a substantial number of terminations so close to the chip would be a layout complication. Because Xilinx can support internal termination of these high-speed video signals, this has greatly simplified the architecture, making the support of large blocks of high-speed video I/O possible.

The next challenge was deciphering the SMPTE 259M stream. Because the SDI streams are coming from all over the studio, even when genlocked (synchronizing a video source with other television signals to allow the signals to be mixed), there can be large phase shifts of +/- half a line between these signals, and therefore, a common clock cannot be initially used to sample all of the incoming signals.

Typically, most SDI mixers use individual clock and data separator circuits on each input so that the hardware can recover the individual bits from each stream. After the data streams are separated and decoded, sync detector circuits are used to write these streams into FIFO memories. A common genlock clock and reference is then used to read out the video streams from the memories for effects processing downstream.

This topology is not viable for Echolab’s system-on-chip architecture. The large number of clocks created by these front-end clock and data separators would overwhelm the FPGA’s clock resources, and these parts would all need to be crowded near the FPGA. The video interface to the FPGA must be simpler, and require few or no parts near the FPGA. Thus, traditional clock and data separation techniques do not work here.

Echolab applied an asynchronous data-recovery technique from Xilinx application note XAPP224 (“Data Recovery”) originally developed for the networking market. The technique uses precise low-skew clocks to sample the inputs at a rate in excess of a gigahertz. The samples are examined to determine the location of the data bit cell transitions. The encoded data is extracted from the stream and can cross into the genlock clock domain without ever extracting the clock from the input stream. For more information about this technique, see www.xilinx.com/bvdocs/appnotes/xapp224.pdf.

The FPGA’s ability to route nets and guarantee skew performances on the order of picoseconds has enabled development and implementation of this unique SDI input.

Another design challenge in digital mixers has been the implementation of video line delays and FIFOs, which have historically been used in large numbers to time internal video paths and outputs. It is often necessary to add delay lines to AUX bus outputs to keep them in time with the primary mixer outputs. These delay lines and FIFOs have typically been implemented with discrete memory devices.

The Xilinx FPGA solution contains large numbers of video line-length memories on chip, which lend themselves naturally to delay lines. Xilinx has enough of these memories on-chip to allow all AUX bus outputs on the Nova series to be timed.
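Returning to the asynchronous data-recovery technique described in this section, the idea of locating bit-cell transitions in an oversampled stream can be sketched in software. This is a simplified model and not the XAPP224 circuit: it assumes exactly four samples per bit (consistent with sampling a 270 Mbps stream at just over 1 GHz), a fixed phase with no jitter or drift, and hypothetical function names.

```c
#include <stddef.h>
#include <stdint.h>

#define OSR 4  /* assumed oversampling ratio: samples per bit cell */

/* Find which sample phase (index mod OSR) carries the most transitions.
 * That phase marks the first sample of each bit cell. */
static size_t find_edge_phase(const uint8_t *samples, size_t n)
{
    size_t counts[OSR] = {0};
    for (size_t i = 1; i < n; i++)
        if (samples[i] != samples[i - 1])
            counts[i % OSR]++;

    size_t best = 0;
    for (size_t p = 1; p < OSR; p++)
        if (counts[p] > counts[best])
            best = p;
    return best;
}

/* Recover one bit per cell by reading the sample half a cell after the
 * detected edge phase. Returns the number of bits written to out[]. */
static size_t recover_bits(const uint8_t *samples, size_t n,
                           uint8_t *out, size_t out_cap)
{
    size_t sample_phase = (find_edge_phase(samples, n) + OSR / 2) % OSR;
    size_t count = 0;
    for (size_t i = sample_phase; i < n && count < out_cap; i += OSR)
        out[count++] = samples[i];
    return count;
}
```

No clock is recovered from the stream at any point; the data is picked out of the samples by position alone, which is what lets the recovered bits be handed to logic running on the common genlock clock. A real implementation also has to track phase drift between the source and the sampling clock, which this sketch omits.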

Crosspoint Array

The next challenge was to implement a crosspoint array of sufficient size to support these large mixers without overwhelming the resources of the chip. Although the crosspoint array appears to be a reasonable size from the front panel, the array is often much larger because it must support all of the internal sources and functions inside the mixer.

The Nova 1716 has 16 external inputs, but it also has three internal colorizers, black, three internal frame buffers, and several intermediate sources generated by the upstream M/E. For outputs from the array, each M/E requires seven video buses (A/B, key 1 cut and fill, key 2 cut and fill, as well as video for borders). There are also 12 AUX bus feeds and dedicated buses to support capture on the internal frame buffers.

By the time you add it all up, the required internal crosspoint array is easily 30 x 30 for the 16-input mixer. To develop this crosspoint array as a 10-bit wide parallel implementation would consume a large amount of resources within the FPGA.

A more effective use requires the design of high-speed serializers and de-serializers within the part. The Virtex-II Pro FPGA’s ability to generate multiples of the genlock clock within the part with low skew allows you to create whole sections of the chip that can run at the SMPTE 259M bit-serial rate of 270 MHz. This implementation of an SDI serial rate crosspoint array (Figure 3) effectively limits the use of valuable FPGA fabric to less than 10 percent of the capacity of the target chip.
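A behavioral sketch helps show why the serial crosspoint is cheaper than the parallel one. The model below is an illustration under our own assumptions (the `crosspoint_t` type, the function names, and the use of switched single-bit paths as a rough proxy for fabric cost are not from the article): each output simply selects one input, and one bit per output moves on every tick of the serial clock.

```c
#include <stddef.h>
#include <stdint.h>

#define XPT_SIZE 30  /* 30 x 30 array, as estimated for the 16-input mixer */

typedef struct {
    uint8_t select[XPT_SIZE];  /* select[o] = input currently routed to output o */
} crosspoint_t;

/* Route input `in` to output `out`; out-of-range requests are ignored. */
static void xpt_route(crosspoint_t *x, size_t out, size_t in)
{
    if (out < XPT_SIZE && in < XPT_SIZE)
        x->select[out] = (uint8_t)in;
}

/* Move one bit-time of serial data through the array. */
static void xpt_tick(const crosspoint_t *x,
                     const uint8_t in_bits[XPT_SIZE],
                     uint8_t out_bits[XPT_SIZE])
{
    for (size_t o = 0; o < XPT_SIZE; o++)
        out_bits[o] = in_bits[x->select[o]];
}

/* Rough cost proxy: single-bit paths that must be switched for a bus width. */
static unsigned xpt_paths(unsigned width_bits)
{
    return XPT_SIZE * XPT_SIZE * width_bits;
}
```

By this measure a 10-bit parallel array switches 30 × 30 × 10 = 9,000 single-bit paths, while the bit-serial array switches 900, at the price of running that logic at the serial rate.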

Effects GenerationOnce the streams have gonethrough the crosspoint array andhave been de-serialized, the videobuses, now “timed,” can be rout-ed to the appropriate processingblocks within the FPGA to per-form various video effects (seeFigure 4).

The creation of video effects within the FPGA such as wipes, mixes, and keys is easily performed with the basic building blocks of the FPGA. Most basic video effects can be performed with nothing more than simple combinations of addition, subtraction, and multiplication. Hundreds of embedded high-speed multipliers within the FPGA fabric allow a variety of video effects to be performed effortlessly with very high precision.
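As an illustration of how little arithmetic a basic effect needs, here is a minimal sketch of a mix (cross-fade) between two 10-bit samples using only multiplication, addition, subtraction, and one rounding division. The function name, the 0..1023 control range, and the rounding scheme are assumptions for illustration, not details taken from the Nova design.

```python
K_MAX = 1023  # assumed full-scale value of a 10-bit mix control


def mix(a: int, b: int, k: int) -> int:
    """Cross-fade two 10-bit samples: k = 0 gives a, k = K_MAX gives b."""
    # Two multiplies and one add; K_MAX // 2 rounds to the nearest integer.
    return (a * (K_MAX - k) + b * k + K_MAX // 2) // K_MAX


print(mix(100, 900, 0))      # all of source A
print(mix(100, 900, K_MAX))  # all of source B
print(mix(0, 1023, 512))     # roughly half way
```

In hardware, the two products map naturally onto the embedded multipliers, and the division by a constant is usually replaced by a shift or a scaled coefficient.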

Embedded memory can also be used for LUTs and filter coefficient storage. An example of an often-used filter would be an interpolating filter for 4:4:4 up-sampling before a DVE or Chromakey.

Figure 3 – Virtex-II Pro switcher implementation

Figure 4 – Effects

A high-precision circle wipe requires a square-root function. This complicated mathematical function was implemented with a CORDIC core provided by Xilinx. These blocks of pre-built and tested IP from Xilinx and other third-party developers allow you to rapidly deploy designs without having to develop and test every building block from scratch. Development with these IP cores can remain at a very high level, providing quick time to market. Also, the optimized size of these cores allows the design of complex mixers within a single chip.
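To see why a circle wipe needs a square root, consider the sketch below. It computes a key value for each pixel from its distance to the wipe center. The library `math.sqrt` stands in for the CORDIC core mentioned above, and the function name, the soft-edge parameter, and the 0.0 to 1.0 key range are assumptions for illustration only.

```python
import math


def circle_wipe_key(x, y, cx, cy, radius, softness=0.0):
    """Return a key in [0.0, 1.0]: 1.0 inside the circle, 0.0 outside.

    A non-zero softness gives a linear ramp of that width centered on
    the circle's edge.
    """
    # Distance from the pixel to the wipe center; this is the square root
    # that the hardware computes with a CORDIC core.
    distance = math.sqrt((x - cx) ** 2 + (y - cy) ** 2)
    if softness <= 0:
        return 1.0 if distance <= radius else 0.0
    ramp = (radius + softness / 2 - distance) / softness
    return min(1.0, max(0.0, ramp))


print(circle_wipe_key(3, 4, 0, 0, 5))        # on the edge of a hard wipe
print(circle_wipe_key(10, 0, 0, 0, 10, 4))   # middle of a soft edge
```

The resulting key can then drive a mix between two sources, so the wipe itself reduces to a distance calculation followed by multiply-and-add.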

Embedded PowerPC

The computer horsepower necessary to run today’s large vision mixers has grown immensely over the last dozen years. The need to accurately support field-rate effects on multiple M/E banks while at the same time communicating with complex control panels and many external devices has tested the limits of earlier 8- and 16-bit microprocessors.

Most manufacturers have either used several smaller distributed processors or a larger, faster 32-bit microprocessor. Echolab’s system-on-chip architecture (see Figure 5 – PowerPC™ implementation) takes advantage of two 32-bit PowerPCs (running at 270 MHz) embedded directly in the fabric of the Xilinx FPGA. This tight coupling between the processor and the mixer hardware leads to substantial performance improvements.

A vast library of pre-built processor peripherals from Xilinx and third parties makes it easy for functions like serial ports, memory controllers, and even Ethernet peripherals to be dropped right into the design. Custom peripherals are also easy to design.

Several large switcher manufacturers have gone to commercial operating systems like Windows or Linux to improve their software productivity. Although the early benefits can be appealing, the downside to this transition is a loss of control over the reliability of the switcher’s code base, making a device that is not as robust and bulletproof as previous generations of mixers.

Echolab has chosen Micrium’s µC/OS-II real-time operating system (RTOS) for the Nova series (see www.micrium.com). This OS is a priority-based, pre-emptive multitasking kernel that has been certified for use in safety-critical applications in medical and aviation instruments. Time-critical video processing is assigned to the highest priority task. Management of the file system, console I/O, and network stack is allocated to lower priority tasks, allowing the processor to utilize spare processor cycles in the background without ever interfering with the video hardware.

The tasks communicate with each other, and synchronize their activity, using thread-safe semaphores and message queues provided by the OS.
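The effect of this priority assignment can be shown with a small simulation. The sketch below is not µC/OS-II code and does not use its API. It is a tick-by-tick model of a priority-based, pre-emptive scheduler in which the ready task with the lowest priority number always runs (the convention µC/OS-II uses). The task names, periods, and costs are made-up values for illustration.

```python
def simulate(tasks, ticks):
    """Simulate pre-emptive priority scheduling one tick at a time.

    tasks maps a name to (priority, period, cost): the task becomes ready
    every `period` ticks and needs `cost` ticks of CPU time per release.
    A lower priority number means a higher priority.
    """
    remaining = {name: 0 for name in tasks}
    trace = []
    for tick in range(ticks):
        # Release every task whose period starts on this tick.
        for name, (_prio, period, cost) in tasks.items():
            if tick % period == 0:
                remaining[name] = cost
        ready = [name for name in tasks if remaining[name] > 0]
        if ready:
            # The highest-priority ready task always gets the CPU.
            current = min(ready, key=lambda name: tasks[name][0])
            remaining[current] -= 1
            trace.append(current)
        else:
            trace.append("idle")
    return trace


# Hypothetical task set: video work every 4 ticks, background work every 8.
TASKS = {
    "video": (1, 4, 1),
    "network": (5, 8, 3),
    "filesystem": (6, 8, 2),
}

trace = simulate(TASKS, 8)
print(trace)
```

In the trace, the video task runs in the first tick of every period regardless of how much background work is pending, and the lower-priority tasks share whatever ticks are left over.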

System Connectivity

All Nova series switchers have multiple ports for broadcast studio interconnectivity. An industry-standard RS-422 port allows for the implementation of industry-standard editing protocols. A standard RS-232 port is available for PC connectivity. An Ethernet port allows the switcher to be directly connected to a network.

Under control of Nova’s µC/OS-II, several servers run concurrently, providing an integrated Web server for remote status and display, an embedded XML-RPC server for remote control, and a full TFTP server for remote upload and download of graphics and stills.

Compact Flash and Re-Configurability

At the heart of the Nova system is an industry-standard Compact Flash card (Figure 6), designed to hold all of the firmware and software to configure and boot the Nova. With a user-accessible mode switch, you can load as many as eight different on-line configurations of Nova firmware and software from a single flash card.

The remaining storage on the card is available to store user data such as sequences and panel saves, as well as key memories and other user settings. Also, the Compact Flash is designed to hold all of the graphics and stills online. Support for cards up to 2 GB or more provides an immense storage capacity, well beyond the archaic floppy disks found in competitive products.

Given Nova’s unique architecture, it is easy for the product to be extended through downloadable firmware updates. These updates can be as simple as a routine software patch, or as complex as adding a keyer to an AUX bus output or restructuring the internal video flow within an ME for a specialized application. Architectures as different as the 2 ME Nova 1716 (with its program/preset architecture) and the Nova identity4 (with a six-keyer look-ahead preview structure) can be loaded into the same hardware.

Figure 5 – PowerPC implementation

Figure 6 – Compact Flash

Roadmap to the Future

Next-generation Virtex-4 FPGA technology from Xilinx will allow Echolab to move its system-on-chip architecture to support high-definition products. Virtex-4 FPGAs bring a higher level of performance to the embedded logic, as well as the embedded peripherals. Some of these enhancements include:

Fabric enhancements

• Larger arrays

• Faster

• Lower power

I/O enhancements

• General-purpose I/O speeds to 1 GHz

• Dedicated RocketIO™ transceiver speeds beyond 10 GHz

Dedicated hardware resources

• Multiple tri-mode Ethernet MACs

• 500 MHz multipliers with integrated 48-bit accumulators for DSP functions

• Block memories now have dedicated address generators for FIFO support

Conclusion

Television mixers have grown larger and more complex in the last dozen years. More and more, they are the focal point for the interconnection of a wide range of studio equipment.

One of the primary benefits of the system-on-chip architecture is reduced parts count. This reduction in parts count contributes directly to lower power, reduced PCB complexity, higher reliability, and reduced cost.

Another major benefit of the system-on-chip architecture is that it is almost entirely reconfigurable. This has allowed multiple products with different video architectures to be built on a common platform. This also lends itself to easy customization for specialized applications or specific vertical markets.

As the computer network continues to play more of a role in today’s modern television studio, the Nova series will be ready with support for streaming video over Gigabit Ethernet. H.264 and WMV9 codecs will drop right into the Nova’s system-on-chip architecture, providing future features on today’s hardware.

For more information, see Echolab’s complete line of television mixers at www.echolab.com.


Implementing a Lightweight Web Server Using PowerPC and Tri-Mode Ethernet MAC in Virtex-4 FX FPGAs

Using Ethernet, you can connect your device to the Internet, monitoring and controlling it remotely.

by Jue Sun
Systems Engineering Intern
Xilinx

Peter Ryser
Sr. Manager, Systems Engineering
Xilinx

The Tri-Mode Ethernet MAC (TEMAC) UltraController-II Module (UCM) is a minimal footprint, embedded network processing engine based on the PowerPC™ 405 (PPC405) processor core and the TEMAC core embedded within a Xilinx® Virtex™-4 FX Platform FPGA. It allows you to interact with your Virtex-4-based system through an Ethernet connection and to control or monitor your system remotely using TCP/IP from miles away. Our design uses minimal resources and ensures that you will have enough logic area for your application.

In this article, we will explain the advantage of our design and how it is implemented on an ML403 board. We will also introduce a few applications that you can build on top of this implementation.

Implementation

The TEMAC UCM leverages key innovations in Virtex-4 FPGAs to implement TCP/IP applications with minimal resource usage, as shown in Figure 1. The whole implementation takes one embedded PPC405, one integrated TEMAC, two Virtex-4 FIFO16s, 20 slice flip-flops, and 18 look-up tables (LUTs).

On the ML403 board, the TEMAC UCM connects to an external PHY through a gigabit media independent interface (GMII) and a management data input/output (MDIO) interface, and auto-negotiates tri-mode (10/100/1000 Mbps) Ethernet speeds. Other physical interfaces like RGMII and SGMII are possible and require only minimal changes to the reference design, which is provided as source code.

The software is ported from the open-source µIP TCP/IP stack and runs completely out of the PPC405’s 16 KB instruction and 16 KB data caches. It accesses Ethernet frames in two FIFO16s through the PPC405 on-chip memory (OCM) interface. One FIFO buffers inbound Ethernet frames while the other buffers outbound frames. The FIFOs also synchronize the clock domains of the PPC405 OCM and the TEMAC. The PPC405 is capable of running the software at maximum frequency, for example, 350 MHz in a -10 speed grade FPGA.

The application implemented in Xilinx Application Note XAPP807, “Minimal Footprint Tri-Mode Ethernet MAC Processing Engine” (www.xilinx.com/bvdocs/appnotes/xapp807.pdf), runs a Web server on top of the µIP TCP/IP stack. The Web server serves a Web form to connecting clients. You simply enter a string in a text field and submit it to the server, which interprets the data and displays the string on a two-line character LCD. See Figure 2 for a picture of the process.
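The form-handling step can be sketched as follows. This is not the XAPP807 code, which is written in C on top of µIP. It is a host-side Python illustration of the same idea: decode the submitted form field and split it across two display lines. The field name `text` and the 16-column line width are assumptions made for the example.

```python
from urllib.parse import parse_qs

LCD_COLS = 16  # assumed line width of the character LCD
LCD_ROWS = 2   # two-line display, as described in the article


def lcd_lines(form_body: str, field: str = "text") -> list:
    """Decode a URL-encoded form body and lay the field out on the LCD.

    Returns LCD_ROWS strings, each exactly LCD_COLS characters wide;
    text that does not fit is truncated.
    """
    value = parse_qs(form_body).get(field, [""])[0]
    value = value[: LCD_COLS * LCD_ROWS]
    return [
        value[row * LCD_COLS:(row + 1) * LCD_COLS].ljust(LCD_COLS)
        for row in range(LCD_ROWS)
    ]


for line in lcd_lines("text=Hello+from+the+network%21"):
    print(repr(line))
```

A real implementation on the PPC405 would perform the same decoding with fixed-size buffers, because the whole application has to fit in the 16 KB caches.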

The µIP TCP/IP stack has a well-defined API and comes with other demonstration applications like telnet. Based on the well-written µIP documentation, you can implement your own application without much effort. The stack is optimized for minimal footprint at the cost of performance. The restriction on TCP/IP performance is not a limiting issue for most monitor and control applications.

Figure 1 – General datapath of TEMAC UCM

Figure 2 – Example Web server allowing remote control of ML403

Figure 3 – UltraController-II ports

Implementation Flow

Implementing the TEMAC UCM is straightforward. The implementation flow comprises three main parts: one flow in Project Navigator for hardware bit file generation; one flow in EDK for software ELF file generation; and another flow to combine the software and hardware files into a single bitstream or PROM file to program the Virtex-4 FX12 FPGA or the Platform Flash on the ML403 board, respectively. Project files and scripts for all of these flows are included in XAPP807. The hardware source code for the TEMAC UCM is available in Verilog and VHDL. The software source code is written in C.

The TEMAC UCM is based on UltraController-II, as shown in Figure 3 and documented in XAPP575, “UltraController-II: Minimal Footprint Embedded Processing Engine” (www.xilinx.com/bvdocs/appnotes/xapp575.pdf). UltraController-II is a black-box processing engine that includes 32 bits of user-defined general-purpose input and output, as well as interrupt handling capability.

Loading the Software into Caches

Loading software into the instruction and data caches of the PowerPC has been possible for some time through System ACE™ CF technology or other JTAG-based methods. However, for the first time, the TEMAC UCM makes this also possible through all configuration modes, including JTAG mode, slave and master serial modes, and slave and master SelectMap modes. Other configuration solutions besides System ACE CF can load code and data into the PPC405 caches. Such solutions include Xilinx Platform Flash as well as external processors and methods described in application notes such as XAPP058, “Xilinx In-System Programming Using an Embedded Microcontroller” (www.xilinx.com/bvdocs/appnotes/xapp058.pdf).

The cache loading solution highlights another new feature in the Virtex-4 FPGA family. The USER_ACCESS register is a 32-bit register that provides a port from the configuration block into the FPGA fabric. For loading the caches, the USER_ACCESS register is connected through a small state machine to the JTAG port of the PPC405. A script converts a software ELF file into a bitstream that you can load into the processor caches with iMPACT or the ChipScope™ Analyzer. If you are interested in finding out more about this solution, Xilinx Application Note XAPP719, “PowerPC Cache Configuration Using the USR_ACCESS_VIRTEX4 Register” (www.xilinx.com/bvdocs/appnotes/xapp719.pdf), provides full details.

Use Cases for the TEMAC UCM

Figure 4 shows some use cases for the TEMAC UCM. Besides monitoring and control applications, other applications include statistics gathering, system diagnostics, display control, or math and data manipulation operations.

A system application that deals with MPEG transport streams gathers statistical information about the video and audio streams. The TEMAC UCM makes this information available through the Web interface. On the control side, the content provider dynamically loads decryption keys for encrypted streams into the system through the same Web interface.

In another example, an FPGA-controlled machine gathers information about erroneous conditions that the machine manufacturer accesses locally or remotely for diagnosis. In return, the manufacturer loads new machine parameters into the system to adjust undesired behavior.

In a third example, some networking equipment reports its operational status on a regular basis through e-mail to a control center where the information is processed. The control center equipment or personnel takes action if the equipment reports a malfunction or fails to report at all.

In all of these cases, the TEMAC UCM fits into a design that was not originally planned around it. Its small hardware and software footprint makes it suitable for designs that already use all of the available FPGA resources.

Figure 4 – UltraController-II’s possible applications (display control; remote monitoring; system configuration; industrial control; control plane and statistics gathering; diagnostic interfaces; data conversion and complex algorithms; variable data and math manipulations)

Network-Attached Co-Processor

In a recent experiment, we combined the TEMAC UCM with the auxiliary processor unit (APU) on the PPC405. The APU provides a direct way to access hardware accelerators by way of user-defined instructions (UDI). Typically, a software application running on the PPC405 uses the APU to accelerate its own execution. In this case, however, the application software runs on a regular PC that distributes the workload to multiple ML403 boards with integrated TEMAC UCM and APU, as shown in Figure 5. Each ML403 acts as a network-attached co-processor, speeding up the application running on the PC.

For example, we compute pictures generated by the Mandelbrot function. The software running on the PC divides the whole picture into smaller areas and distributes the parameters for these smaller areas to the ML403 boards through TCP connections. The PPC405 on each ML403 board reads the parameters from the TCP socket, loops through the area, and calculates every pixel using a MandelPoint co-processor connected to the APU interface. The PPC405 sends the results of this calculation back to the PC through the same TCP connection. Finally, the PC collects all the results from the network-attached co-processors and displays the final picture on the screen.
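The divide, compute, and collect pattern described above can be sketched in a few lines. The code below leaves out the TCP transport and the APU hardware entirely. It only shows how a picture might be cut into row bands, how one band is computed, and that the reassembled bands equal the full picture. The function names, the row-band split, the mapped region of the complex plane, and the iteration limit are all assumptions for illustration. `mandel_point` is a software stand-in for the MandelPoint co-processor.

```python
def mandel_point(cr, ci, max_iter=32):
    """Iteration count at which z = z*z + c escapes, capped at max_iter."""
    zr = zi = 0.0
    for i in range(max_iter):
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        if zr * zr + zi * zi > 4.0:
            return i + 1
    return max_iter


def split_rows(height, workers):
    """Divide `height` rows into contiguous (start, stop) bands."""
    base, extra = divmod(height, workers)
    bands, start = [], 0
    for w in range(workers):
        rows = base + (1 if w < extra else 0)
        bands.append((start, start + rows))
        start += rows
    return bands


def render_band(y0, y1, width, height):
    """Compute rows y0..y1-1; this is the work one board would do."""
    rows = []
    for y in range(y0, y1):
        ci = -1.0 + 2.0 * y / (height - 1)
        rows.append([mandel_point(-2.0 + 3.0 * x / (width - 1), ci)
                     for x in range(width)])
    return rows


WIDTH, HEIGHT = 12, 9  # tiny picture so the example runs instantly
picture = []
for y0, y1 in split_rows(HEIGHT, 3):   # one band per "board"
    picture.extend(render_band(y0, y1, WIDTH, HEIGHT))
print(len(picture), "rows collected")
```

Because each band depends only on its own parameters, the boards never need to talk to each other, and the only shared cost is sending parameters and results over the network.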

Figure 6 shows the result of the acceleration. A single ML403 connected to the PC computes a 1024 x 768 picture in about three seconds, including communication overhead. Adding more ML403 boards brings the overall computation time down below 1.5 seconds for three boards. Four boards do not show a significant improvement, as communication overhead becomes the main part of the computation.

Solving the same problem natively on a high-end Linux workstation takes about three seconds, about the same amount of time as one network-attached ML403 with TEMAC UCM and APU acceleration.
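One way to reason about why the fourth board adds little is a simple overhead model, T(n) = c + w / n, where w is the work that divides across boards and c is a fixed cost such as communication. The sketch below is an illustration only. The model itself is an assumption, and the inputs are the article's rounded figures (about 3 seconds for one board, taken here as exactly 1.5 seconds for three), so the fitted numbers are not measurements.

```python
def fit_overhead_model(t1, n, tn):
    """Solve T(k) = c + w / k from two samples, T(1) = t1 and T(n) = tn."""
    w = (t1 - tn) * n / (n - 1)   # divisible work
    c = t1 - w                    # fixed overhead
    return c, w


def predicted_time(c, w, boards):
    """Predicted run time for a given number of boards."""
    return c + w / boards


# Rounded figures from the article, used only to illustrate the model.
c, w = fit_overhead_model(t1=3.0, n=3, tn=1.5)
for boards in (1, 2, 3, 4):
    print(boards, round(predicted_time(c, w, boards), 4))
```

In this model the gain from each extra board shrinks quickly, because the fixed term c does not divide. That is consistent with the article's observation that communication overhead dominates at four boards.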

Conclusion

The TEMAC UCM provides a convenient and inexpensive way of adding Ethernet functionality to any design. It uses a minimal amount of FPGA resources and the software runs from the embedded PPC405 caches. As a result of these simple interfaces, development time is reduced. The Web server application is used to demonstrate the design, but you can use it for other solutions, as shown with the example of the network-attached co-processors.

For more information about the TEMAC UltraController Module, download XAPP807, including the reference design for the Xilinx ML403 board, and see the Xilinx Webcast, “Learn about the UltraController-II Reference Design with PowerPC and Tri-Mode EMAC in Virtex-4 FPGAs,” at www.xilinx.com/events/webcasts/110105_ultracontroller2.htm.


Figure 5 – Several ML403s with APU activated to accelerate computation (a PC at 192.168.1.100 connected through a switch to three ML403 boards at 192.168.1.2, 192.168.1.3, and 192.168.1.4, each with MAC, PPC405, and APU)

Figure 6 – Use multiple boards to shorten computation time (plot titled “Number of Boards vs. Amount of Time Taken”; x-axis: Number of Boards; y-axis: Time/Sec)

Development Kits Accelerate Embedded Design

A single Virtex-4 FX12 Development Kit supports both PowerPC and MicroBlaze processor design.

by Jay Gould
Product Marketing Manager, Xilinx Embedded Solutions Marketing
Xilinx

Do you always have the best balance of processor, performance, IP, and peripherals on your first embedded application design partition? Would experimenting with customizable IP and a choice of hard- and/or soft-processor cores give you a better feel for the best fit to your embedded application before committing to devices and hardware? Programmable platforms allow powerful customization of processor-based systems, but which one do you start with? Why not utilize a development kit that supports it all?

Ideally, modern development kits provide a comprehensive design environment with everything embedded developers require to create processor-based systems. They should include both the hardware platform and software tools so that you can get started creating modules to prove out a design concept before the time-consuming task of building custom hardware. By starting with a “working” hardware platform, embedded teams can evaluate the merits of one processor core versus another, begin writing application code, and even experiment with IP peripherals before prematurely committing to a specific hardware design.

A development kit should provide a stable hardware evaluation environment, complete with all required cables and probes, so that you can focus on your own application and not worry about debugging problems with a new board or broken connection. Imagine the time-to-market advantages of being able to immediately write, test, and debug code on a working reference board well in advance of any custom hardware being ready.

Processor “P” or Processor “M”?

Choosing the most appropriate processor for a specific embedded application is always a design concern. A processor used on a previous project may not be a good fit for the next project. Requirements for performance, features, size, and cost may change wildly from one product to the next. Having multiple processor core choices and the flexibility to change the design after the development cycle has started is highly desirable.

Xilinx offers system developers the choice of high-performance, “hard” PowerPC™ or flexible “soft” Xilinx® MicroBlaze™ processor cores for embedded designs. The PowerPC cores provide high performance because they are hard-immersed into the Virtex™-4 FPGA device fabric. They also have additional performance-enhancing features, such as integrated 10/100/1000 Mbps Ethernet and Auxiliary Processor Unit (APU) controllers.

The 32-bit RISC MicroBlaze soft IP processor can be instantiated in any of the Xilinx Platform FPGAs because it is built like a macro out of FPGA elements. Although the hard PowerPC processor will always be faster and smaller by nature, the MicroBlaze soft-processor core has the flexibility to be instantiated at any time during the design cycle. MicroBlaze cores can also be used as complements to PowerPC cores in Virtex-4 platforms, often providing workload distribution to optimize total system performance.

With a unified strategy of supporting both processors with the same IP library and design tool suite, you have the power to choose the best processor/IP configuration for any given design requirement. With this scenario, you do not have to throw away good work or start a design over again with different tools or IP libraries just because a processor change has been made.

Best of Both Worlds

The Virtex-4 FX family of FPGAs combines the best of both hard- and soft-processing options with a unique format for the embedded industry. For example, the powerful Virtex-4 FX12 device implemented on the Xilinx ML403 development board (Figure 1) offers an immersed PowerPC core at a value price point. This same device and board can easily support the addition of MicroBlaze soft-processor designs, creating one unique platform supporting two core options.

Engineers researching the appropriate processor choice for their next project can now experiment with a broader set of processor and IP offerings while making their decision. With a single tool suite and IP library supporting both processor options, you can really do an apples-to-apples evaluation of the benefits versus your own design requirements for the best fit.

The ML403 development board provides a rich set of features, including peripherals, memory, audio, video, and user interfaces, enabling you to easily and cost-effectively prototype your embedded system design. The actual board itself includes:

• Memory interfaces for DDR SDRAM, ZBT SRAM, and flash

• Audio and video interfaces

• Numerous user interfaces: dual PS/2, IIC Bus, RS-232, USB, and tri-mode Ethernet

• Multiple FPGA programming modes: Platform Flash, the System ACE™ solution, Linear Flash, and PC4

• Support for multiple clock sources and differential clock inputs

The Virtex-4 FX device also supports an APU controller that provides a high-bandwidth interface between the PowerPC 405 core and co-processors to execute custom instructions in the FPGA fabric. Additionally, the ML403 showcases the FX12 implementation of two fully integrated 10/100/1000 Ethernet MACs for system communication and management functions. These FX capabilities enable developers to optimize their designs for high performance.

Completely Integrated Development Kit

The Xilinx PowerPC and MicroBlaze Development Kit Virtex-4 FX12 Edition integrates a complete environment for embedded development. The kit supports both the PowerPC 405 immersed hard processor and the MicroBlaze soft processor for Virtex-4 FX12 designs. The kit comprises the following contents (Figure 2):

• Virtex-4 FX12 development board – ML403

• Platform Studio embedded tool suite and Embedded Development Kit (EDK)

• ISE™ (Integrated Software Environment) FPGA tools

• JTAG probe, Ethernet, and serial cables

• ChipScope™ Pro Analyzer (Evaluation Version)

• Reference designs

In addition to the ML403 board, the kit showcases the IEC DesignVision award-winning Xilinx Platform Studio (XPS) embedded tool suite. XPS is the integrated development environment that includes the design GUI, automated configuration wizards, compiler, and debugger. XPS is built on the Eclipse framework and supports the GNU tool chain. Design wizards automate the process of configuring the processor-based system, connect and customize the IP, and organize the project. Additionally, Platform Studio can automatically generate example test code, software drivers, and even BSPs for some of the most popular RTOSs.

Figure 1 – ML403 development board

Figure 2 – Virtex-4 FX12 PPC/MicroBlaze Development Kit

XPS is fully aware of the ML403 development board and, unlike “universal” tools, automatically populates the GUI with the feature options supported by that board. Intelligent tools automate the flow, reduce learning curves, and accelerate the design process. Users are amazed that they can create a basic embedded system using Platform Studio’s automated flow in less than 15 minutes.

EDK bundles the XPS tool suite with the impressive embedded IP library and MicroBlaze processor license. This is a one-time license; there is no royalty to be paid for MicroBlaze designs shipped. The IP library supports the CoreConnect bus, bridge, and arbiters, and offers a large peripheral catalog of more than 60 cores, including memory controllers, Ethernet, UARTs, timers, GPIO, DMA, and interrupt controllers, among others.

ISE FPGA tools are the design utilities employed for the FPGA implementation, including entry, synthesis, verification, and place and route. This design flow can be invoked directly from the Platform Studio IDE.

Target boards need a connection for the various kinds of communication coming from the host computer where tools are executed and design files created. The most common method of connection to an embedded target board is through an industry-standard JTAG probe. Xilinx offers a choice of either parallel or USB for JTAG probes; this single connection can be used for both FPGA download/debugging and embedded software download/debugging. This capability reduces your need for multiple probes and the inconvenience of constantly swapping probes. The FX12 kit also includes Ethernet and serial cables.

One unique, powerful feature that Xilinx offers is “platform debug” – the integration of the embedded software debugger and hardware debugger. By integrating the cross-triggering of both hardware and software debuggers, embedded engineers can now find and fix system bugs faster. Can’t detect the problem when the software goes off into the weeds? Have the software debugger cross-trigger the hardware debugger so that you can examine the state of the hardware. Alternatively, if you locate a problem after a hardware event, cross-trigger the software debugger to step through the code that is executing at that time. The FX12 kit includes an evaluation version of the Xilinx ChipScope Pro hardware debugging tool to encourage you to explore the merits of platform debug.

Pre-Verified Reference Designs

The last and critical ingredient of the integrated development kit that can really jump-start your entire design process is a collection of reference designs or reference systems. By including pre-verified or pre-existing working example designs with the kit, you can unpack the box and have the basic system up and running in minutes. Reference designs can prove that the hardware and connections are working before you start creating new code or IP, and prevent you from mistakenly debugging your design when what you really have is a bad board or cable.

These reference systems also act as examples for the broad set of features on the ML403 platform, such as Ethernet, DDR memory, video, and audio functions. You can use these examples as templates to get your own design features modeled, or run as-is if your custom board targets the same feature.

The FX12 Development Kit includes boot-able eOS (embedded operating systems) or RTOS examples for Wind River Systems’ VxWorks and MontaVista embedded Linux from a Compact Flash device. The ML403 board is supported by other third-party embedded RTOS or hardware/software design tool partners as well.

Sample reference systems include:

• PowerPC Reference System with VxWorks and Linux

• MicroBlaze Reference System

• DCM Phase Shift Using MicroBlaze Reference System

• UltraController-II Reference Design

• Gigabit System Reference Design (GSRD)

• APU Co-Processing Acceleration

These reference systems can save you days and even months of development time compared to manually generating every design module yourself. Leveraging existing examples jump-starts your design cycle.

Conclusion

A complete development environment of hardware board, design tools, IP, and pre-verified reference designs can dramatically accelerate embedded development. Additionally, for a quick, out-of-the-box experience, the ideal development environment also provides all of the extra JTAG probe, serial, and Ethernet cables. Intelligent, award-winning design tools that are “platform aware” make the kit easy to use, providing wizards that help you configure the system by automating the flow for the known features on the development board.

The programmable platform and intelligent XPS tools provided with the PowerPC and MicroBlaze Development Kit enable you to craft embedded systems with the optimal combination of features, performance, area, and cost. You can choose the most effective processor core for the target application, customize IP, optimize the performance, and validate the software on a development board before your own hardware is even back from the shop.

Engineers need options and flexibility when trying to satisfy all of their project requirements. The Xilinx Virtex-4 FX12 Development Kit uniquely offers two processor cores for easy evaluation in a single, integrated environment. A single IP library and common tool suite unify the design environment for both cores, providing total embedded architecture flexibility. The inclusion of numerous pre-verified reference designs empowers you to jump-start the development cycle so that you can spend more time adding value to your end application.

To learn more about the low-cost Virtex-4 FX12 Development Kit supporting both PowerPC and MicroBlaze processor cores, please visit www.xilinx.com/embdevkits. A good starting point to learn about all of our embedded processing solutions is www.xilinx.com/processor.


Finding a processor to meet performance, feature, and cost targets can be very challenging in today’s competitive environment. With the continued advances in FPGA technology features, increased performance and higher density devices, scalable processor systems can now be offered as an economical, superior alternative for real-time processing needs. A flexible processor system that’s easy-to-use, area-efficient, optimized for cost-sensitive designs, and able to support you well into the future is delivered by the award-winning Xilinx MicroBlaze™ solution.

The Xilinx MicroBlaze processor, named to EDN’s Hot 100 Products of 2005, is a 32-bit RISC “soft” core operating at up to 200 MHz on the Virtex™-4 FPGA and costs less than 50 cents to implement in the Spartan™-3E FPGA. Because the processor is a soft core, you can choose from any combination of highly customizable features that will bring your products to market faster, extend your product’s life cycle, and avoid processor obsolescence.

The MicroBlaze Advantage

Xilinx unleashes the potential of embedded FPGA designs with the award-winning MicroBlaze soft processor solution. The MicroBlaze core is a 3-stage pipeline, 32-bit RISC Harvard architecture soft processor core with 32 general-purpose registers, an ALU, and a rich instruction set optimized for embedded applications. It supports on-chip block RAM, external memory, or both. With the MicroBlaze soft processor solution, you have complete flexibility to select any combination of peripherals, memory, and interface features that you need to give you the best system performance at the lowest cost on a single FPGA.

Hardware Acceleration Using Fast Simplex Link

The MicroBlaze Fast Simplex Link (FSL) lets you connect hardware co-processors to accelerate time-critical algorithms. The FSL channels are dedicated unidirectional point-to-point data streaming interfaces. Each FSL channel provides a low-latency interface to the processor pipeline, making it ideal for extending the processor’s execution unit with custom hardware accelerators.
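To make the streaming idea concrete, here is a small behavioral sketch in Python of two unidirectional FIFO channels connecting a processor to a co-processor. It is a software model only, not the Xilinx API: the class and function names (FslChannel, SumAccelerator, offload_sum), the FIFO depth of 16, and the per-word control flag are assumptions made for illustration.

```python
from collections import deque


class FslChannel:
    """Behavioral model of one unidirectional, point-to-point FIFO link.

    The depth and the control flag carried with each word are assumptions
    of this sketch, not figures taken from the text above.
    """

    def __init__(self, depth=16):
        self.depth = depth
        self._fifo = deque()

    def put(self, word, control=False):
        """Non-blocking write; returns False when the FIFO is full."""
        if len(self._fifo) >= self.depth:
            return False
        self._fifo.append((word & 0xFFFFFFFF, control))
        return True

    def get(self):
        """Non-blocking read; returns None when the FIFO is empty."""
        if not self._fifo:
            return None
        return self._fifo.popleft()


class SumAccelerator:
    """Hypothetical co-processor: sums data words until a control word arrives."""

    def __init__(self, inbound, outbound):
        self.inbound = inbound
        self.outbound = outbound
        self.total = 0

    def run(self):
        """Consume every word currently queued on the inbound channel."""
        while True:
            item = self.inbound.get()
            if item is None:
                return
            word, control = item
            if control:
                # A real design would stall if the outbound FIFO were full.
                self.outbound.put(self.total)
                self.total = 0
            else:
                self.total = (self.total + word) & 0xFFFFFFFF


def offload_sum(values, depth=16):
    """Stream values to the accelerator and read back the 32-bit sum."""
    to_hw = FslChannel(depth)
    from_hw = FslChannel(depth)
    hw = SumAccelerator(to_hw, from_hw)
    for v in values:
        while not to_hw.put(v):      # FIFO full: let the hardware drain it
            hw.run()
    while not to_hw.put(0, control=True):
        hw.run()
    hw.run()
    word, _ = from_hw.get()
    return word
```

Each channel carries data in one direction only, so a round trip needs one channel toward the accelerator and a second one back, which is why the sketch creates a pair.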

Floating-Point Unit Support

MicroBlaze introduces an integrated single-precision, IEEE-754 compatible Floating-Point Unit (FPU) option optimized for embedded applications such as industrial control, automotive, and office automation. The MicroBlaze FPU provides designers with a processor tailored to execute both integer and floating-point operations.
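For readers weighing whether single precision is sufficient, the following Python sketch shows what the IEEE-754 single-precision format holds: 1 sign bit, 8 exponent bits, and 23 fraction bits. It runs on a host PC using the standard struct module and says nothing about the MicroBlaze FPU implementation itself; the helper names are illustrative.

```python
import struct


def float32_fields(x):
    """Split a value, rounded to single precision, into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # biased by 127
    fraction = bits & 0x7FFFFF          # 23 stored fraction bits
    return sign, exponent, fraction


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


if __name__ == "__main__":
    print(float32_fields(1.0))
    print(float32_fields(-2.5))
    # Integers above 2**24 are no longer all representable in single precision.
    print(to_float32(16777217.0))
```

The last line illustrates the practical limit: with a 24-bit significand, consecutive integers can only be represented exactly up to 2^24.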

Hardware Configurability

The MicroBlaze processor solution provides a high level of configurability to tailor the processor sub-system to the exact needs of the target embedded application. Configurable features such as the barrel shifter, divider, multiplier, instruction and data caches, FPU, FSL interfaces, hardware debug logic, and hardware exceptions provide great flexibility but do not add to the cost if they are not used.

Award-Winning Platform Studio Tool Suite

The Embedded Development Kit (EDK) is an all-encompassing solution for designing embedded programmable systems. This pre-configured kit includes the award-winning Platform Studio™ Tool Suite, the MicroBlaze soft processor core, as well as all the documentation and soft peripheral IP that you require for designing FPGA-based embedded processor systems.

MicroBlaze – The Low-Cost and Flexible Processing Solution

www.edn.com


MicroBlaze Hardware Options and Configurable Blocks

Hardware Functions
• Hardware Barrel Shifter
• Hardware Divider
• Machine Status Set and Clear Instructions
• Hardware Exception Support
• Pattern Compare Instructions
• Floating-Point Unit (FPU)
• Hardware Multiplier Enable
• Hardware Debug Logic

Cache and Cache Interface
• Data Cache (D-Cache)
• Instruction Cache (I-Cache)
• Instruction-side Xilinx Cache Link (IXCL)
• Data-side Xilinx Cache Link (DXCL)

Bus Infrastructure
• Data-side On-chip Peripheral Bus (DOPB)
• Instruction-side On-chip Peripheral Bus (IOPB)
• Data-side Local Memory Bus (DLMB)
• Instruction-side Local Memory Bus (ILMB)
• Fast Simplex Link (FSL)

Take the Next Step

Visit our website www.xilinx.com/microblaze for more information. To order your Embedded Development Kit, visit www.xilinx.com/edk.

Device Family     Max Clock Frequency     Max Dhrystone 2.1 Performance
Virtex-II Pro     170 MHz                 138 DMIPS
Virtex-4          200 MHz                 166 DMIPS
Spartan-3         100 MHz                 92 DMIPS

Note: The maximum clock frequency and maximum Dhrystone 2.1 performance benchmarks are not based on the same system. Depending on the configured options, the MicroBlaze soft processor core size is about 900 – 2600 LUTs.
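Because the clock and Dhrystone figures come from different systems, dividing one by the other does not give a measured DMIPS/MHz value. It can still give a rough floor, under the assumption (not stated in the note) that the benchmarked system ran no faster than the quoted maximum clock for its family. The Python sketch below does that arithmetic; the function name is illustrative.

```python
# (max clock in MHz, max Dhrystone 2.1 performance in DMIPS) as quoted per family
FIGURES = {
    "Virtex-II Pro": (170, 138),
    "Virtex-4": (200, 166),
    "Spartan-3": (100, 92),
}


def dmips_per_mhz_floor(max_mhz, max_dmips):
    """Lower bound on DMIPS/MHz.

    If the benchmark ran at some clock f <= max_mhz, then
    max_dmips / f >= max_dmips / max_mhz.
    """
    return max_dmips / max_mhz


if __name__ == "__main__":
    for family, (mhz, dmips) in FIGURES.items():
        floor = dmips_per_mhz_floor(mhz, dmips)
        print(f"{family}: at least {floor:.2f} DMIPS/MHz")
```

Treat the results as a sanity check only; the efficiency actually achieved depends on which processor options and memory configuration the benchmark system used.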

[Figure: MicroBlaze block diagram. Core blocks: Instruction Buffer, Instruction Decode, Program Counter, Special Purpose Registers, ALU, Shift, Barrel Shift, Multiplier, Divider, FPU, Register File (32 x 32b), I-Cache, D-Cache. Instruction-side bus interface: IXCL, IOPB, ILMB. Data-side bus interface: DXCL, DOPB, DLMB. FSL links to a HW/SW Accelerator. External blocks: On-chip Peripheral Bus (OPB), OPB Memory Controller, OPB Peripherals, Multi-Channel Memory Controller, Block RAM. Legend: Basic Processor Functions; Configurable Functions; Designer Defined Blocks; Peripherals (Xilinx, 3rd Party, or Designer Defined).]

Embedded Development Kit and Platform Studio Tool Suite

For development, Xilinx offers the Embedded Development Kit (EDK), which is the common design environment for both MicroBlaze and PowerPC-based embedded systems. The EDK is a set of microprocessor design tools and common software platforms, such as device drivers and protocol stacks. The EDK includes the Platform Studio tool suite, the MicroBlaze core, and a library of peripheral IP cores.

Using these tools, design engineers can define the processor subsystem hardware and configure the software platform, including generating a Board Support Package (BSP) for a variety of development boards. Platform Studio Software Development Kit (SDK) is based on the Eclipse open-source C development tool kit and includes a full-featured development environment and a feature-rich GUI debugger. The MicroBlaze processor is supported by the GNU compiler and debugger tools. The debugger connects to the MicroBlaze processor via JTAG. For debugging visibility and control over the embedded system, design engineers can add the ChipScope Pro™ verification tools from Xilinx, which are integrated into the hardware/software debug capabilities of the EDK.

Corporate Headquarters
Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124
Tel: (408) 559-7778
Fax: (408) 559-7114
Web: www.xilinx.com

European Headquarters
Xilinx
Citywest Business Campus
Saggart, Co. Dublin
Ireland
Tel: +353-1-464-0311
Fax: +353-1-464-0324
Web: www.xilinx.com

Japan
Xilinx, K.K.
Shinjuku Square Tower 18F
6-22-1 Nishi-Shinjuku
Shinjuku-ku, Tokyo 163-1118, Japan
Tel: 81-3-5321-7711
Fax: 81-3-5321-7765
Web: www.xilinx.co.jp

Asia Pacific
Xilinx Asia Pacific Pte. Ltd.
No. 3 Changi Business Park Vista, #04-01
Singapore 486051
Tel: (65) 6544-8999
Fax: (65) 6789-8886
RCB no: 20-0312557-M
Web: www.xilinx.com

Distributed By:

© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.


INTEGRATED SYSTEM DESIGN + DESIGN FOR MANUFACTURING + ELECTRONIC SYSTEM LEVEL DESIGN + FUNCTIONAL VERIFICATION

THE EDA TECHNOLOGY LEADER

©2006 Mentor Graphics Corporation. All Rights Reserved. Mentor Graphics is a registered trademark of Mentor Graphics Corporation.

/**********************************/
/* FPGA embedded challenge        */
/**********************************/

volatile project;
const short schedule;
complex hardware, software;

if (FPGA && CPU) {
    success = seamless_FPGA(symbolic_debug, early_HW_access, easy2use);
    performance = 1000 * faster;
    visibility++;
}
else {
    visibility = NULL;
    performance = SLOW;
    vacation = 0;
}

// evaluate it at www.seamlessFPGA.com

Hardware/Software Integration

You work late nights and weekends, but that can only get you so far when you have a huge device with logic, memory, processors, complex IP, and a tight schedule. How do you ensure that hardware/software integration goes smoothly? Seamless FPGA from Mentor Graphics is the only hardware/software integration tool that matches support for the Virtex FPGA family with models for the PowerPC and MicroBlaze processor architectures, giving you full software and hardware visibility and the performance you need to spot hardware/software integration issues early in the design cycle. To learn more about Seamless FPGA, go to www.seamlessFPGA.com, email [email protected] or call 800.547.3000.

NOT ALL CODE IS HARD TO UNDERSTAND

Discover endless possibilities as Xilinx brings you hands-on, FREE workshops at ESC Silicon Valley. You’ll gain valuable embedded design experience in real time. And only Xilinx Platform FPGAs offer programmable flexibility with the optimal choice of integrated hard and/or soft processors, which means fast and flexible solutions for your processing needs.

FREE Hands-On Workshops (90 minutes each) Room C1

• High Performance Design with the Embedded PowerPC and Co-Processor Code Accelerators (April 4: 10am, 2pm & 4pm)

• Learn to Build and Optimize a MicroBlaze Soft Processor System in Minutes (April 5: 10am, Noon, 2pm & 4pm)

• Design and Verify Video/Imaging Algorithms with the Xilinx Video Starter Kit (April 6: 9am, 11am & 1pm)

Live Demonstrations

Explore embedded processing with the flexible PowerPC™ and MicroBlaze 32-bit processor cores, plus XtremeDSP performance and our PCI Express solutions. And see for yourself why the embedded industry has recognized award-winning products like Platform Studio and MicroBlaze.

Xilinx will also offer numerous Theatre Presentations and Technical Program Papers. Check in at booth #315 for a program schedule, and find out why Xilinx is embedded processing.

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Visit Xilinx at Embedded Systems Conference Silicon Valley, April 4–6, 2006, Booth #315

IN-DEPTH PRODUCT DEMONSTRATIONS & PRESENTATIONS | FREE HANDS-ON WORKSHOPS

Xilinx is Embedded Processing

PN 0010924
