3D Vision: Developing an Embedded Stereo-Vision System


John Iselin Woodfill, Ron Buck, Dave Jurasek, Gaile Gordon, and Terrance Brown, Tyzx

Stereo-vision systems can provide accurate real-time data in a variety of applications.

Many products would benefit greatly from the ability to see people and objects and respond to events in real time. Cars, for example, could detect pedestrians or objects in harm's way and brake automatically. Agricultural equipment could navigate fields autonomously, avoiding stray objects, people, and wildlife. Security systems could track people moving through a building.

Any vision system deployed in the real world must be able to operate in a broad range of conditions. Lighting can vary dramatically. Rain, snow, and ice can occlude objects or alter their appearance. People and objects can be stationary or moving, and people can move in different ways and at different speeds. Detecting a darkly clad pedestrian racing across an unlit alley in a thunderstorm requires visual sophistication far beyond the simple identification of stationary objects in a well-lit laboratory.

STEREO VISION

The most promising solution for promptly and effectively interpreting visual data in these challenging conditions is 3D vision, and the most effective way of realizing 3D vision is stereo vision. Using two cameras side by side, stereo vision produces a virtually instantaneous estimate of the distances to elements in a scene. Detecting distances serves as a primary cue for detecting things that stand out in depth from a background. In addition to quickly gauging depth, stereo vision is highly effective for segmenting objects and gauging their size and shape. Ultimately, stereo vision simplifies data interpretation.
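
The distance estimate comes from simple triangulation: a feature that appears shifted by d pixels (the disparity) between the two images of parallel cameras with focal length f (in pixels) and baseline B lies at depth Z = fB/d. The article does not give the G2's optics, so the minimal sketch below uses this standard relation with purely illustrative focal-length and baseline values.

```cpp
// Minimal sketch of depth from stereo disparity (standard pinhole model).
// The focal length and baseline below are illustrative, not the G2's actual values.
#include <cstdio>

int main() {
    const double focal_px = 500.0;   // focal length in pixels (assumed)
    const double baseline_m = 0.22;  // separation between the two imagers in meters (assumed)

    // Depth is inversely proportional to disparity: Z = f * B / d.
    for (double disparity_px : {1.0, 2.0, 5.0, 10.0, 50.0}) {
        double depth_m = focal_px * baseline_m / disparity_px;
        std::printf("disparity %5.1f px  ->  depth %6.2f m\n", disparity_px, depth_m);
    }
    return 0;
}
```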

Accurate, real-time responsiveness would be difficult, if not impossible, to achieve cost-effectively using 2D systems, which rely on a single imager. Besides being more error-prone at gauging distances, 2D-vision systems can be easily confounded by changing lighting conditions.

Now that low-cost complementary metal-oxide semiconductor (CMOS) imagers have come on the market, the opportunity to develop a general-purpose 3D embedded-vision system has never been better.

The DeepSea G2 Stereo Vision System that Tyzx developed features an embedded stereo camera consisting of two CMOS imagers, a Tyzx DeepSea 2 stereo application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a DSP coprocessor, a PowerPC running Linux, and an Ethernet connection. About the size of a hardback book, the G2 delivers real-time interpretations of visual data, at 30 frames per second (fps), even in challenging environments and has been deployed in a variety of applications, including person tracking and autonomous-vehicle navigation.

EFFICIENT PROCESSING

The richer and more immediately useful results that 3D vision systems produce come with a price. Processing data from stereo-imager inputs increases the system's computational requirements, and potentially its cost. Our biggest design challenge was to address the computational needs of stereo vision in a platform that would remain affordable, compact, and capable of running for long periods on little power.

One way to minimize the embedded computational requirements would be to collect image data in the vision system, then offload the data processing to another system or series of systems on a local network. We quickly realized, however, that the bandwidth requirements for multicamera applications would make this approach untenable.

Consider, for example, a wide-area tracking system based on stereo imagers (one color and one monochrome), producing 26 bits (10 bits Y, 8 bits U and V) and 10 bits per pixel, respectively. For 640 × 480 × 30 fps, the imagers will output approximately 300 Mbits of data per second. If we feed this data into a stereo processor and down-sample the color image, we might end up with as little as 200 Mbits of data per second. By processing the range and color data to detect and locate people, we can further reduce the information to a few bytes per person tracked per frame, resulting in perhaps only 80 Kbits per second. For just one tracking camera, a conventional LAN could accommodate the resulting data at any stage in the pipeline.

However, for 100 tracking cameras, the left and right source data becomes 33 Gbits per second, the color and range 20 Gbits, and the segmented track data might be about 8 Mbits per second. In total, this is an amount of data far beyond the capacity of most networks and sufficiently large to call into question the system's real-time capabilities.
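
These figures follow directly from the image geometry. The sketch below redoes the arithmetic for one camera pair and for 100 of them; the raw rate uses the 26 + 10 bits per pixel quoted above, while the post-stereo and per-track rates simply scale up the stated 200 Mbits and 80 Kbits per second.

```cpp
// Back-of-the-envelope check of the data rates quoted in the text.
#include <cstdio>

int main() {
    const double width = 640, height = 480, fps = 30;
    const double pixels_per_sec = width * height * fps;          // per camera pair

    // One imager produces 26 bits per pixel (10 Y + 8 U + 8 V), the other 10 bits per pixel.
    const double raw_mbits   = pixels_per_sec * (26 + 10) / 1e6; // ~332 Mbit/s, "approximately 300"
    const double range_mbits = 200.0;                            // after stereo + color down-sampling (as stated)
    const double track_mbits = 0.08;                             // ~80 Kbit/s of track data (as stated)

    std::printf("one camera : raw %.0f Mbit/s, range+color %.0f Mbit/s, tracks %.2f Mbit/s\n",
                raw_mbits, range_mbits, track_mbits);
    std::printf("100 cameras: raw %.1f Gbit/s, range+color %.0f Gbit/s, tracks %.0f Mbit/s\n",
                100 * raw_mbits / 1e3, 100 * range_mbits / 1e3, 100 * track_mbits);
    return 0;
}
```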

Image processing, therefore, is most practically performed within the system itself. A self-contained system is also the most practical for applications such as automotive safety and consumer appliances, where offloading to another processor would be impractical.

FUNCTIONAL DECOMPOSITION

Instead of offloading the processing, we solved the computational problem by applying functional decomposition, breaking down the computational work into smaller tasks, and then putting as much processing as possible into hardware primitives.

We decomposed the data-processing work into these steps:

• rectification: aligning image data to account for imperfections and misalignments in the imagers;
• stereo correlation: correlating data from both imagers to create one master set of data;
• background modeling: applying a background model and differentiating foreground images from background images;
• mapping data into a 3D quantized projection space; and
• completing other general-purpose processing tasks.

At the simplest level, the computation required for a vision system takes input from left and right imagers, processes data, and outputs results for use by an application.
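
Viewed that way, the decomposition is a fixed pipeline in which each stage consumes the previous stage's output. The sketch below is only a schematic rendering of that flow; the types and stage functions are invented placeholders, not Tyzx interfaces, with comments noting where each stage runs in the two products.

```cpp
// Schematic sketch of the decomposed vision pipeline described above.
// All types and stage functions are illustrative placeholders, not Tyzx interfaces.
#include <cstdint>
#include <vector>

struct Image      { int width = 0, height = 0; std::vector<uint16_t> pixels; };
struct RangeImage { int width = 0, height = 0; std::vector<uint16_t> depth; };
struct Mask       { int width = 0, height = 0; std::vector<uint8_t>  foreground; };
struct Grid       { int cells_x = 0, cells_y = 0; std::vector<uint16_t> occupancy; };

// One stub per step listed above; where each runs follows Table 1.
Image      rectify(const Image& raw)                             { return raw; }  // FPGA
RangeImage correlate(const Image& left, const Image& right)      { return {}; }   // DeepSea 2 ASIC
Mask       model_background(const RangeImage& r, const Image& i) { return {}; }   // Pentium III, then FPGA in the G2
Grid       project(const RangeImage& r, const Mask& m)           { return {}; }   // Pentium III, then FPGA in the G2

int main() {
    Image left, right;                              // frames from the two imagers
    Image l = rectify(left), r = rectify(right);    // correct imager imperfections and misalignment
    RangeImage range = correlate(l, r);             // per-pixel distance estimates
    Mask fg = model_background(range, l);           // foreground/background map
    Grid plan = project(range, fg);                 // quantized projection space
    (void)plan;                                     // handed to the user application
    return 0;
}
```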

First-generation product

In our first-generation embedded product, we programmed an FPGA to perform image rectification and developed an ASIC, the DeepSea 2 chip, to perform stereo correlation. The remaining processing was performed on a 600-megahertz Pentium III chip, which also ran the user application. The FPGA, ASIC, and PIII communicated over a peripheral component interconnect (PCI) bus. The system ran at 15 hertz and processed images that were 400 pixels wide by 300 pixels high.

For applications in which a vision system must make rapid decisions about complex, changing data, it's best to have many frames of data to confirm the system's interpretation of a scene. While our first-generation product represented a milestone in visual-system integration, its 15-fps performance wasn't sufficiently fast for some real-time applications, such as automotive safety. The system also lacked the computing power required for tracking objects, a common requirement in many vision applications.

In addition to increasing the system's performance, we also wanted to lower its overall processing and power requirements to make it easier to deploy where space is limited.

Second-generation product

After considering various tradeoffs, we invested in more integration up front for our second-generation product, the DeepSea G2. By applying additional functional decomposition in the G2, we increased the system's performance while decreasing CPU and power requirements.

The G2 passes even more work to hardware primitives in the FPGA and ASIC. For example, an FPGA now performs background-modeling primitives previously performed on a PIII. This tighter integration and hardware processing enabled the system to process images at 30 fps in a smaller package.

Table 1 compares the functional decomposition of the two products. Whereas the first-generation product took advantage of an off-the-shelf CPU, the second-generation product relies on an FPGA for background modeling and individuation.

We chose hardware-accelerated visual primitives that would be useful for many tasks, account for a large part of the computational load, and be suitable for hardware acceleration. In the G2, some primitives process the incoming streams of left and right images. Other primitives process the output of other primitives.

Since the G2 is an embedded stereo camera, the visual primitives are aimed at producing and interpreting range data. Since stereo correlation is computationally intensive and a common subcomponent of vision algorithms, it's an obvious visual primitive to support. Background modeling based on range and color requires huge amounts of memory bandwidth and finds a place in many tracking algorithms, so it is our second visual primitive. A third primitive, ProjectionSpace, produces 2D and 3D quantized representations or projections of the range data. This is a computationally intensive, but widely applicable, visual primitive. Lastly, a programmable DSP is included in the G2 as a generalized visual primitive, an additional resource for doing expensive, regular image operations.

Table 1. Comparing functional decomposition in two Tyzx products.

Task                  First-generation product   DeepSea G2 Stereo Vision
Rectification         FPGA                       FPGA
Stereo correlation    DeepSea 2 ASIC             DeepSea 2 ASIC
Background model      Pentium III                FPGA
Projection space      Pentium III                FPGA
DSP coprocessing      N/A                        BlackFin DSP coprocessor
User application      Pentium III                PowerPC

Stereo correlation. The primary visual primitive is the stereo-correlation processor. It takes as input the left and right images and creates a range image as output. The Tyzx DeepSea 2 ASIC accelerates the performance of this primitive.
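
The article does not describe the correlation algorithm inside the DeepSea 2 ASIC. As a generic illustration of what a stereo-correlation primitive computes, the sketch below matches each left-image pixel against a window of candidate shifts in the right image using a sum of absolute differences and keeps the best-matching disparity; a hardware implementation such as the ASIC's (which searches a 52-disparity window per Table 2) is far more efficient than this naive loop.

```cpp
// Generic block-matching stereo sketch (sum of absolute differences, SAD).
// It illustrates what a stereo-correlation primitive computes; it is not the
// algorithm implemented inside the DeepSea 2 ASIC.
#include <climits>
#include <cstdint>
#include <cstdlib>
#include <vector>

// For each left-image pixel, try max_disp candidate shifts in the right image
// and keep the shift whose surrounding window differs least from the left window.
std::vector<uint8_t> block_match(const std::vector<uint8_t>& left,
                                 const std::vector<uint8_t>& right,
                                 int w, int h, int max_disp, int win = 3) {
    std::vector<uint8_t> disparity(static_cast<size_t>(w) * h, 0);
    for (int y = win; y < h - win; ++y) {
        for (int x = win + max_disp; x < w - win; ++x) {
            int best_d = 0, best_cost = INT_MAX;
            for (int d = 0; d < max_disp; ++d) {
                int cost = 0;
                for (int dy = -win; dy <= win; ++dy)
                    for (int dx = -win; dx <= win; ++dx)
                        cost += std::abs(left[(y + dy) * w + x + dx] -
                                         right[(y + dy) * w + x + dx - d]);
                if (cost < best_cost) { best_cost = cost; best_d = d; }
            }
            disparity[y * w + x] = static_cast<uint8_t>(best_d);  // larger disparity = closer object
        }
    }
    return disparity;
}

int main() {
    const int w = 32, h = 16, max_disp = 8;
    std::vector<uint8_t> left(w * h, 0), right(w * h, 0);
    left[8 * w + 20]  = 255;  // a bright feature in the left image...
    right[8 * w + 16] = 255;  // ...appears 4 pixels to the left in the right image
    std::vector<uint8_t> disp = block_match(left, right, w, h, max_disp);
    return disp[8 * w + 20] == 4 ? 0 : 1;  // expect a disparity of 4 at that pixel
}
```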

Background modeling. The background-modeling primitive takes as input range and intensity data and generates a pixel-by-pixel foreground/background map of the image based on the nature of the pixel's change from previous frames. This task can require reading roughly 400 bits per pixel from memory and writing roughly 200 bits per pixel to memory. For 400 × 300 images at 30 fps, background modeling requires over 2 Gbits of memory bandwidth per second.

In addition, matching and updating each pixel involves several comparisons and updates. Offloading this task to dedicated hardware reduces the CPU's workload and is a key factor enabling the use of a smaller embedded CPU. The output is a binary foreground mask image.
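
The bandwidth figure is consistent with the stated per-pixel traffic: 400 × 300 pixels × 30 fps × (400 + 200) bits comes to roughly 2.2 Gbits per second. The article does not give the model itself; the sketch below shows one common scheme, a per-pixel running estimate of background range and intensity with foreground flagged on large deviations. The thresholds and update rate are invented for illustration, not the values used in the G2's FPGA.

```cpp
// Illustrative per-pixel background model over range and intensity.
// Thresholds and update rate are invented, not the values used in the G2's FPGA.
#include <cmath>
#include <cstdint>
#include <vector>

struct BackgroundModel {
    std::vector<float> bg_range, bg_intensity;  // per-pixel running background estimates
    bool  seeded = false;
    float alpha = 0.02f;             // background update rate (assumed)
    float range_thresh = 300.0f;     // depth change, in mm, treated as foreground (assumed)
    float intensity_thresh = 30.0f;  // grey-level change treated as foreground (assumed)

    explicit BackgroundModel(size_t n) : bg_range(n, 0.0f), bg_intensity(n, 0.0f) {}

    // Classify every pixel and fold stable background pixels back into the model.
    std::vector<uint8_t> update(const std::vector<uint16_t>& range_mm,
                                const std::vector<uint8_t>& intensity) {
        if (!seeded) {  // seed the model from the first frame
            for (size_t i = 0; i < range_mm.size(); ++i) {
                bg_range[i] = range_mm[i];
                bg_intensity[i] = intensity[i];
            }
            seeded = true;
        }
        std::vector<uint8_t> mask(range_mm.size(), 0);
        for (size_t i = 0; i < range_mm.size(); ++i) {
            bool fg = std::fabs(range_mm[i] - bg_range[i]) > range_thresh ||
                      std::fabs(intensity[i] - bg_intensity[i]) > intensity_thresh;
            mask[i] = fg ? 255 : 0;  // binary foreground mask
            if (!fg) {               // only background pixels update the running estimates
                bg_range[i]     += alpha * (range_mm[i] - bg_range[i]);
                bg_intensity[i] += alpha * (intensity[i] - bg_intensity[i]);
            }
        }
        return mask;
    }
};

int main() {
    const size_t n = 400 * 300;            // one 400 x 300 frame
    BackgroundModel model(n);
    std::vector<uint16_t> range(n, 4000);  // an empty scene, everything about 4 m away
    std::vector<uint8_t>  grey(n, 100);
    std::vector<uint8_t>  mask = model.update(range, grey);  // first frame seeds the model
    return mask[0] == 0 ? 0 : 1;           // nothing should be flagged as foreground yet
}
```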

ProjectionSpace. The ProjectionSpace primitive transforms a full 3D cloud of points into either 3D data projected into a Euclidean 3D quantized volume or a 2D array. These 3D volumes or 2D arrays are useful for many applications, such as people tracking. The ProjectionSpace primitives transform vast amounts of 3D data into forms that make sense for applications tracking people and objects in a 3D space.
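
As an illustration of what such a projection produces (not Tyzx's implementation), the sketch below bins a cloud of 3D points into a quantized overhead occupancy grid of the kind a person tracker might consume; the cell size, working volume, and coordinate convention are all assumed.

```cpp
// Illustrative ProjectionSpace-style primitive: bin a cloud of 3D points into a
// quantized overhead (plan-view) occupancy grid. Cell size and extents are assumed.
#include <cstdint>
#include <vector>

constexpr float kCellM  = 0.25f;             // 25 cm grid cells (assumed)
constexpr int   kCellsX = 40, kCellsZ = 40;  // a 10 m x 10 m working volume (assumed)

struct Point3 { float x, y, z; };  // meters; y is height, x-z is the ground plane (assumed convention)

// Drop each 3D point onto the ground plane and count hits per quantized cell.
std::vector<uint16_t> plan_view(const std::vector<Point3>& cloud) {
    std::vector<uint16_t> counts(static_cast<size_t>(kCellsX) * kCellsZ, 0);
    for (const Point3& p : cloud) {
        int ix = static_cast<int>(p.x / kCellM);
        int iz = static_cast<int>(p.z / kCellM);
        if (ix >= 0 && ix < kCellsX && iz >= 0 && iz < kCellsZ)
            ++counts[static_cast<size_t>(iz) * kCellsX + ix];
    }
    return counts;
}

int main() {
    // A rough column of points where a person might stand: 1 m to the side, 2 m out.
    std::vector<Point3> cloud;
    for (float y = 0.0f; y < 1.8f; y += 0.05f) cloud.push_back({1.0f, y, 2.0f});
    std::vector<uint16_t> grid = plan_view(cloud);
    // The cell under the person holds a tall count; a person tracker looks for such peaks.
    return grid[8 * kCellsX + 4] > 0 ? 0 : 1;
}
```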

OTHER DESIGN CHALLENGES

The technical hurdles for such a vision system also include size, portability, and power requirements. The design goal, in most cases, is to unobtrusively add a vision system to an existing product or environment. That means deploying a solution that's compact and probably lightweight and makes few power and cooling demands.

Table 2 details the G2's specifications. The system consumes about 15 watts; its camera includes a 100BaseT Ethernet interface that accepts Power-over-Ethernet. A compact flash memory card stores a Linux kernel and a root file system so that the G2 can boot into Linux on power-up.

Table 2. Tyzx DeepSea 2 stereo-correlation chip specifications.

Specification                Value
Input image size (max)       512 × 2048 (10 bit) pixels
Stereo range search window   52 disparities
Subpixel localization        5 bits
Z output                     16 bits
Max frame rate               200 fps (512 × 480)
Power                        < 1 watt
Pixel disparities/second     2.6 billion
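
The 2.6 billion pixel-disparities per second in Table 2 is consistent with the chip's other figures: at the maximum frame rate every pixel is evaluated against every candidate disparity, as the arithmetic below shows.

```cpp
// Consistency check of the DeepSea 2 throughput figure in Table 2:
// pixels per frame x disparities searched x frames per second.
#include <cstdio>

int main() {
    const double width = 512, height = 480;  // frame size at the 200 fps maximum rate
    const double disparities = 52;           // stereo range search window
    const double fps = 200;
    double pd_per_sec = width * height * disparities * fps;  // ~2.56e9, i.e. ~2.6 billion
    std::printf("%.2f billion pixel-disparities per second\n", pd_per_sec / 1e9);
    return 0;
}
```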

Increased hardware integration and acceleration in the G2 have paid off. The system can track people and objects in changing lighting conditions, using 512 × 380 images at 30 fps, a rate fast enough to support rapid decision making and deliver real-time responsiveness. Electroland, a company that creates interactive environments, has installed four G2s running our person-tracking application in their interactive installation for Target in Rockefeller Center's new observation deck. Another partner is developing an autonomous urban-reconnaissance vehicle that uses the G2 to run obstacle-detection and path-planning algorithms.

In the future, we expect to integrate even more functionality in hardware to create embedded stereo systems that are smaller, faster, and easier to integrate in products working in the real world. ■

John Iselin Woodfill is the chief technology officer; Ron Buck is president and CEO; Gaile Gordon is vice president, advanced development; Dave Jurasek is vice president, hardware engineering; and Terrance Brown is senior hardware design engineer, all at Tyzx. Contact the authors at [email protected].

Editor: Wayne Wolf, Dept. of Electrical Engineering, Princeton University, Princeton NJ; [email protected]

