3D Vision: Developing an Embedded Stereo-Vision System
Post on 22-Sep-2016
EMBEDDED COMPUTING
Many products would benefit greatly from the ability to see people and objects and respond to events in real time. Cars, for example, could detect pedestrians or objects in harm's way and brake automatically. Agricultural equipment could navigate fields autonomously, avoiding stray objects, people, and wildlife. Security systems could track people moving through a building.
Any vision system deployed in the real world must be able to operate in a broad range of conditions. Lighting can vary dramatically. Rain, snow, and ice can occlude objects or alter their appearance. People and objects can be stationary or moving, and people can move in different ways and at different speeds. Detecting a darkly clad pedestrian racing across an unlit alley in a thunderstorm requires visual sophistication far beyond the simple identification of stationary objects in a well-lit laboratory.
STEREO VISION
The most promising solution for promptly and effectively interpreting visual data in these challenging conditions is 3D vision, and the most effective way of realizing 3D vision is stereo vision. Using two cameras side by side, stereo vision produces a virtually instantaneous estimate of the distances to elements in a scene. Detecting distances serves as a primary cue for detecting things that stand out in depth from a background. In addition to quickly gauging depth, stereo vision is highly effective for segmenting objects and gauging their size and shape. Ultimately, stereo vision simplifies data interpretation.
Accurate, real-time responsiveness would be difficult, if not impossible, to achieve cost-effectively using 2D systems, which rely on a single imager. Besides being more error-prone at gauging distances, 2D-vision systems can be easily confounded by changing lighting conditions.
Now that low-cost complementary metal-oxide semiconductor (CMOS) imagers have come on the market, the opportunity to develop a general-purpose 3D embedded-vision system has never been better.
The DeepSea G2 Stereo Vision System that Tyzx developed features an embedded stereo camera consisting of two CMOS imagers, a Tyzx DeepSea 2 stereo application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a DSP coprocessor, a PowerPC running Linux, and an Ethernet connection. About the size of a hardback book, the G2 delivers real-time interpretations of visual data, at 30 frames per second (fps), even in challenging environments and has been deployed in a variety of applications, including person-tracking and autonomous-vehicle navigation.
EFFICIENT PROCESSING
The richer and more immediately useful results that 3D vision systems produce come with a price. Processing data from stereo-imager inputs increases the system's computational requirements, and potentially its cost. Our biggest design challenge was to address the computational needs of stereo vision in a platform that would remain affordable, compact, and capable of running for long periods on little power.
One way to minimize the embedded computational requirements would be to collect image data in the vision system, then offload the data processing to another system or series of systems on a local network. We quickly realized, however, that the bandwidth requirements for multicamera applications would make this approach untenable.
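A rough sketch of the arithmetic behind that conclusion, using the per-camera figures from the article's tracking example (640 × 480 pixels at 30 fps, 26 bits per color pixel, and 10 bits per monochrome pixel). The function and variable names are illustrative and are not part of any Tyzx software.

```python
# Back-of-the-envelope data rates for one stereo tracking camera.
# Frame size, frame rate, and bit depths are the figures quoted in the
# article; the function and variable names are illustrative.

WIDTH, HEIGHT, FPS = 640, 480, 30
COLOR_BITS = 10 + 8 + 8   # 10 bits Y, 8 bits U, 8 bits V
MONO_BITS = 10            # monochrome imager


def raw_bits_per_second(width=WIDTH, height=HEIGHT, fps=FPS,
                        bits_per_pixel_pair=COLOR_BITS + MONO_BITS):
    """Raw output of both imagers, in bits per second."""
    return width * height * fps * bits_per_pixel_pair


def scale(bits_per_second, cameras):
    """Aggregate rate when `cameras` identical units share one network."""
    return bits_per_second * cameras


raw = raw_bits_per_second()
print(f"1 camera, raw imagers:     {raw / 1e6:.0f} Mbit/s")
print(f"100 cameras, raw imagers:  {scale(raw, 100) / 1e9:.1f} Gbit/s")
# Per-camera rates after each processing stage, as estimated in the article.
print(f"100 cameras, range+color:  {scale(200e6, 100) / 1e9:.0f} Gbit/s")
print(f"100 cameras, track data:   {scale(80e3, 100) / 1e6:.0f} Mbit/s")
```

The raw figure works out to about 332 Mbit/s per camera, which the article rounds to roughly 300 Mbits, and to about 33 Gbit/s for 100 cameras.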
3D Vision: Developing an Embedded Stereo-Vision System
John Iselin Woodfill, Ron Buck, Dave Jurasek, Gaile Gordon, and Terrance Brown, Tyzx

Stereo-vision systems can provide accurate real-time data in a variety of applications.

May 2007 107

Consider, for example, a wide-area tracking system based on stereo imagers (one color and one monochrome), producing 26 bits (10 bits Y, 8 bits U and V) and 10 bits per pixel, respectively. For 640 × 480 × 30 fps, the imagers will output approximately 300 Mbits of data per second. If we feed this data into a stereo processor and down-sample the color image, we might end up with as little as 200 Mbits of data per second. By processing the range and color data to detect and locate people, we can further reduce the information to a few bytes per person tracked per frame, resulting in perhaps only 80 Kbits per second. For just one tracking camera, a conventional LAN could accommodate the resulting data at any stage in the pipeline.

However, for 100 tracking cameras, the left and right source data becomes 33 Gbits per second, the color and range 20 Gbits, and the segmented track data might be about 8 Mbits per second. In total, this is an amount of data far beyond the capacity of most networks and sufficiently large to call into question the system's real-time capabilities.

Image processing, therefore, is most practically performed within the system itself. A self-contained system is also the most practical for applications such as automotive safety and consumer appliances, where offloading to another processor would be impractical.

FUNCTIONAL DECOMPOSITION
Instead of offloading the processing, we solved the computational problem by applying functional decomposition, breaking down the computational work into smaller tasks, and then putting as much processing as possible into hardware primitives.

We decomposed the data-processing work into these steps:

• rectification: aligning image data to account for imperfections and misalignments in the imagers;
• stereo correlation: correlating data from both imagers to create one master set of data;
• background modeling: applying a background model and differentiating foreground images from background images;
• mapping data into a 3D quantized projection space; and
• completing other general-purpose processing tasks.

At the simplest level, the computation required for a vision system takes input from left and right imagers, processes data, and outputs results for use by an application.

First-generation product
In our first-generation embedded product, we programmed an FPGA to perform image rectification and developed an ASIC, the DeepSea 2 chip, to perform stereo correlation. The remaining processing was performed on a 600-megahertz Pentium III chip, which also ran the user application. The FPGA, ASIC, and PIII communicated over a peripheral component interconnect (PCI) bus. The system ran at 15 hertz and processed images that were 400 pixels wide by 300 pixels high.

For applications in which a vision system must make rapid decisions about complex, changing data, it's best to have many frames of data to confirm the system's interpretation of a scene. While our first-generation product represented a milestone in visual-system integration, its 15-fps performance wasn't sufficiently fast for some real-time applications, such as automotive safety. The system also lacked the computing power required for tracking objects, a common requirement in many vision applications.

In addition to increasing the system's performance, we also wanted to lower its overall processing and power requirements to make it easier to deploy where space is limited.
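To illustrate why frame rate matters for an application such as automotive safety, the following sketch estimates how far a vehicle travels between consecutive frames and how many frames are captured over a short stretch of road. The 50 km/h speed and the 10 m distance are illustrative assumptions, not figures from the article; only the 15-fps and 30-fps rates come from the text.

```python
# Distance an object moves between consecutive frames at a given frame rate.
# The 50 km/h speed and 10 m distance are illustrative assumptions.

def metres_per_frame(speed_kmh, fps):
    """Distance travelled during one frame interval, in metres."""
    speed_ms = speed_kmh * 1000 / 3600
    return speed_ms / fps


def frames_available(speed_kmh, fps, distance_m):
    """Whole frames captured while the object covers `distance_m` metres."""
    return int(distance_m / metres_per_frame(speed_kmh, fps))


for fps in (15, 30):
    step = metres_per_frame(50, fps)
    frames = frames_available(50, fps, 10)
    print(f"{fps} fps: {step:.2f} m per frame, {frames} frames over 10 m")
```

Under these assumptions, doubling the frame rate halves the distance covered per frame and roughly doubles the number of frames available to confirm an interpretation of the scene.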
Second-generation product
After considering various tradeoffs, we invested in more integration up front for our second-generation product, the DeepSea G2. By applying additional functional decomposition in the G2, we increased the system's performance while decreasing CPU and power requirements.
The G2 passes even more work to hardware primitives in the FPGA and ASIC. For example, an FPGA now performs background-modeling primitives previously performed on a PIII. This tighter integration and hardware processing enabled the system to process images at 30 fps in a smaller package.
Table 1 compares the functional decomposition of the two products. Whereas the first-generation product took advantage of an off-the-shelf CPU, the second-generation product relies on an FPGA for background modeling and individuation.
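The mapping in Table 1 can also be written down as data, which makes it easy to see which tasks left the general-purpose CPU between the two products. The entries below are copied from the table; the dictionary layout and helper names are illustrative assumptions.

```python
# Table 1 encoded as data: which hardware block handles each task.
# Entries are copied from the article's table; the dictionary layout and
# helper names are illustrative.

TASKS = {
    # task:               (first-generation product, DeepSea G2)
    "Rectification":      ("FPGA", "FPGA"),
    "Stereo correlation": ("DeepSea 2 ASIC", "DeepSea 2 ASIC"),
    "Background model":   ("Pentium III", "FPGA"),
    "Projection space":   ("Pentium III", "FPGA"),
    "DSP coprocessing":   (None, "BlackFin DSP coprocessor"),
    "User application":   ("Pentium III", "PowerPC"),
}

GENERAL_PURPOSE_CPUS = {"Pentium III", "PowerPC"}


def cpu_tasks(generation):
    """Tasks run on the general-purpose CPU (0 = first generation, 1 = G2)."""
    return [task for task, units in TASKS.items()
            if units[generation] in GENERAL_PURPOSE_CPUS]


def moved_to_hardware():
    """Tasks that ran on the CPU in the first generation but not in the G2."""
    g2_cpu = cpu_tasks(1)
    return [task for task in cpu_tasks(0) if task not in g2_cpu]


print("First-generation CPU tasks:", cpu_tasks(0))
print("G2 CPU tasks:", cpu_tasks(1))
print("Moved off the CPU:", moved_to_hardware())
```

In this encoding, only the user application remains on the CPU in the G2, while background modeling and projection space move to the FPGA.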
We chose hardware-accelerated visual primitives that would be useful for many tasks, account for a large part of the computational load, and be suitable for hardware acceleration. In the G2, some primitives process the incoming streams of left and right images. Other primitives process the output of other primitives.
Table 1. Comparing functional decomposition in two Tyzx products.

Task                 First-generation product   DeepSea G2 Stereo Vision
Rectification        FPGA                       FPGA
Stereo correlation   DeepSea 2 ASIC             DeepSea 2 ASIC
Background model     Pentium III                FPGA
Projection space     Pentium III                FPGA
DSP coprocessing     N/A                        BlackFin DSP coprocessor
User application     Pentium III                PowerPC

Since the G2 is an embedded stereo camera, the visual primitives are aimed at producing and interpreting range data. Since stereo correlation is computationally intensive and a common subcomponent of vision algorithms, it's an obvious visual primitive to support. Background modeling based on range and color requires huge amounts of memory bandwidth and finds a place in many tracking algorithms; hence, it's our second visual primitive. A third primitive, ProjectionSpace, produces 2D and 3D quantized representations or projections of the range data. This is a computationally intensive, but widely applicable, visual primitive. Lastly, a programmable DSP is included in the G2 as a generalized visual primitive, an additional resource for doing expensive, regular image operations.
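As a rough illustration of what a quantized projection of range data can look like, the sketch below bins 3D points into a top-down grid and counts the points per cell. This is a toy example under stated assumptions: the cell size, grid extent, and function name are hypothetical, and it does not describe how the G2's ProjectionSpace primitive is implemented.

```python
# A toy quantized projection of range data: 3D points (x, y, z) in metres are
# binned into a top-down (x, z) grid and counted per cell. Cell size and grid
# extent are illustrative assumptions, not G2 parameters.

def project_top_down(points, cell_size=0.5,
                     x_range=(-2.0, 2.0), z_range=(0.0, 4.0)):
    """Return grid[z_index][x_index] holding the number of points per cell."""
    cols = int((x_range[1] - x_range[0]) / cell_size)
    rows = int((z_range[1] - z_range[0]) / cell_size)
    grid = [[0] * cols for _ in range(rows)]
    for x, _y, z in points:
        # Floor division keeps points left of or behind the grid negative,
        # so they are rejected by the bounds check below.
        col = int((x - x_range[0]) // cell_size)
        row = int((z - z_range[0]) // cell_size)
        if 0 <= col < cols and 0 <= row < rows:
            grid[row][col] += 1
    return grid


points = [
    (0.1, 1.0, 2.1), (0.2, 1.5, 2.2), (0.3, 0.4, 2.4),  # a cluster ~2 m away
    (-1.6, 1.0, 0.7),                                   # a lone point
    (5.0, 0.0, 1.0),                                    # outside the grid
]
grid = project_top_down(points)
for row in grid:
    print(row)
```

Cells with high counts indicate where range measurements cluster in the floor plane, which is one reason such projections are useful for locating people or obstacles.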
Stereo correlation. The primary visual primitive is the stereo-correlation processor. It takes as input the left and right images and creates a range image as output. The Tyzx DeepSea 2 ASIC accelerates the performance of this primitive.
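The idea of turning a left and right image into range values can be sketched on a single scanline. The example below uses sum-of-absolute-differences (SAD) window matching followed by pinhole triangulation. These are assumptions chosen for illustration: the article does not describe the matching method used inside the DeepSea 2 ASIC, and the window size, focal length, and baseline are hypothetical values.

```python
# Toy stereo correlation on one scanline using sum of absolute differences.
# SAD matching, the window size, and the focal length and baseline values are
# illustrative assumptions, not a description of the DeepSea 2 ASIC.

def disparity_at(left_row, right_row, x, half_window=1, max_disparity=8):
    """Best-matching disparity for pixel x of the left row (None at borders)."""
    if x - half_window < 0 or x + half_window >= len(left_row):
        return None
    best_d, best_cost = None, None
    for d in range(max_disparity + 1):
        if x - d - half_window < 0:
            break  # the shifted window would fall off the right image
        cost = sum(abs(left_row[x + k] - right_row[x - d + k])
                   for k in range(-half_window, half_window + 1))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def range_from_disparity(d, focal_px=500.0, baseline_m=0.1):
    """Pinhole triangulation Z = f * B / d; None when disparity is 0 or None."""
    return None if not d else focal_px * baseline_m / d


right = [10, 10, 10, 50, 90, 50, 10, 10, 10, 10, 10, 10]
left = right[-3:] + right[:-3]  # same pattern, shifted 3 pixels to the right
d = disparity_at(left, right, x=7)
print("disparity:", d, "range (m):", range_from_disparity(d))
```

For this synthetic pair the bright feature is found at a disparity of 3 pixels, which corresponds to about 16.7 m with the assumed focal length and baseline. A real correlator repeats this search for every pixel in every frame, which is why the workload suits dedicated hardware.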
Background modeling. The background-modeling primitive takes as input range and intensity data and generates a pixel-by-pixel foreground/background map of the image based on the nature of the pixels' change from previous frames. This task ca