hexagon dsp:an architecture mobile...

10
. ................................................................................................................................................................................................................ HEXAGON DSP: AN ARCHITECTURE OPTIMIZED FOR MOBILE MULTIMEDIA AND COMMUNICATIONS . ................................................................................................................................................................................................................ THE QUALCOMM HEXAGON DSP IS USED FOR BOTH MODEM PROCESSING AND MULTIMEDIA ACCELERATION.BY OFFLOADING MULTIMEDIA TASKS FROM THE CPU TO THE DSP, SIGNIFICANT POWER SAVINGS CAN BE ACHIEVED.THIS ARTICLE PROVIDES AN OVERVIEW OF THE HEXAGON ARCHITECTURE.THE PROCESSOR IS DESIGNED TO DELIVER SUPERIOR ENERGY EFFICIENCY COMPARED TO MOBILE CPU ALTERNATIVES AND THEREBY HELP ACHIEVE LONG BATTERY LIFE FOR IMPORTANT MOBILE APPLICATIONS. ...... To be competitive, a modern mobile product must provide a rich user expe- rience and long battery life. Chips for these ecosystems integrate multiple subsystems, each customized for a particular application domain. By specializing a subsystem to a task, performance and power can be enhanced beyond what is possible with a homogenous CPU-based computing platform. Figure 1 shows a block diagram of the Snapdragon 800. This chip contains dedicated subsystems for camera, display, video, audio/ voice, sensors, graphics, cellular modem, Wi- Fi, and more. Each subsystem contains dedi- cated hardware, and many contain special- purpose processing engines and software cus- tomized to the task. The Snapdragon 800 has two instances of the Hexagon digital-signal processor (DSP). The modem (mDSP) is dedicated and customized for modem processing, whereas the application DSP (aDSP) is used for mul- timedia acceleration. The modem processor is a closed subsystem and is programmed only within Qualcomm Technologies. The multimedia DSP, however, is licensed for programming by OEMs and third-party soft- ware vendors. This article provides an over- view of the multimedia DSP and builds on the presentation from HotChips 25. 1 Figure 2 shows the various Hexagon gen- erations. Version 2 (V2) was the first pro- duction version and appeared in the initial Snapdragon mobile products in 2007. V3 featured an improved implementation with better power consumption. These early ver- sions of Hexagon targeted voice and audio processing. Example functions include wide- band vocoders, echo cancellation, audio postprocessing filters, MP3/AAC play- back, speaker protection algorithms, and so on. V4 and V5 expanded the application tar- gets to include image processing for camera and video; computer vision tasks such as hand, gesture, and face recognition; and processing of sensor input (gyro, accelerome- ter, fingerprint, and so on). Unless otherwise noted, this article will focus on the latest Hexagon V5 core. Lucian Codrescu Willie Anderson Suresh Venkumanhanti Mao Zeng Erich Plondke Chris Koob Ajay Ingle Charles Tabony Rick Maule Qualcomm ....................................................... 34 Published by the IEEE Computer Society 0272-1732/14/$31.00 c 2014 IEEE

Upload: vulien

Post on 12-Aug-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

.................................................................................................................................................................................................................

HEXAGON DSP: AN ARCHITECTUREOPTIMIZED FOR MOBILE MULTIMEDIA

AND COMMUNICATIONS.................................................................................................................................................................................................................

THE QUALCOMM HEXAGON DSP IS USED FOR BOTH MODEM PROCESSING AND

MULTIMEDIA ACCELERATION. BY OFFLOADING MULTIMEDIA TASKS FROM THE CPU TO THE

DSP, SIGNIFICANT POWER SAVINGS CAN BE ACHIEVED. THIS ARTICLE PROVIDES AN

OVERVIEW OF THE HEXAGON ARCHITECTURE. THE PROCESSOR IS DESIGNED TO DELIVER

SUPERIOR ENERGY EFFICIENCY COMPARED TO MOBILE CPU ALTERNATIVES AND THEREBY

HELP ACHIEVE LONG BATTERY LIFE FOR IMPORTANT MOBILE APPLICATIONS.

......To be competitive, a modernmobile product must provide a rich user expe-rience and long battery life. Chips for theseecosystems integrate multiple subsystems,each customized for a particular applicationdomain. By specializing a subsystem to a task,performance and power can be enhancedbeyond what is possible with a homogenousCPU-based computing platform.

Figure 1 shows a block diagram of theSnapdragon 800. This chip contains dedicatedsubsystems for camera, display, video, audio/voice, sensors, graphics, cellular modem, Wi-Fi, and more. Each subsystem contains dedi-cated hardware, and many contain special-purpose processing engines and software cus-tomized to the task.

The Snapdragon 800 has two instances ofthe Hexagon digital-signal processor (DSP).The modem (mDSP) is dedicated andcustomized for modem processing, whereasthe application DSP (aDSP) is used for mul-timedia acceleration. The modem processoris a closed subsystem and is programmedonly within Qualcomm Technologies. The

multimedia DSP, however, is licensed forprogramming by OEMs and third-party soft-ware vendors. This article provides an over-view of the multimedia DSP and builds onthe presentation from HotChips 25.1

Figure 2 shows the various Hexagon gen-erations. Version 2 (V2) was the first pro-duction version and appeared in the initialSnapdragon mobile products in 2007. V3featured an improved implementation withbetter power consumption. These early ver-sions of Hexagon targeted voice and audioprocessing. Example functions include wide-band vocoders, echo cancellation, audiopostprocessing filters, MP3/AAC play-back, speaker protection algorithms, andso on.

V4 and V5 expanded the application tar-gets to include image processing for cameraand video; computer vision tasks such ashand, gesture, and face recognition; andprocessing of sensor input (gyro, accelerome-ter, fingerprint, and so on). Unless otherwisenoted, this article will focus on the latestHexagon V5 core.

Lucian Codrescu

Willie Anderson

Suresh Venkumanhanti

Mao Zeng

Erich Plondke

Chris Koob

Ajay Ingle

Charles Tabony

Rick Maule

Qualcomm

.......................................................

34 Published by the IEEE Computer Society 0272-1732/14/$31.00�c 2014 IEEE

• aDSP: Real-timemedia & sensorprocessing

Camera KraitAdreno

Krait

Snapdragon800

Audio

Display

JPEG

Video

CPUAdrenoGPU

KraitCPU

CPU

KraitCPU

Misc.connectivity

HexagonaDSP

Sensors

Other

Multimedia fabric System fabric

2-Mbyte L2

ModemFabric & memory controllerHexagon

mDSP

LPDDR3 LPDDR3• mDSP: Dedicated

modem processing

Figure 1. Snapdragon 800 block diagram. The chip contains dedicated subsystems for camera, display, video, audio/voice,

sensors, graphics, cellular modem, and Wi-Fi.

V3 mDSP45 nm

June 2009

V4 mDSP28 nm

Dec. 2010

V5 mDSP28 nm

Dec. 2012

V4 aDSPLow-tier28 nm

Apr. 2011V2 aDSP

65 nmDec. 2007

V3 aDSPLow-tier45 nm

Nov. 2009

V1 aDSP65 nm

Oct. 2006

V3 aDSP45 nm

Aug. 2009

V4 aDSP28 nm

Dec. 2010

V5 aDSP28 nm

Dec. 2012

Time

Figure 2. Hexagon digital-signal processor (DSP) evolution. The figure shows the evolution from Version 1 in October 2006

through Version 5 in December 2012.

.............................................................

MARCH/APRIL 2014 35

Hexagon is a multithreaded very longinstruction word (VLIW) DSP. The designphilosophy is to maximize work per cycle forperformance, but target the microarchitec-ture to modest clock speeds and low power.

Instruction-set architecture overviewThe architecture’s foundation is a stati-

cally scheduled four-way VLIW. The VLIWapproach puts the burden of instruction par-allelism on the compiler and thereby avoidscostly and power-hungry dynamic-schedulinghardware.2 The VLIW approach is popularamong commercial DSPs. Figure 3 shows ablock diagram of Hexagon.

Registers and memoryThe Hexagon processor features a unified

byte-addressable memory. This memory hasa single 32-bit virtual address space that holdsboth instructions and data. It operates inlittle-endian mode. A full-featured memorymanagement unit (MMU) translates virtualto physical addresses.

All user-level registers are replicated perthread. There are two sets of user registers:general registers and control registers. Thegeneral registers include thirty-two 32-bitregisters that can be accessed either as singleregisters or as aligned 64-bit register pairs.The general registers contain all pointer, sca-lar, vector, and accumulator data. The con-trol registers include special-purpose registerssuch as the program counter, status register,and loop registers.

Data-processing instructionsThere are two identical 64-bit single-

instruction, multiple-data (SIMD) executionunits. Each unit supports all multiply, shift,arithmetic logic unit (ALU), and bit manipu-lation instructions. Supported data typesinclude

� 8-, 16-, 32-, and 64-bit integer;� 16- and 32-bit fractional with optional

rounding and saturation;� 16-bit complex; and� single-precision IEEE-compatible float-

ing point.

Each unit is capable of supporting:

� four 16� 16 multiplies;� two 32� 16 multiplies; or� one 32 � 32 multiply, one complex

multiply, or one floating-point fusedmultiply-add (FMA).

Many of the instructions are complex andapplication specific. Complex instructionstargeted to a particular application domaincan provide high performance and energyefficiency. For example, Figure 4 depicts acomplex multiply instruction used in a16-bit fixed-point fast-Fourier transform(FFT). Without such an instruction, it wouldtake four multiplies, four shifts, four adds,and two saturates to perform the operation.It should be clear that packing all the work ina single instruction executed in a single pipe-lined execution unit provides large efficiencygains.

The Hexagon instruction set architecture(ISA) contains numerous special-purposeinstructions designed to accelerate key multi-media kernels. Multimedia algorithms withspecial instruction support include

Instruction unit

L2cache/

TCM

Data unit(load/store/ALU)

Data Unit(load/store/ALU)

Executionunit

(64-bitvector)

Executionunit

(64-bitvector)

Data cache

Register file/thread

Instructioncache

Figure 3. Hexagon block diagram. The architecture features a four-wide very

long instruction word (VLIW) with dual load/store and dual single-instruction,

multiple-data (SIMD) execution units and supports hardware multithreading.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

36 IEEE MICRO

� variable-length encode/decode, suchas context-adaptive binary-arithmetic-coding processing in H.264 video;

� features from accelerated segmenttest (FAST) corner detection imageprocessing;

� FFTalgorithms;� sliding-window filters;� linear-feedback shift;� table lookup from an arbitrary bit

field index;� elliptic curve cryptography; and� cyclic redundancy check (CRC)

calculation.

Load/store instructionsDual load/store units access signed or

unsigned 8-, 16-, 32-, and 64-bit values inmemory. There is a rich variety of addressingmodes, including

� absolute 32-bit,� base plus scaled immediate and base

plus scaled register,� auto-incrementing by register and

immediate,� circular addressing, and� bit reversed.

To increase the number of instructioncombinations allowed in packets, the load/store units also support 32-bit ALUinstructions.

Conditional execution and program flowThe Hexagon ISA includes conditional

execution. Conditional execution is useful toremove branches through if-conversion andis helpful for a VLIW processor. Compareinstructions target one of four predicate regis-ters. These predicate registers can then beused to conditionally execute certain instruc-tions. Not all instructions can be condi-tional—only the most common load/storeand ALU instructions.

A unique feature of Hexagon conditionalexecution is that the processor can generateand use a predicate in the same VLIW in-struction packet. This reduces packet countand creates denser packets, both of whichimprove performance and reduce energy con-sumption. Consider the following C state-ment and the corresponding assembly code

that is generated from it by the compiler. The“.new” suffix implies the source predicate isgenerated in the same packet. In this exam-ple, the dot-new construct enables the workto be done in one instruction packet insteadof two.

The C statement is as follows:if (R2 == 4)

R3 = *R4;

else

R5 = 5;

Assembly code with braces delineatepacket boundaries:{

P0 = cmp.eq(R2,#4)

if (P0.new) R3 = memw(R4)

if (!P0.new) R5 = #5

}

Similar to many DSP processors, Hexa-gon includes a zero-overhead hardwarecounted looping mechanism with supportfor two levels of nesting. An instruction isused to initialize the loop count and the startaddress. Bits encoded in the last packet of theloop delineate the end of the loop. Thisarchitecture allows execution of loops with

Rs R

R

32 32 32

Add Add

Sat_32Sat_32

High 16bitsHigh 16bits

RdRI

32

0x80000x8000 <<0–1<<0–1

<<0–1<<0–1

∗ ∗ ∗ ∗

R

RRt

Rs

RtI

I

I

I

Figure 4. Complex multiply instruction. Such an instruction executed in a

single pipelined execution unit provides large efficiency gains for

applications that use complex arithmetic.

.............................................................

MARCH/APRIL 2014 37

no branch mispredicts or stalls, and no hard-ware devoted to loop branch prediction.

Compound and memop instructionsCompound instructions combine two or

more dependent operations in a single instruc-tion. These instructions improve code size andsave power by reducing register file and for-warding power. Hexagon includes many suchinstructions, including shift-add, shift-or, add-add, compare-branch, shift-xor, and many fla-vors of the classic DSP multiply-add.

Another class of instruction performs sim-ple operations directly on memory, includingadd, subtract, logical–or, and logical–and.Without these memory operations (mem-ops), three instructions would be necessary toperform the same task: one to load the value,one to perform the arithmetic or logical oper-ation on the value, and one to store the result.Memops improve code size and reduce powerbecause intermediate register access is notneeded.

VLIW instruction groupingVLIW instruction packets are variable

sized and contain one to four instructions. Ifa packet contains more than one instruction,the instructions execute in parallel. Theinstruction combinations allowed in a packetare limited to the instruction types that canbe executed in parallel in the four executionunits. The processor uses parallel executionsemantics. All registers are read, then allinstructions are executed, then all registersare written.

Duplex instructionsLow code size is advantageous for an

embedded processor. Hexagon instructionsare fixed size and 32 bits in length. Toimprove code size, the Duplex feature enablessome use of 16-bit instructions by creating a32-bit subpacket containing two 16-bitinstructions. These subpackets are calledduplexes. Figure 5 shows a visualization of aduplex.

Because duplexes are always 32 bits,packet sizes continue to be multiples of 32bits. This leads to a simpler and lower-powerimplementation as compared to instructionsets with a mixed 16-/32-bit instruction set.Additionally, duplexes must always end apacket, and are always dispatched to the sametwo execution units, which further simplifiesthe implementation. The instructions allowedin duplexes, called subinstructions, are the mostcommon subset of normal Hexagon instruc-tions, with reduced ranges of registers andimmediate operands.

Multithreading and microarchitectureThe Hexagon processor is multithreaded.

The number of threads varies by implemen-tation. Early implementations included sixhardware threads, but more recent coresinclude three hardware threads. There aremany trade-offs in choosing the number ofthreads. Additional threads provide morelatency tolerance and enable power-savingopportunities in the microarchitecture byserializing work rather than speculatingwork. On the other hand, additional threadsincrease cache pressure and increase the soft-ware programming burden. Our experienceis that three or four threads are a sweet spotin the design space.

Hexagon is designed to look like a multi-core architecture with communicationthrough shared memory. Figure 6 shows howthe processor appears to the programmer.Software threads are mapped to hardwarethreads by the operating system.

In the physical implementation, however,there is only one processor, which the threehardware threads share. Hexagon V1 throughV4 implemented a simple round-robin inter-leaved multithreading (IMT) approach.3 Onevery clock tick, a different thread is given a

Each box is an instruction.

Packet of four 32-bit instructions

Packet of two 32-bit instructions and a duplex

32 bits

32 bits

32 bits

32 bits

32 bits

32 bits

16 bits 16 bits

Figure 5. Visualization of a duplex. A duplex

is a 32-bit subpacket containing two 16-bit

instructions.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

38 IEEE MICRO

turn at each pipe stage. Figure 7 shows athree-stage execution pipeline and with threethreads taking turns dispatching packets.

With the number of threads matched tothe execution pipe depth, all of a thread’sinstructions from a VLIW packet are com-plete before the next VLIW packet starts.

Because there is no observable latency, thecompiler is not concerned with instructionlatency and scheduling for latency. Thisyields higher VLIW packet density. Wheninstructions are no longer needed to hidelatency, they can be used instead to fill thepackets.

Thread 0 Thread 1 Thread 2

Shared instruction cache

DU DU L2cache/ TCM

XU DU XUDU XU

Register file Register file Register file

Shared data cache

DU DU XUXU XU

Figure 6. The programmer’s view of multithreading. To the programmer, it appears as three

VLIW cores with shared caches. Software threads are mapped to hardware threads by the

Hexagon operating system.

Thread 0 dispatch Thread 1 dispatch Thread 2 dispatch

T0: { Ld Ld Add Cmp } T1: { St Ld Mpy Add }

T0: { Ld Ld Add Cmp }

T2: { Ld

Add Jump }

T1: { St Ld

Mpy

Add }

T0: { Ld

Ld

Add Cmp

}

Figure 7. Interleaved multithreading. The figure shows a three-stage execution pipeline with three threads taking turns

dispatching packets.

.............................................................

MARCH/APRIL 2014 39

The simple IMT model and simple in-order pipeline yield a small, low-power pro-cessor that is critical to meeting the aggressivearea and power targets.

The obvious problem with IMT is thatwhen threads are idle or stalled, their sliceof the processor goes unused. Starting withHexagon V5, a more dynamic approach tothread scheduling has been implemented.Often, packets contain only simple instruc-tions and can be completed in fewer thanthree cycles. With Hexagon V5, the pro-cessor will opportunistically execute packetsfaster if threads are idle or stalled and simplepackets are available. The design philosophyis not to shoot for the best single-threadperformance, but rather to provide someperformance boost when it is easy to doso and without compromising energyefficiency.

Figure 8 shows the instructions percycle (IPCs) for various multimediabenchmarks. The benchmarks are sortedinto multithreaded applications on theleft, and single-threaded on the right. Theboost from the V5 dynamic multithreading

(DMT) is shown as the additional (striped)bar on top of the baseline (solid) IMT bar.

In the Snapdragon 800 implementation,the DSP runs up to 800 MHz. The instruc-tion cache is 16 Kbytes, the data cache is 32Kbytes, and the level-2 (L2) cache is 256Kbytes. Connection to main memory is pro-vided over a 64-bit system bus that runs at240 MHz.

System programming modelCommunication between the DSP and

CPU is done through a traditional shared-memory-plus-interrupt mechanism. Both theDSP and CPU can access the full physicaladdress space and share the external memory.Access to memory is cache based, and there isno explicit data mover. The DSP includes anextensive prefetching capability to help hidecache latency. The CPU and DSP are notcache coherent with each other, so coherencymust be maintained in software with explicitcache maintenance operations.

A software remote procedure call (RPC)interface lets a CPU application offload workto the DSP. When an RPC is made, any dataassociated with the call is flushed to mainmemory from the CPU caches and mappedinto the DSP virtual address space. The DSPis then interrupted to process the RPC call,after which any results are flushed from theDSP caches back to main memory, and acompletion interrupt is sent to the CPU.

The overheads of software-managedcoherency preclude offloading very small tasksto the DSP. Large kernels that run continu-ously or process large data (full image frames)are typically needed to amortize the overhead.

PowerBattery life is extremely important in

mobile computing. Offloading applicationsfrom the CPU to a specialized low-powerprocessing engine such as Hexagon is criticalto achieving the power goals. In addition tothe ISA and microarchitecture for efficientmultimedia processing, Hexagon is imple-mented with aggressive low-power designtechniques, including hierarchical clock gat-ing with a custom clock tree, voltage scalingwith split-grid memories, pulse latches

4.5

5.0

0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Feat

ure

extr

act

SIF

T

Face

det

ect

H26

4 d

ecod

e

VP

8 d

ecod

e

Wav

e d

enoi

se

JPE

GD

MP

3

2D-3

D C

vt

BM

3D d

enoi

se

JPE

GC

Cor

emar

k

Dhr

ysto

ne

spec

95_g

o

List

en

Inst

ruct

ions

per

cyc

le

IPC_DMT

IPC_IMT

Figure 8. Performance on multimedia applications. The figure shows the

instructions per cycle (IPCs) for various multimedia benchmarks, which are

sorted into multithreaded applications (left) and single-threaded applications

(right).

..............................................................................................................................................................................................

HOT CHIPS

............................................................

40 IEEE MICRO

instead of flip-flops, and full-custom cachesand register files designed for low power. Amore complete description of the low-powertechniques used by Hexagon is availableelsewhere.4

An important power benchmark formobile phones is MP3 playback. Figure 9compares current measured at the battery forvarious competitive smartphones. Batterycurrent includes all components of systempower, such as the CPU, DSP, memory, andI/O. The data is presented as (battery currentin mA for the device divided by battery cur-rent in mA for the Hexagon-based chip). A2� delta on this chart represents twice thehours of music playback, which is a key mar-keting and user-experience metric.

Google recently announced support forDSP offload of audio playback in theAndroid 4.4 (KitKat). Quoting from the“Audio Tunneling to DSP” section on Goo-gle’s Android developer webpage:5

For high-performance, lower-power audio play-back, Android 4.4 adds platform support foraudio tunneling to a digital signal processor (DSP)in the device chipset. With tunneling, audiodecoding and output effects are off-loaded to theDSP, waking the application processor less oftenand using less battery.

Audio tunneling can dramatically improve bat-tery life for use-cases such as listening to musicover a headset with the screen off. For example,with audio tunneling, Nexus 5 offers a total off-network audio playback time of up to 60 hours,an increase of over 50% over non-tunneled audio.

The Nexus 5 device uses Snapdragon 800,and the audio offload is done to the HexagonV5 aDSP.

Figure 10 shows another example of off-loading a computer-vision-object detectionalgorithm from the CPU to the HexagonDSP. Initially, the algorithm is run on theCPU. The algorithm is fully optimized withNeon SIMD instructions. After offloading tothe DSP, the same algorithm is called via anRPC. Because the algorithm is no longer run-ning on the CPU, the CPU load is reduced.In terms of speed, the algorithm is marginallyfaster on the DSP. This includes all RPCoverheads. But, significantly, the total systempower as measured at the battery has beenreduced by 32 percent.

3.0

3.5

2.0

2.5

1.0

1.5

0.5

Competi

torA

Competi

torB

Competi

torC

Competi

torD

Competi

torE

Competi

torF

Competi

torG

Hexag

on-b

ased

Bat

tery

cur

rent

in m

A r

elat

ive

to H

exag

on-b

ased

chi

p

0

Figure 9. MP3 playback battery current for various smartphones. Lower

bars represent more hours of music playback, which is a critical customer

metric.

1.00

1.20

0.00

0.20

0.40

0.60

0.80

CPU utilization (%) Detection time(msec)

Total power (mW)

ARM/Neon Hexagon app DSP

Figure 10. An example of feature detection offload. The algorithm is

offloaded from the CPU to DSP, which results in comparable performance

but much reduced battery current.

.............................................................

MARCH/APRIL 2014 41

Figure 11 provides additional examples ofperformance and power, comparing the Hex-agon V5 DSP to the CPU used in the Snap-dragon 200 chip. Each chart features adifferent algorithm used in a computer visionapplication. For the CPU, all code is fullyoptimized using Neon SIMD instructions.Both single-CPU and quad-CPU data isshown. The x-axis shows latency in units oftime/pixel, and the y-axis shows energy/pixel.The origin is (0,0) in all charts. Power ismeasured at the battery and includes all sys-tem power. When compared to the quadCPU in Snapdragon 200, the Hexagon V5DSP provides similar or better performanceand lower power in all cases.

T his article provides an overview of theHexagon DSP architecture. The

demand for low-power signal processing inmobile applications continues unabated.Camera and video applications requiresophisticated signal processing at ultra-highdefinition resolution. At the same time,“always-on” voice activation features arepushing power requirements to new lows.Future versions of Hexagon DSP will beenhanced and specialized to tackle theseupcoming challenges.

For readers who would like to explorefurther, the Hexagon Software Developer’sKit (https://developer.qualcomm.com/mobile-development/maximize-hardware/multimedia-optimization-hexagon-sdk/multimedia-optimization-h-2) provides everythingneeded to program the DSP, includingfull documentation, software tools, acycle-approximate simulator, and examplecode. MICRO

....................................................................References1. L. Codrescu et al., “Qualcomm Hexagon

DSP: An Architecture Optimized for Mobile

Multimedia and Communications,” Hot

Chips 25, 2013.

2. J. Fisher, “VLIW Architectures: An Inevita-

ble Standard for the Future?” J. Supercom-

puter, vol. 7, no. 2, 1990, pp. 29-36.

3. B. Smith, “Architecture and Applications of

the HEP Multiprocessor Computer Sys-

tem,” SPIE Real Time Signal Processing IV,

1981, pp. 241-248.

4. M. Saint-Laurent et al., “A 28 nm DSP Pow-

ered by an On-Chip LDO for High-Per-

formance and Energy-Efficient Mobile

Applications,” to be published in Proc. IEEE

Int’l Solid-State Circuits Conf., 2014.

Canny

Time/pixel(nSec)

Ene

rgy/

pix

el(n

J)

Fast9

Time/pixel(nSec)

Ene

rgy/

pix

el(n

J)

Intensity histogram

Time/pixel(nSec)

Ene

rgy/

pix

el(n

J)

Harris

Time/pixel(nSec)

Ene

rgy/

pix

el(n

J)

NCC

Time/pixel(nSec)

Ene

rgy/

pix

el(n

J)

- CPUx1 CPU @ 1.2 GHz - CPUx4 - DSP DSP @ 690 MHz

Figure 11. Energy versus latency for multimedia benchmarks. Each chart features a different algorithm used in a computer

vision application.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

42 IEEE MICRO

5. “Android KitKat: New Media Capabilities,”

Android.com, Jan. 2014, http://developer.

android.com/about/versions/kitkat.html#44-

media.

Lucian Codrescu is a senior director atQualcomm, where he leads the HexagonArchitecture team. Codrescu has a PhD incomputer engineering from the GeorgiaInstitute of Technology.

Willie Anderson is a vice president at Qual-comm, where he runs the DSP engineeringorganization. Anderson has a BS in physicsfrom the University of Texas at Austin.

Suresh Venkumanhanti is a principal engi-neer at Qualcomm, where he is responsiblefor the DSP core microarchitecture. Venku-manhanti has an MS in electrical engineer-ing from the University of Florida.

Mao Zeng is a senior staff engineer at Qual-comm and a principal designer of the DSPinstruction set architecture. Zeng has a PhDin electrical engineering from the Universityof Victoria.

Erich Plondke is a senior staff engineer atQualcomm, where he leads the systemarchitecture team. Plondke has an MS inelectrical engineering from the GeorgiaInstitute of Technology.

Chris Koob is a principal engineer at Qual-comm, where he is responsible for the memorysystem microarchitecture. Koob has an MS inelectrical engineering from Virginia Tech.

Ajay Ingle is a principal engineer at Qual-comm, where he is responsible for DSParchitecture support of modem applications.Ingle has an MS in electrical engineeringfrom Texas A&M University.

Charles Tabony is a senior engineer atQualcomm, where he contributes to theinstruction set design and performance veri-fication. Tabony has a BS in computer engi-neering from the University of Texas.

Rick Maule is a senior director at Qual-comm, where he is responsible for DSPproduct management. Maule has an MS inelectrical engineering from the University ofArkansas.

Direct questions and comments about thisarticle to Lucian Codrescu at [email protected].

.............................................................

MARCH/APRIL 2014 43