

12th ASP-DAC, Jan. 26, 2007

An Embedded Low Power/Cost 16-Bit Data/Instruction Microprocessor Compatible with ARM7 Software Tools

Fu-Ching Yang and Ing-Jer Huang

Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan

[email protected]@cse.nsysu.edu.tw

Outline

Motivation

Related Work

Proposed Solution
– Register file merging
– Architecture modification
– Software tool reuse

Experimental Result

Conclusion


Motivation

Short-precision oriented applications
– Over 50% of the operations are less than 16 bits wide

Brooks, D. and Martonosi, M., "Dynamically exploiting narrow width operands to improve processor power and performance," Proc. of the Fifth Intl. Symposium on High-Performance Computer Architecture, pp. 13-22, 1999.

[Figure: Bitwidths for SPECint95 on 64-bit Alpha. About 50% of the operations are less than 16 bits wide.]


Related work

Dynamic exploration of data bit width

Static compiler analysis
– ARM7/ARM9 cores:
– 32-bit RISC+CISC instruction set
– 16-bit RISC
– 32-bit data

What if the major data types in your SoC are 8 or 16 bits?
– A 32-bit data path may be overkill
– Bad for performance, power, chip size, etc.

Memory bandwidth bottleneck

ARM7/ARM9 cores perform well
– Memory bandwidth: 32/64 bits

What if your SoC cannot afford such bandwidth?
– In cost-sensitive applications
– 16-bit bus: 2 cycles per instruction (or data) fetch
– 8-bit bus: 4 cycles per instruction (or data) fetch
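The cycle counts above follow from dividing the fetched item's width by the bus width. A minimal sketch of that arithmetic, assuming one bus-wide transfer per cycle and no wait states (the function name `fetch_cycles` is made up for illustration):

```python
import math


def fetch_cycles(item_bits: int, bus_bits: int) -> int:
    """Bus cycles to fetch one item, assuming one bus-wide transfer per cycle."""
    return math.ceil(item_bits / bus_bits)


# A 32-bit ARM instruction (or data word) on progressively narrower buses.
for bus in (32, 16, 8):
    print(f"{bus}-bit bus: {fetch_cycles(32, bus)} cycle(s) per 32-bit fetch")

# A 16-bit instruction on a 16-bit bus still needs only one cycle.
print(f"16-bit bus: {fetch_cycles(16, 16)} cycle(s) per 16-bit fetch")
```

Under these assumptions, a core whose instructions and data are both 16 bits wide avoids the multi-cycle fetch penalty on a 16-bit bus.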


Proposed solution

A 16-bit Thumb microprocessor: SYS16TM
– 16-bit instruction set: same as THUMB
– 16-bit data path
– An extra-low-cost ARM-based solution
– Compatible with most of ARM's existing development tools

Architecture challenges
1. Register file bank merging
2. Pipeline trimming
3. Reuse of ARM's development environment

Register file bank merging

R0~R12: 16 bits

SP, LR, PC: 32 bits

CPSR, SPSR: 32 bits

Define a new instruction to access the high half-word of SP, LR, and PC
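The register layout above can be modelled in a few lines. This is only an illustrative sketch: the slide does not give the new instruction's name or encoding, so `read_high` and `write_high` are hypothetical stand-ins for it, and the status registers are left out.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
WIDE = {"SP", "LR", "PC"}  # 32-bit registers, per the slide


class RegisterFile:
    """Toy model: R0-R12 hold 16 bits; SP, LR and PC hold 32 bits."""

    def __init__(self):
        self.regs = {f"R{i}": 0 for i in range(13)}
        self.regs.update({name: 0 for name in WIDE})

    def write(self, name, value):
        # Values are truncated to the register's width.
        mask = MASK32 if name in WIDE else MASK16
        self.regs[name] = value & mask

    def read_low(self, name):
        # What a 16-bit data path sees in one access.
        return self.regs[name] & MASK16

    def read_high(self, name):
        # Hypothetical stand-in for the new high half-word instruction.
        if name not in WIDE:
            raise ValueError(f"{name} has no high half-word")
        return (self.regs[name] >> 16) & MASK16

    def write_high(self, name, half):
        # Hypothetical counterpart that updates only the upper 16 bits.
        if name not in WIDE:
            raise ValueError(f"{name} has no high half-word")
        low = self.regs[name] & MASK16
        self.regs[name] = ((half & MASK16) << 16) | low


rf = RegisterFile()
rf.write("R0", 0x12345)      # truncated to 16 bits
rf.write("SP", 0x20001000)   # kept at 32 bits
print(hex(rf.regs["R0"]), hex(rf.read_high("SP")), hex(rf.read_low("SP")))
```

The point of the sketch is that a 16-bit data path can still handle 32-bit addresses in SP, LR, and PC, as long as each half-word can be reached separately.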

Architecture changes

Fetch stage
– Replace the adder with an add-by-2 unit

Decode stage
– Remove the THUMB decompressor
– Modify the decoder to decode THUMB directly
– Modify the register file to single mode

Execution stage
– Modify the multiplier
– Remove the adder after the multiplier
– Cut the shifter path before the ALU

Decode THUMB instruction directly

[Figure: Decode stage.]

Execute stage modification

Coding rules for reusing ARM7's software environment

[Figure: tool flow from C code through the compiler, assembler, and linker (C code, assembly code, object, EXE), annotated with the rules below.]

Declare "short" instead of "int" in C language
– The registers in SYS16TM are 16 bits

Remove shift instruction patterns
• Generated by a conventional 32-bit compiler
• Clear the upper 16 bits of a value
• Emulate 16-bit computation with a 32-bit data path

The interrupt table maintains the same addressing
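Why the shift patterns can be removed is easy to check numerically. The sketch below assumes the pattern is a logical shift left by 16 followed by a logical shift right by 16, a common way for a 32-bit compiler to truncate an unsigned 16-bit result. The slides do not list the exact instructions, so treat the sequence and the function names as assumptions.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def add_short_on_32bit_path(a: int, b: int) -> int:
    """32-bit data path: add, then shift left/right by 16 to clear the upper bits."""
    total = (a + b) & MASK32          # 32-bit register result
    shifted = (total << 16) & MASK32  # assumed LSL #16
    return shifted >> 16              # assumed LSR #16


def add_short_on_16bit_path(a: int, b: int) -> int:
    """16-bit data path: the register width already discards the upper bits."""
    return (a + b) & MASK16


# Both paths agree, so the shift pair does no useful work on 16-bit registers.
for a, b in [(0x1234, 0x4321), (0xFFFF, 0x0001), (0x8000, 0x8000)]:
    assert add_short_on_32bit_path(a, b) == add_short_on_16bit_path(a, b)
    print(hex(a), "+", hex(b), "=", hex(add_short_on_16bit_path(a, b)))
```

On a 16-bit register the truncation happens implicitly, which is why stripping the shift pair from the compiler's assembly output preserves the result.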


Chip size, clock frequency

TSMC 0.18 µm process     ARM7       SYS16TM    SYS16TM/ARM7
Gate count               50,612     20,232     40%
Max. clock rate (MHz)    91         137        151%
Power                    5.33 mW    2.7 mW     51%
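The SYS16TM/ARM7 percentages can be recomputed from the raw figures on this slide. A small check, assuming the ARM7 values are 50,612 gates, 91 MHz, and 5.33 mW and the SYS16TM values are 20,232 gates, 137 MHz, and 2.7 mW (the dictionary and function names are illustrative only):

```python
arm7 = {"gate_count": 50612, "max_clock_mhz": 91, "power_mw": 5.33}
sys16tm = {"gate_count": 20232, "max_clock_mhz": 137, "power_mw": 2.7}


def ratio_percent(metric: str) -> int:
    """SYS16TM value as a rounded percentage of the ARM7 value."""
    return round(100 * sys16tm[metric] / arm7[metric])


ratios = {metric: ratio_percent(metric) for metric in arm7}
for metric, pct in ratios.items():
    print(f"{metric}: SYS16TM is {pct}% of ARM7")
```

These ratios line up with the summary figures quoted in the conclusions: a 40% gate-count ratio is the "60% smaller" claim, and a 151% clock ratio is the "51% faster" claim.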

The gate count reduction

[Figure: gate count reduction by pipeline stage (Fetch, Decode, Execution); y-axis: gate count reduction, 0.0 to 1.0; bar labels: 42%, 34%, 77%.]

Static code size: SYS16TM vs. ARM7

SYS16TM's code size is smaller: 61% of the ARM7 code size

[Figure: code size per benchmark (qsort, dijkstra, hanoi, GCD, fibonacci, knight, jpeg encoder); series: ARM7 code size and SYS16TM code size.]

Executed internal instruction count

SYS16TM requires 1.12 times the instruction count of ARM7 on average

[Figure: instruction ratio (SYS16TM/ARM7) per benchmark (qsort, dijkstra, hanoi, GCD, fibonacci, knight, jpeg encoder); y-axis: 0.0 to 2.0.]

When considering limited memory bandwidth

With 32-bit memory bandwidth
– ARM7 is better than SYS16TM by 6%

With 16-bit memory bandwidth
– SYS16TM is better than ARM7 by 64%

[Figure: cycle count ratio per benchmark (qsort, dijkstra, hanoi, GCD, fibonacci, knight, jpeg encoder); series: ARM32 / SYS16 and ARM16 / SYS16.]

Microprocessor energy consumption

Consider the real-time application
– Assume both processors have to complete the applications at the same time

[Figure: energy ratio per benchmark (qsort, dijkstra, hanoi, GCD, fibonacci, knight, jpeg encoder); series: SYS16TM/ARM32 and SYS16TM/ARM16.]

Memory bandwidth    Energy consumption ratio (SYS16TM/ARM7)
32-bit              55%
16-bit              30%
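As a rough plausibility check, and not the authors' measurement method, the energy ratios can be approximated as power ratio times cycle-count ratio. This assumes the power figures (5.33 mW and 2.7 mW) were taken at the same clock rate, reads "ARM7 is better by 6%" as SYS16TM needing 1.06 times the cycles, and reads "SYS16TM is better by 64%" as ARM7 needing 1.64 times the cycles. All three readings are assumptions.

```python
# Power ratio from the chip-size slide (SYS16TM / ARM7).
power_ratio = 2.7 / 5.33

# Cycle-count ratios (SYS16TM / ARM7), under the interpretations stated above.
cycle_ratio = {
    "32-bit": 1.06,       # assumed: SYS16TM needs 6% more cycles
    "16-bit": 1 / 1.64,   # assumed: ARM7 needs 64% more cycles
}

reported = {"32-bit": 0.55, "16-bit": 0.30}

# Energy is power times time; at equal clock rate, time scales with cycle count.
estimated = {bw: power_ratio * ratio for bw, ratio in cycle_ratio.items()}

for bw in reported:
    print(f"{bw}: estimated {estimated[bw]:.2f}, reported {reported[bw]:.2f}")
```

The estimates land within about two percentage points of the reported 55% and 30%, which suggests the slide's figures are mutually consistent under these assumptions.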


Conclusions: SYS16TM vs. ARM7

Proposed a 16-bit data, 16-bit instruction THUMB microprocessor

Features
– 60% smaller in gate count
– 39% smaller in program size
– Faster in maximal clock rate under the same technology
  – 51% for 0.18 μm
– 33% faster in cycles (at the same clock rate) under 16-bit memory bandwidth
– More than 49% power reduction
– 70% energy reduction with 16-bit memory bandwidth
– Utilizes the existing ARM7 software development environment with minor patching


Thank You