tms320c6x chapter programming 3 - uccs


Post on 25-Jan-2022


ECE 5655/4655 Real-Time DSP 3–1

Chapter 3 • TMS320C6x Programming

Introduction

In this chapter programming the TMS320C6x in assembly, linear assembly, and C will be introduced. Preference will be given to explaining code development for the DSK memory map. The basis for the material presented in this chapter is the course notes from TI’s C6000 4-day design workshop¹.

Programming Alternatives

Source        Tool                  Efficiency*    Effort
C             Compiler Optimizer    70 – 80%       Low
Linear ASM    Assembly Optimizer    95 – 100%      Medium
ASM           Hand Optimize         100%           High

* Typical efficiency versus hand-optimized assembly; see TI benchmarks for more information.

The figure also carries the label "Intrinsics".

1. TMS320C6000 DSP Design Workshop, Revision 4.0, June 2000.


Introduction to Assembly Language Programming

A Dot Product Example

• Recall the C6000 block diagram

[Figure: C6000 block diagram. The CPU contains functional units .D1, .M1, .L1, .S1 and .D2, .M2, .L2, .S2, register files A (A0–A15) and B (B0–B15), and the control registers. Internal buses connect the CPU to program RAM, data RAM, the EMIF (external sync/async memory), DMA, serial port, host port, boot load, timers, and power down logic.]

• To motivate this introduction to assembly programming, consider a basic sum of products or dot product example

$$y = \sum_{n=1}^{40} a_n x_n \tag{3.1}$$

• Assembly instructions will initially be shown only with limited detail

• In a later section the details of putting together an actual assembly file will be given

• The core of this algorithm is multiplication and addition
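As a point of reference for the assembly that follows, equation (3.1) can be sketched in C. This is only an illustrative model, not code from the workshop notes: the function name dot_product and the choice of 16-bit inputs with a 32-bit sum are assumptions, chosen to match the 16-bit multiply with 32-bit result used in this chapter.

```c
#include <stdint.h>

/* Reference model of equation (3.1): y = sum over n of a[n] * x[n].
 * Assumption: 16-bit inputs and a 32-bit running sum. */
int32_t dot_product(const int16_t *a, const int16_t *x, int n)
{
    int32_t y = 0;
    for (int i = 0; i < n; i++) {
        /* widen before multiplying so the product keeps all 32 bits */
        y += (int32_t)a[i] * (int32_t)x[i];
    }
    return y;
}
```

For equation (3.1) the call would use n = 40 with the arrays a and x.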


• To multiply we use the .M (multiply) unit

        MPY   .M    a, x, prod

– As shown here MPY calls a 16-bit multiply which gives a 32-bit result

• To add or accumulate we use the .L (logical) unit

        MPY   .M    a, x, prod
        ADD   .L    Y, prod, Y

Where are the variables stored?
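The "16-bit multiply which gives a 32-bit result" can be modeled in C as below. This is a sketch under the assumption that MPY treats the lower 16 bits of each source register as a signed number; the helper names low16_signed and mpy are made up for this illustration.

```c
#include <stdint.h>

/* Interpret the low 16 bits of a 32-bit register value as a signed number. */
static int32_t low16_signed(uint32_t reg)
{
    uint32_t low = reg & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 65536 : (int32_t)low;
}

/* Model of MPY: signed 16-bit times signed 16-bit gives a 32-bit product.
 * The upper halves of the source registers are ignored. */
int32_t mpy(uint32_t src1, uint32_t src2)
{
    return low16_signed(src1) * low16_signed(src2);
}
```

Because both operands are at most 16 bits wide, the product always fits in 32 bits.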


• Note that we need to store the working variables in a register file; the C6000 has two, but for now we will just use the A side

• We now rewrite the code to include the actual register names

        MPY   .M    A0, A1, A3
        ADD   .L    A4, A3, A4

Register File A (32-bit registers A0 to A15): A0 = a, A1 = x, A3 = prod, A4 = Y

• The original equation (3.1) specifies 40 multiply accumulates

• To create a loop we need:

– A branch instruction and a label

– A loop counter variable

– An instruction to decrement the loop counter

– A properly set branch condition


• The unit responsible for branching is the .S (branch) unit

        MVK   .S    40, A2
loop:
        MPY   .M    A0, A1, A3
        ADD   .L    A4, A3, A4
        SUB   .L    A2, 1, A2
 [A2]   B     .S    loop

Register File A (32-bit registers A0 to A15): A0 = a, A1 = x, A2 = loop count, A3 = prod, A4 = Y

– MVK moves a 16-bit constant into the lower 16 bits of register A2

– We decrement the loop counter register by one using SUB, which uses the .L unit

– Conditional branch instructions execute based on the value held in A2; the general asm code form is
        [condition]  B  loop

– The [A2] means execute if A2 ≠ 0

– If we use [!A2] then execute only if A2 = 0

– On the C62x/C67x conditional registers are limited to A1, A2, B0, B1, B2

– Note: On the C64x the conditional registers are A0, A1, A2, B0, B1, B2
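The loop skeleton (counter load, decrement, conditional branch back to a label) maps onto C fairly directly. The sketch below is only a model of the control flow; the function name count_loop_passes is hypothetical, and the loop body is replaced by a pass counter so the behavior can be checked.

```c
/* C model of the loop skeleton: MVK sets the counter, SUB decrements it,
 * and [A2] B loop branches back while the counter is nonzero.
 * Assumption: loop_count >= 1, as in the 40-pass example. */
int count_loop_passes(int loop_count)
{
    int a2 = loop_count;   /* MVK  loop_count, A2 */
    int passes = 0;
loop:
    passes++;              /* stands in for the MPY/ADD loop body */
    a2 = a2 - 1;           /* SUB  A2, 1, A2 */
    if (a2 != 0)           /* [A2] B loop: branch only if A2 is nonzero */
        goto loop;
    return passes;
}
```

Using [!A2] instead would correspond to testing a2 == 0.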


• The next step is to get variables loaded into the register file

– We assume that the variables are located in memory (internal or external)

– We then create a pointer to the address of the variable and store it in a register

– Finally, we load the variable itself into another register

How do a and x get loaded?

– a, x, Y are located in memory: a[40], x[40], Y

– Create a pointer to the values
        A5 = &a
        A6 = &x
        A7 = &Y

– Use the pointer with load/store
        LD    *A5, A0
        LD    *A6, A1
        ST    A4, *A7

• The C notation of &a is used here to obtain the address of a, but there is more to this as we will see shortly

• The C62 has three load instructions and the C67 and C64 add a fourth

– The architecture allows byte-level addressing: bytes (8 bits), half-words (16 bits), and words (32 bits)

– Added on the C67/64 are double-words (64 bits)


• Load and store option summary:

Load instructions:

Instruction    Loads                   C type      Notes
LDB            8-bit byte              (char)
LDH            16-bit half-word        (short)
LDW            32-bit word             (int)
LDDW           64-bit double-word      (double)    C67x, C64x

Store instructions: STB, STH, STW, STDW (C64x)

• To carry out the load and store operations we use the .D (data) unit

• Note that as in C, *A5 takes the value pointed to by A5 and places the value into a register, here it is A0

        MVK   .S    40, A2
loop:   LDH   .D    *A5, A0
        LDH   .D    *A6, A1
        MPY   .M    A0, A1, A3
        ADD   .L    A4, A3, A4
        SUB   .L    A2, 1, A2
 [A2]   B     .S    loop
        STH   .D    A4, *A7

Register File A: A0 = a, A1 = x, A2 = loop count, A3 = prod, A4 = Y, A5 = &a[n], A6 = &x[n], A7 = &Y; the .D unit connects the register file to data memory.
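The load and store widths can be illustrated with a small C model. It assumes that LDB and LDH sign-extend the value read into a 32-bit register and that STH writes only the low 16 bits of the register; the function names simply mirror the mnemonics and are not part of any TI library.

```c
#include <stdint.h>
#include <string.h>

/* Load models: the value read from memory is sign-extended to 32 bits. */
int32_t ldb(const int8_t *addr)  { return (int32_t)*addr; }
int32_t ldh(const int16_t *addr) { return (int32_t)*addr; }
int32_t ldw(const int32_t *addr) { return *addr; }

/* STH model: only the low 16 bits of the 32-bit register reach memory. */
void sth(int32_t reg, int16_t *addr)
{
    uint16_t low = (uint16_t)((uint32_t)reg & 0xFFFFu);
    memcpy(addr, &low, sizeof low);   /* copy the bit pattern unchanged */
}
```

One consequence worth noting for the dot product code: STH A4, *A7 keeps only the low half of the 32-bit sum held in A4.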


• A remaining detail is the actual creation of a pointer, e.g., to x, a, and y

• Earlier we used MVK to move a 16-bit constant into the lower 16 bits of a register

• Now we want to move a 32-bit address corresponding to some label a

– MVKL .S a,A5 ;will move the lower 16 bits with sign extension

– MVKH .S a,A5 ;will move the upper or high 16 bits without altering the lower 16 bits

– Use MVKL and MVKH in ordered combination to load constants greater than 16 bits, and MVK for constants of 16 bits or less

• What should appear above the code MVK .S 40,A2 is:

        MVKL  .S    a,A5    ;store lower half of a
        MVKH  .S    a,A5    ;store upper half of a
        MVKL  .S    x,A6    ;store lower half of x
        MVKH  .S    x,A6    ;store upper half of x
        MVKL  .S    y,A7    ;store lower half of y
        MVKH  .S    y,A7    ;store upper half of y

• To properly loop over the data, the pointers need to be incremented

• The C notation “++” can be used to pre- or post-increment registers being used as pointers, e.g., *A5++ advances the address held in A5 by one element after it is used
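Why MVKL and MVKH must be used in that order can be seen from a small C model of the two instructions. The functions mvkl and mvkh below are illustrative stand-ins that follow the descriptions above (sign extension for MVKL, lower half preserved by MVKH); the example address is arbitrary.

```c
#include <stdint.h>

/* MVKL model: take the low 16 bits of the constant and sign-extend
 * them to fill the 32-bit register. */
uint32_t mvkl(uint32_t constant)
{
    uint32_t low = constant & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* MVKH model: write the upper 16 bits of the constant into the register
 * and leave the register's lower 16 bits unchanged. */
uint32_t mvkh(uint32_t constant, uint32_t reg)
{
    return (constant & 0xFFFF0000u) | (reg & 0xFFFFu);
}
```

With an address such as 0x8000ABCD, MVKL alone leaves 0xFFFFABCD in the register because of the sign extension; the following MVKH then overwrites the upper half to give 0x8000ABCD. Running MVKL last would undo the upper half again.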


• Pointer incrementing is summarized in the following figure:

[Figure: A5 holds &a and points into a0, a1, a2, ...; A6 holds &x and points into x0, x1, x2, ...]

After the first loop, A4 contains a0 * x0. How do you access a1 and x1 on the second loop? Post-increment the pointers in the loads:

        LDH   .D    *A5++, A0
        LDH   .D    *A6++, A1

The code then becomes

        MVK   .S    40, A2
loop:   LDH   .D    *A5++, A0
        LDH   .D    *A6++, A1
        MPY   .M    A0, A1, A3
        ADD   .L    A4, A3, A4
        SUB   .L    A2, 1, A2
 [A2]   B     .S    loop
        STH   .D    A4, *A7

• Since there is another set of function units we should have specified the side, e.g., .S1 for side A, etc.

[Figure: side A functional units .S1, .M1, .L1, .D1 with Register File A (A0–A15, 32 bits each) and side B functional units .S2, .M2, .L2, .D2 with Register File B (B0–B15, 32 bits each), both connected to data memory.]
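The post-incremented loads correspond closely to C pointer walking. The sketch below mirrors the assembly loop line by line, with local variables named after the registers; it is an illustrative model (the function name dot_product_ptr is an assumption) and it returns the 32-bit sum rather than storing a 16-bit result.

```c
#include <stdint.h>

/* Pointer-walking model of the assembly loop. Assumption: count >= 1. */
int32_t dot_product_ptr(const int16_t *a, const int16_t *x, int count)
{
    const int16_t *a5 = a;      /* A5 = &a[0] */
    const int16_t *a6 = x;      /* A6 = &x[0] */
    int32_t a4 = 0;             /* A4 = Y, cleared before the loop */
    int a2 = count;             /* MVK count, A2 */

    do {
        int32_t a0 = *a5++;     /* LDH *A5++, A0 : load, then advance */
        int32_t a1 = *a6++;     /* LDH *A6++, A1 */
        int32_t a3 = a0 * a1;   /* MPY A0, A1, A3 */
        a4 += a3;               /* ADD A4, A3, A4 */
        a2--;                   /* SUB A2, 1, A2 */
    } while (a2 != 0);          /* [A2] B loop */

    return a4;
}
```

As in C, each ++ moves the pointer to the next 16-bit element, so the second pass reads a1 and x1.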


• The final version of the A-side code is

        MVK   .S1   40, A2        ; A2 = 40, loop count
loop:   LDH   .D1   *A5++, A0     ; A0 = a(n)
        LDH   .D1   *A6++, A1     ; A1 = x(n)
        MPY   .M1   A0, A1, A3    ; A3 = a(n) * x(n)
        ADD   .L1   A3, A4, A4    ; Y = Y + A3
        SUB   .L1   A2, 1, A2     ; decrement loop count
 [A2]   B     .S1   loop          ; if A2 ≠ 0, branch
        STH   .D1   A4, *A7       ; *A7 = Y

– In the above we assume A4 is initially cleared

Instruction Set Summary by Category

Arithmetic:      ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Logical:         AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Data Mgmt:       LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Bit Mgmt:        CLR, EXT, LMBD, NORM, SET
Program Ctrl:    B, IDLE, NOP


C62xx and C67xx Instruction Set Summary by Unit

C62x instructions by unit:

.S Unit:         ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit:         ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit:         MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit:         ADD, ADDAB/H/W, LDB/H/W, MV, NEG, STB/H/W, SUB, SUBAB/H/W, ZERO
No unit used:    IDLE, NOP

Instructions added on the C67x:

.S Unit:         ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit:         ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit:         MPYSP, MPYDP, MPYI, MPYID
.D Unit:         ADDAD, LDDW

• The C67 adds 31 more instructions


• In total, the processor has only about 48 instructions, and hence is considered to be a RISC device

• Before going any further in assembly programming we need to spend some time studying the pipeline

Introduction to the Pipeline

• DSP microprocessors rely heavily on the performance advantages of pipelining; the C6x is no exception

• It would be nice to never have to worry about pipeline issues, but some exposure will be helpful in future programming

• Getting code to work only requires a few basic guidelines, while full optimization of the eight function units is beyond the scope of this section of the notes

• The basic operations of the CPU are:

– (F) Fetch or Program Fetch (PF): get an instruction from memory

– (D) Decode: figure out what type of instruction it is (ADD, MPY)

– (E) Execute: Actually perform the operation


Pipelined and Non-Pipelined

Clock cycle       1    2    3    4    5    6    7    8    9
Non-pipelined     F1   D1   E1   F2   D2   E2   F3   D3   E3
Pipelined         F1   D1   E1
                       F2   D2   E2
                            F3   D3   E3

In the pipelined case the pipeline is full at cycle 3.

• Once the pipeline is full the multiple buses of the C6x can carry out the F, D, and E operations in parallel, all within the same clock cycle

• On the downside, when discontinuities such as program branching occur, the pipeline must be flushed, which results in added processor overhead

Program Fetch Stage

• The program fetch stage actually is broken into four phases

– PG: Generate fetch address

– PS: Send address to memory

– PW: Wait for data ready

– PR: Read opcode
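The cycle counts in the pipelined versus non-pipelined comparison follow a simple rule that can be written down in C. This is a sketch for an idealized pipeline (one cycle per stage, no stalls, no branches); the function names are made up for the illustration.

```c
/* Cycles needed to run n instructions through `stages` stages,
 * assuming one cycle per stage and no stalls or branches. */
int non_pipelined_cycles(int n, int stages)
{
    return n * stages;                      /* each instruction runs alone */
}

int pipelined_cycles(int n, int stages)
{
    /* the first instruction takes `stages` cycles to finish,
     * then one more instruction completes on every following cycle */
    return n > 0 ? stages + (n - 1) : 0;
}
```

For the three-instruction, three-stage (F, D, E) example this gives 9 cycles without pipelining and 5 cycles with it.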


Decode Stage

• The decode stage consists of two phases

– DP: Route the instruction to a functional unit (dispatch)

– DC: Actually decode the instruction at the functional unit (decode)

Execute Stage

• For code writing purposes the execute stage is the most interesting

• On the C62x all instructions execute in a single cycle, but results are delayed by varying amounts

• Furthermore, there is an additional cycle before the results are available, which is known as the pipeline latency

• Common examples of delay and latency

Description     Instructions       Delay    Latency
Single Cycle    All, except ...    0        0 + 1 = 1
Multiply        MPY / SMPY         1        2
Load            LDB/H/W            4        5
Branch          B                  5        6

• As a result of the maximum delay of 5 cycles, there are six execute phases E1–E6


Summary of Pipeline Phases

Stage            Phases                  Cycles
Program Fetch    PG PS PW PR             (1) to (4)
Decode           DP DC                   (5) to (6)
Execute          E1 E2 E3 E4 E5 E6       (7) to (12)

E2–E6 are place holders for delayed results.

[Figure: successive fetch packets staggered by one cycle, each passing through PG PS PW PR DP DC E1, with the point where the pipeline is full marked.]


Sending Code Through the Pipeline

• Since there are eight function units, eight 32-bit instructions are fetched every clock cycle

• The 256-bit total is called a fetch packet

• Recall that there is a 256-bit wide program data bus for this purpose

Fetch packet (8 x 32-bit = 256 bits): I1 I2 I3 I4 I5 I6 I7 I8. The 'C6x fetches eight 32-bit instructions every cycle.

        ; mycode.asm
        I1    .unit
        I2    .unit
        I3    .unit
        I4    .unit
        I5    .unit
        I6    .unit
        I7    .unit
        I8    .unit

Pipeline Code Example

• Consider the sum of products example used earlier

        MVK   .S1   40, A2
loop:   LDH   .D1   *A5++, A0
        LDH   .D1   *A6++, A1
        MPY   .M1   A0, A1, A3
        ADD   .L1   A3, A4, A4
        SUB   .L1   A2, 1, A2
 [A2]   B     .S1   loop
        STH   .D1   A4, *A7

We assume A4 is already cleared


• We have eight instructions, so on the first cycle they are in the PG phase of program fetch

• On the fifth cycle, assuming zero wait state memory, the eight instructions are now at the DP phase

[Figure: pipeline snapshot with the eight instructions MVK, LDH, LDH, MPY, ADD, SUB, B, STH in the decode stage, ahead of execute phases E1–E6.]

• On the next cycle the first instruction moves to the DC (decode) phase, and the other seven wait in line

• On cycle eight MVK has completed execution and LDH begins execution, but requires five total cycles (+ signs)

[Figure: pipeline snapshot at cycle eight. MVK is done, the LDH instructions are entering the execute phases with their remaining cycles marked by + signs, and MPY, ADD, SUB, B, STH are still in the decode stage.]


• On the 10th cycle the second LDH enters E2 and the first LDH is moved over to E3, with MPY at E1

– Note that the MPY requires only one delay, but needs values from memory that the LDH’s bring in

– The LDH’s have not finished yet! What to do?

• A similar problem exists when the ADD instruction reaches E1

– The one cycle delay of MPY means that the addition has started too early as well

[Figure: pipeline snapshot. MVK is done; both LDH instructions, MPY, and ADD are in the execute phases with remaining cycles marked by + signs; SUB, B, and STH are still in the decode stage.]


• For the existing code, we see that at 12 cycles MPY and ADD have both finished, but both LDH’s still have not completed

[Figure: pipeline snapshot at cycle 12. Only STH remains in the decode stage; the LDH instructions are still in the execute phases (+ signs) while MPY, ADD, SUB, and B have already entered execution.]

• To fix the code we need to add instruction delays or NOPs

– To start with we need to add one NOP between MPY and ADD

– We need to add four NOPs between the second LDH and MPY

• Simple NOP insertion rules:

Description     Delay Slots    # of NOPs
Single Cycle    0              0
Multiply        1              1
Load            4              4
Branch          5              5
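The NOP insertion rules can be expressed as a small C helper. This is a sketch of the bookkeeping only: the enum values come from the delay slot table, while the function name nops_needed and the idea of subtracting cycles already spent on other instructions are assumptions made for this illustration.

```c
/* Delay slots per instruction class, taken from the NOP insertion table. */
enum delay_slots {
    DELAY_SINGLE_CYCLE = 0,
    DELAY_MULTIPLY     = 1,
    DELAY_LOAD         = 4,
    DELAY_BRANCH       = 5
};

/* NOP cycles to insert between an instruction and the first instruction
 * that needs its result, when cycles_between other cycles already
 * separate the two. */
int nops_needed(int delay_slots, int cycles_between)
{
    int n = delay_slots - cycles_between;
    return n > 0 ? n : 0;
}
```

For the example code, the second LDH is followed directly by MPY, so four NOPs are needed; the first LDH already has one instruction after it, so those same four NOPs more than cover it.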


• Rather than typing four lines of NOP, we can type a single line

        NOP
        NOP
        NOP
        NOP

is equivalent to

        NOP   4

• The final NOP “fixed code”, including benchmark information is the following:

        MVK   .S1   40,A2         ; 1 cycle
loop:   LDH   .D1   *A5++, A0     ; 1 cycle
        LDH   .D1   *A6++, A1     ; 1 cycle
        NOP         4             ; 4 cycles
        MPY   .M1   A0,A1,A3      ; 1 cycle
        NOP                       ; 1 cycle
        ADD   .L1   A3,A4,A4      ; 1 cycle
        SUB   .L1   A2,1,A2       ; 1 cycle
 [A2]   B     .S1   loop          ; 1 cycle
        NOP         5             ; 5 cycles
        STH   .D1   A4,*A7        ; 1 cycle

Loop = 16 x 40 = 640 cycles; adding 2 cycles for MVK and STH gives 642 cycles

Benchmark = 642 cycles
Best case = 28 cycles

– The NOPs greatly increase the cycle count, but we have not tried any optimization yet

– With full optimization just 28 cycles can be achieved, less than the loop count!
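The 642-cycle benchmark is simple arithmetic on the per-instruction cycle counts, which the following C sketch reproduces. The function name loop_benchmark and its parameters are hypothetical; the cycle counts passed in are the ones annotated on the NOP-fixed code.

```c
/* Total cycles = (sum of cycles in one loop pass) * passes + cycles
 * spent outside the loop. Assumes every pass costs the same. */
int loop_benchmark(const int *body_cycles, int body_len,
                   int iterations, int overhead_cycles)
{
    int per_pass = 0;
    for (int i = 0; i < body_len; i++) {
        per_pass += body_cycles[i];
    }
    return per_pass * iterations + overhead_cycles;
}
```

With the loop body counts {1, 1, 4, 1, 1, 1, 1, 1, 5} (LDH, LDH, NOP 4, MPY, NOP, ADD, SUB, B, NOP 5), 40 passes, and 2 cycles of overhead for MVK and STH, the result is 16 x 40 + 2 = 642.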


Use of Parallel Instructions

• In the pipeline example above all of the instructions flowed serially

• Parallel instructions are given with the double pipe symbol ||

• Up to eight instructions can be put in parallel since there are eight functional units

• A partially parallel solution is given below:

Serial:

        B     .S1
        MVK   .S1
        ADD   .L1
        ADD   .L1
        MPY   .M1
        MPY   .M1
        LDW   .D1
        LDB   .D1

Partially parallel:

        B     .S1
     || MVK   .S2
        ADD   .L1
     || ADD   .L2
     || MPY   .M1
        MPY   .M1
     || LDW   .D1
     || LDB   .D2

• When instructions process in parallel they are called execute packets, and are so denoted in the pipeline diagrams

• Each fetch packet can contain multiple execute packets


• At the beginning of the decode phase (dispatch), the above example code has three execute packets entering DC

• Each execute packet enters E1 and the individual instructions execute simultaneously until completed, with their respective delays

[Figure: the three execute packets (B, MVK), (ADD, ADD, MPY), and (MPY, LDW, LDB) in the decode stage, ahead of execute phases E1–E6.]


• At cycle eight we have packet two at E1 and part of packet one is complete

[Figure: pipeline snapshot at cycle eight. The packet (MPY, LDW, LDB) is still in the decode stage, the packet (ADD, ADD, MPY) is at E1, and from packet one MVK is complete while B is still working through its delay slots (+ signs).]

• Parallel instructions give a great performance increase

• For the code example we have been considering it is possible to go fully parallel since there are only eight instructions

• To do so will require full utilization of both sides of the CPU


• The fully parallel code

        B     .S1
     || MVK   .S2
     || ADD   .L1
     || ADD   .L2
     || MPY   .M1
     || MPY   .M2
     || LDW   .D1
     || LDB   .D2

• At the start of execution (seventh cycle) we have

[Figure: all eight instructions (B, MVK, ADD, ADD, MPY, MPY, LDW, LDB) enter E1 together as one execute packet; + signs mark the remaining delay cycles of each instruction.]
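How a fetch packet splits into execute packets can be counted mechanically from the || markers. The C sketch below models only the assembly-source notation (a flag per line saying whether it starts with ||); the function name and the flag array are assumptions for illustration, not a description of the hardware encoding.

```c
#include <stdbool.h>
#include <stddef.h>

/* parallel[i] is true when source line i starts with "||", that is,
 * when instruction i issues in the same cycle as instruction i-1.
 * Every line without "||" starts a new execute packet. */
int count_execute_packets(const bool *parallel, size_t n)
{
    int packets = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !parallel[i]) {
            packets++;
        }
    }
    return packets;
}
```

Applied to the three versions of the eight-instruction example, this gives eight execute packets for the serial code, three for the partially parallel code, and one for the fully parallel code.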


• This sort of efficiency requires smart coding

• Two not so obvious requirements are:

– Properly filling delay slots

– Proper use of parallel instructions

• The assembly optimizer (part of linear assembly) and the optimizing C compiler significantly simplify this process

C67x Exceptions

• With the floating point capability comes additional delay slot requirements and latency

• There is also functional unit latency beyond one cycle, which occurs in some double precision (DP) instructions

C67x latencies, written as (functional unit latency.instruction latency):

.S Unit    ABSSP   (1.1)    ABSDP   (1.2)    CMPEQSP (1.1)    CMPGTSP (1.1)
           CMPLTSP (1.2)    CMPEQDP (1.3)    CMPGTDP (1.3)    CMPLTDP (1.2)
           RCPSP   (1.1)    RCPDP   (1.2)    RSQRSP  (1.1)    RSQRDP  (1.2)
           SPDP    (1.2)

.M Unit    MPYSP   (1.4)    MPYDP   (4.10)   MPYI    (4.9)    MPYID   (4.10)

.L Unit    ADDSP   (1.3)    ADDDP   (2.7)    DPINT   (1.4)    DPSP    (1.4)
           INTDP   (1.5)    INTDPU  (1.5)    INTSP   (1.4)    INTSPU  (1.4)
           SPINT   (1.4)    SPTRUNC (1.4)    SUBSP   (1.4)    SUBDP   (2.7)

.D Unit    ADDAD   (1.1)    LDDW    (1.5)

e.g., MPYSP (1.4) means a single precision float multiply requires a single functional unit latency and three delay slots.


C Programming

This section will focus on some of the uses of the C6x development tools and some of the compiler, assembler, and linker settings.

• As stated at the beginning of this chapter, the use of C code can achieve from 80–100% the efficiency of hand assembly

– Further optimization, which is what is discussed in this section, will likely be required, but it is safe to say that C code is a good starting point for algorithm development

• Recall the basic code building tool layout is:

[Figure: code building tools. Editor → .c/.cpp → Compiler → .asm → Asm → .obj → Linker (with Link.cmd) → .out; linear assembly (.sa) passes through the Asm Optimizer to produce .asm]

• When the compiler tools are coupled with Code Composer Studio (CCS) we have a complete development environment:

[Figure: Code Composer Studio. Edit → Compile / Asm Opto → Asm → Link → Debug, targeting a DSP board (DSK, EVM, third party, XDS) or the simulator (SIM), with Probe In, Probe Out, Graphs, and Profiling. Studio includes: code generation tools; BIOS (real-time kernel, real-time analysis (RTA), BIOS library); simulator; plug-ins (C++, VB, Java); RTDX]

• The output code can be controlled with a very large number of options that span the compiler, assembler, and linker

[Figure: file.c → Compile → Asm → Link → file.out (old CCS interface shown)]

The build options indicate how the output file should be constructed:

– Which optimizations

– Where to find files/libs

– ‘C62x or ‘C67x

– How to link files

– Etc.


Debug options

Options      Description                                     CC Tab
-mv6700      Generate ‘C6700 code (‘C6200 is default)        Compiler
-fr <dir>    Directory containing source files               Compiler
-g           Enables src-level symbolic debugging            Comp/Asm
-s           Interlist C statements into assembly listing    Compiler
-k           Keep assembly file                              Compiler
-mg          Enables minimum debug to allow profiling        Compiler
-mt          No aliasing used                                Compiler
-o3          Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm          Combine all C source files before compile       Compiler

Typical debug setting: -gs

• In all, there are about five pages of options in the compiler user manual

Optimize Options

• When first debugging code we typically use -gs (above); later, optimization can be turned on, e.g., -o3

Options      Description                                     CC Tab
-ms          Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
-oi0         Disables automatic function inlining            Compiler

Typical speed optimization setting: -k -mgt -o3 -pm


Code Size

Typical size optimization setting (the speed setting plus -ms0 and -oi0): -k -mgt -ms0 -o3 -oi0 -pm

Assembler Options

Options      Description                                     CC Tab
-g           Enables src-level symbolic debugging            Comp/Asm
-l           Create assembler listing file (lowercase L)     Assembler
-s           Retain asm symbols for debugging                Assembler

Typical assembler setting: -gls


Linker Options

Options      Description                                     CC Tab
-o <file>    Output file name                                Linker
-m <file>    Map file name                                   Linker
-c           Auto-initialize global/static C variables       Linker

Summary of Popular Options

Options      Description                                     Options Tab
-mv6700      Generate ‘C6700 code (‘C6200 is default)        Compiler
-fr <dir>    Directory containing source files               Compiler
-g           Enables src-level symbolic debugging            Comp/Asm
-s           Interlist C statements into assembly listing    Compiler
-k           Keep assembly file                              Compiler
-mg          Enables minimum debug to allow profiling        Compiler
-mt          No aliasing used                                Compiler
-o3          Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm          Combine all C source files before compile       Compiler
-ms          Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
-oi0         Disables automatic function inlining            Compiler
-l           Create assembler listing file (lowercase L)     Assembler
-s           Retain asm symbols for debugging                Assembler
-o <file>    Output file name                                Linker
-m <file>    Map file name                                   Linker
-c           Auto-init C variables (-cr turns off autoinit)  Linker

(The slide groups these options as debug, speed opto, and size opto.)


• A block diagram depicting what happens when a project build takes place is shown below:

[Figure: build flow. file.c → Compiler (with C Optimizer, -o) → file.asm (-s) → Assembler → file.obj and file.lst (-al) → Linker (-z, with the run-time library containing boot.c) → file.out (-o) and file.map (-m)]

Embedded Systems with C

Consider software systems development in terms of the C6x

• An embedded system, for the purposes of C6x development, consists of:

– Program (algorithm and data structures)

– Initialization

– Memory management

• The program part seems pretty clear

• The initialization and memory management parts are beyond what you find in a typical host programming environment, such as Visual C++ on a PC


• From a C programming perspective on a host, once the system resets and initializes, we only deal with the program

– In the embedded world we have to also deal with initialization

– We have more flexibility this way, and we only need to include the hardware and software really needed to get the job done

– Using only the hardware and software that is needed also provides a cost savings

• The reset operation

– stops the processor,

– brings some registers back to a preset state,

– sets the program counter (PC) to zero, and

– begins running code (address 0)

[Figure: reset pin → reset vector → Initialize System → Program. The example program below is annotated with the basic sections of a C file: code, global variables, initial values, local variables, and dynamic variables. The scanf and malloc calls are schematic.]

    short m = 10;
    short b = 2;
    short y = 0;

    main()
    {
        short x = 0;
        scanf(x);
        malloc(y);
        y = m * x;
        y = y + b;
    }

Initialization Under C

• The C compiler run-time support library contains the routine boot.c

– Note, global variables are optionally initialized through a build option (the linker's -c switch)

[Figure: reset pin → reset vector → Initialize System (boot.c) → _main, shown with the same example program]

boot.c performs three steps:

1. Initialize pointers (discussed in mod 11): stack, heap, global/static

2. Initialize global and static variables

3. Call _main


• Following the actual hardware reset, the software reset begins in vectors.asm with a branch to c_int00

– Note that c_int00 is defined in the C library

– Note also that when using CCS and debugging the target, e.g., the DSK, some of this functionality is automatically taken care of

• NOPs are added to fill the fetch packet

• Each interrupt vector is aligned on a fetch packet boundary

• Other interrupts, which are typically also part of this file, will be discussed later

[Figure: reset pin (address 0) → reset vector (vectors.asm) → boot.c at _c_int00 (1. init stack, heap, and global pointers; 2. init variables; 3. call _main) → _main]

vectors.asm (one fetch packet):

            .global _c_int00
            .sect   "vectors"

            b       _c_int00
            nop     5
            nop
            nop
            nop
            nop
            nop
            nop


Compiler Sections

• The system software is broken into modules of code and data known as sections

• The sections as found in a typical C program are shown below:

[Figure: hardware and software. Hardware: the ‘C6x with memory (ROM, RAM, peripherals). Software: the program, divided into code (vectors (reset), system init (boot.c), C code (main.c)) and data (init values (global), variables (global), stack (local), heap (dynamic))]

• The above names seem reasonable, but the compiler uses names associated with the Common Object File Format (COFF) developed many years ago by AT&T for use with C and Unix

• The real names used by the C6x compiler tools are the following:


    Vectors (reset)          your choice
    System Init (boot.c)     .text
    C Code (main.c)          .text
    Init Values (global)     .cinit
    Variables (global)       .bss
    Stack (local)            .stack
    Heap (dynamic)           .sysmem

– The reset section can be any name, but vectors is reasonable

• The complete list of C compiler sections is:

Section Name    Description
.text           Code
.bss            Global and static variables
.cinit          Initial values for global/static vars
.stack          Stack (local variables)
.sysmem         Memory for malloc fcns (heap)
.switch         Tables for switch instructions
.const          Global and static string literals
.far            Global and statics declared far
.cio            Buffers for stdio functions


• A possible section placement solution for the C6201:

    EPROM (CE0):             .text, .switch, .const, .cinit
    SDRAM (CE2):             .sysmem, .far, .cio
    Data RAM (8000_0000):    .bss, .stack

  Program RAM at 140_0000 is also shown in the figure. Many other solutions are possible; what about the C67xx?

• A more generalized way of describing the memory sections is to use the terms initialized and uninitialized as opposed to ROM and RAM, i.e.,

Section Name    Description                              Memory Type
.text           Code                                     initialized
.bss            Global and static variables              uninitialized
.cinit          Initial values for global/static vars    initialized
.switch         Tables for switch instructions           initialized
.const          Global and static string literals        initialized
.far            Global and statics declared far          uninitialized
.stack          Stack (local variables)                  uninitialized
.sysmem         Memory for malloc fcns (heap)            uninitialized
.cio            Buffers for stdio functions              uninitialized


Memory Management

• We control the physical mapping of memory to program and data sections via a linker command file

[Figure: the .obj files and the .cmd file feed the Linker, which produces the .out file (-o) and the .map file (-m); the .cmd file describes the ‘C6x memory (ROM, RAM, peripherals)]

• The linker command file .cmd has two parts

    MEMORY
    {
        Memory Description
    }

    SECTIONS
    {
        Binding Code/Data Sections to Memory
    }


• In the memory description portion we create a description of both processor and system resources

• Each line is of the form

    name: origin = address, length = size-in-bytes

– Note that we can shorten origin to simply o or org, and length to simply len or l, i.e., consider the memory portion of the C6711 command file we have used thus far

    MEMORY
    {
        vecs:   org = 00000000h , len = 220h
        IRAM:   org = 00000220h , len = 0000fdc0h
        CE0:    org = 80000000h , len = 01000000h
        FLASH:  org = 90000000h , len = 00020000h
    }

– Quantities may be specified in hex or decimal, but hex is preferred, e.g., 100h or 0x100

• Note: The vectors section must come first, so that following reset, initialization can occur

• The vecs space must be at least 200 hex bytes long since on the C6x there are a total of 16 interrupts, each requiring one fetch packet of eight 32-bit instructions, i.e., 32 bytes ($16 \times 32 = 200\mathrm{h}$)

– Here the 220h leaves room for 32 bytes (20h) more, i.e., one additional fetch packet

– There will be more discussion of interrupts later

• To understand the rest of the memory space assignments, recall the C6x11 memory map


• On the C6x13 DSK we frequently place all of the sections, program and data, in the internal RAM (IRAM)

    SECTIONS
    {
        vectors :> vecs
        .text   :> IRAM
        .bss    :> IRAM
        .cinit  :> IRAM
        .stack  :> IRAM
        .sysmem :> SDRAM
        .const  :> IRAM
        .switch :> IRAM
        .far    :> SDRAM
        .cio    :> SDRAM
    }

• Note some sections are placed in the SDRAM of CE0

C67xx Memory Map

    Start address    Contents
    0000_0000        64K x 8 Internal (L2)
    0180_0000        On-chip Peripherals
    8000_0000        256M x 8 External 0
    9000_0000        256M x 8 External 1
    A000_0000        256M x 8 External 2
    B000_0000        256M x 8 External 3
    (map ends at FFFF_FFFF)

[Figure: the CPU with a 4K program cache, a 4K data cache, and 64K unified RAM]

– The 6713 DSK has 264 kB starting at 0000_0000

– The 6713 DSK has 16M at 8000_0000


Linker Options

• In the third tab of the project options dialog box, we set linker options

• The -o specifies the executable file, e.g., norm_sq_c.out

• The -m creates a map file which shows in detail how the linker has located everything in memory


• The -c option, run-time autoinitialization, invokes boot.c so that variables are autoinitialized, that is, initial values in the .cinit section are copied into the .bss section

– We can turn off autoinit by using -cr

• -stack sets the size of the stack, i.e., the .stack section; the default is 0x400

• -heap sets the size of the heap, which is actually the .sysmem section; the default is 0x400

• -q suppresses the banner display and -w has the linker exhaustively read all libraries


Calling Assembly with C

Being able to call assembly routines from C is a powerful capability of the compiler tools. In this section we explore the main points.

• For more detail refer to spru187t or newer, TMS320C6000 Optimizing Compiler v 7.3: User's Guide

– Sections 7.4 & 7.5

• To begin with, all C labels are accessed in the assembly file with an underscore (_) character, e.g., sum --> _sum

• To call an assembly routine requires that we follow a few simple rules

• Things we would like to do are:

– Pass arguments in

– Return results

– Access C’s global variables in assembly

• More advanced issues, not dealt with here, are use of and access to the stack and optimal access to global variables

[Figure: main( ) { } in C calling an assembly routine labeled _asmFunction:, which returns with a branch (b)]


• To find a function we have a global (inter-file) reference

• To pass variables in, take a return value, and return to the par-ent code flow, we use a set of argument/register passing rules

Child.C

int child(int a, int b){

return(a + b);}

Child.CChild.C

int child(int a, int b)int child(int a, int b){{

return(a + b);return(a + b);}}

Child.ASMChild.ASM

.global.global _child_child

_child: _child:

; end of subroutine; end of subroutine

�� UseUse __underscoreunderscore�� Make label Make label globalglobal

Parent.C

int child(int, int);int x = 7, y, w = 3;

void main (void){

y = child(x, 5);}

Parent.CParent.C

int child(int, int);int child(int, int);int x = 7, y, w = 3;int x = 7, y, w = 3;

void main (void)void main (void){ {

y = child(x, 5);y = child(x, 5);}}

...assembly code...

Register    A side          B side
3                           ret addr
4           arg1/r_val      arg2
6           arg3            arg4
8           arg5            arg6
10          arg7            arg8
12          arg9            arg10

– Arguments are passed in registers as shown

– Return value in A4 and return to address in B3
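The register assignment can be written out as a small lookup. This is only a sketch for checking which register holds which argument; the helper name arg_register is hypothetical and is not part of any TI tool or library.

```c
#include <stddef.h>

/* Register that carries the n-th argument (n = 1..10) under the
   convention shown above: odd-numbered arguments use the A side,
   even-numbered arguments the B side, in even registers 4..12.
   arg_register is a hypothetical helper used only for illustration. */
const char *arg_register(int n)
{
    static const char *regs[10] = {
        "A4", "B4", "A6", "B6", "A8",
        "B8", "A10", "B10", "A12", "B12"
    };
    if (n < 1 || n > 10)
        return NULL;    /* not a register argument; extra arguments go on the stack */
    return regs[n - 1];
}
```

For the call child(x, 5), x arrives in A4 (arg_register(1)) and the constant 5 in B4 (arg_register(2)); the sum is handed back in A4.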


• A simple example

Child.C

int child(int a, int b)
{
    return(a + b);
}

Child.ASM

        .global _child
_child:
        add     a4, b4, a4      ; arguments in a4 and b4, return/result in a4
        b       b3
        nop     5
        ; end of subroutine

Parent.C

int child(int, int);
int x = 7, y, w = 3;

void main (void)
{
    y = child(x, 5);
}

• Accessing C global variables in assembly:

Parent.C

int child2(int, int);
int x = 7, y, w = 3;

void main (void)
{
    y = child2(x, 5);
}

Child2.ASM

        .global _child2
        .global _w

_child2:
        mvkl    _w, A1
        mvkh    _w, A1
        ldw     *A1, A0

– Declare global labels

– Use _underscore when accessing C variables (labels)

– Advantages of declaring variables in C?

– Declaring in C is easier

– Compiler does variable init ( int w = 3 )
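The mvkl/mvkh pair is needed because a 32-bit address does not fit in one 16-bit constant field. The C model below sketches what each instruction does to the destination register. The function names mirror the mnemonics but are illustration only, and load_address is a hypothetical wrapper.

```c
#include <stdint.h>

/* Model of mvkl: the low 16 bits of cst go into the register,
   sign-extended to 32 bits. */
uint32_t mvkl(uint32_t cst)
{
    uint32_t lo = cst & 0xFFFFu;
    return (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;
}

/* Model of mvkh: the upper 16 bits of cst replace the upper half of
   the register; the lower half of the register is left unchanged. */
uint32_t mvkh(uint32_t cst, uint32_t reg)
{
    return (cst & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* Hypothetical wrapper: the two-instruction sequence used in Child2.ASM. */
uint32_t load_address(uint32_t addr)
{
    uint32_t a1 = mvkl(addr);   /* mvkl _w, A1 */
    a1 = mvkh(addr, a1);        /* mvkh _w, A1 */
    return a1;
}
```

With an address such as 0x0000F000, mvkl alone leaves 0xFFFFF000 in the register because of the sign extension; the following mvkh restores the correct upper half. The order matters: mvkl must come first, since it overwrites all 32 bits.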


• Registers A10–A15 and B10–B15 must be saved/preserved

• There is actually a bit more to this (see below), but more later

[Figure: register map for the A and B sides, registers 0–15. The argument, return value (arg1/r_val), and return address (ret addr) registers are marked as before, and DP and SP are marked on the B side. Registers A10–A15 and B10–B15 must be saved and restored if you use them in assembly. The stack holds extra arguments and the prior stack contents.]


Linear Assembly and Assembly Optimization

Being able to call highly efficient linear assembly routines from C is another powerful capability of the compiler tools. In this section we explore the main points.

• Linear assembly has the ease of C programming (almost) and the efficiency approaching that of assembly, but without too many headaches, as the tools do a lot of the work

• The development flow for linear assembly modules

• Features of linear assembly for subroutines include:

– Pass parameters

– Return results

– Use symbolic variable names

– Ignore pipeline issues (delay slots)

– Automatically return to the calling function

– Call other functions written in C or linear assembly

[Figure: development flow. A text editor produces the .c/.cpp, .sa, and .asm source files; the compiler processes .c/.cpp files and the assembly optimizer processes .sa files, both producing .asm; the assembler produces .obj files; the linker, using Link.cmd, produces the .out file]


• Consider a simple dot product example in C

int DotP(short *m, short *n, int count)
{
    int i;
    int product;
    int sum = 0;

    for (i=0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}

• Rewriting in linear assembly (typically a .sa file) we have

_dotp:  zero    sum

loop:   ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum

        sub     count, 1, count
 [count] b      loop

– Assembly directives are required :(

– Functional unit management is not needed :)

– Register management not needed :)


• A special directive .cproc is used to declare the passed variables, e.g.,

.cproc arg1, arg2, arg3

• The directive .endproc declares the end of the routine

• Symbolic names can be used throughout, which is very nice

• The completed dot product example

_dotp:  .cproc  pm, pn, count

        .reg    m, n, prod, sum

        zero    sum

loop:
        ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum

        sub     count, 1, count
 [count] b      loop

        .return sum

        .endproc

• The above performs the function

short dotp(short *a, short *x, int count)
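A line-by-line C model of the linear assembly loop can help when checking results on a host machine. The sketch below is an assumption-laden mirror of the listing, not TI-generated code: the name dotp_model is hypothetical, and it returns int (like DotP) rather than the short shown in the prototype above.

```c
/* Host-side C model of the linear assembly dot product.
   Each statement corresponds to one line of the .sa listing. */
int dotp_model(const short *pm, const short *pn, int count)
{
    int sum = 0;                /* zero  sum             */
    do {
        short m = *pm++;        /* ldh   *pm++, m        */
        short n = *pn++;        /* ldh   *pn++, n        */
        int prod = m * n;       /* mpy   m, n, prod      */
        sum += prod;            /* add   prod, sum, sum  */
        count -= 1;             /* sub   count, 1, count */
    } while (count != 0);       /* [count] b loop        */
    return sum;                 /* .return sum           */
}
```

Because the test on count comes after the loop body, the model (like the listing) always runs the body at least once, so it should only be called with count of 1 or more.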


Calling from Linear Assembly

• Linear assembly can also call another subroutine

_dotp:      .cproc
            .reg    val

            mvk     5, val

            .call   val = _testcall(val)

            .return val

            .endproc

_testcall:  .cproc  input

            add     input, 5, input

            .return input

            .endproc

Linear Assembly Compiler Settings

• Specific assembly optimizer options are:

– Use -g -s for algorithm verification

– Use -k -mgt -o3 -pm for software pipelining

Example: Vector Norm Squared

In this example we will be computing the squared length of a vector using 16-bit (short) signed numbers. In mathematical terms we are finding


$$\|\mathbf{A}\|^2 = \sum_{n=1}^{N} A_n^2 \qquad (3.1)$$

where

$$\mathbf{A} = \begin{bmatrix} A_1 & \cdots & A_N \end{bmatrix} \qquad (3.2)$$

is an $N$-dimensional vector (column or row vector).

• The solution will be obtained in three different ways:

– Conventional C programming

– C6x assembly

– C6x linear assembly

• Optimization is not a concern at this point

• The focus here is to see, by way of a simple example, how to call a C routine from C (obvious), how to call an assembly routine from C, and how to write and call a simple linear assembly routine from C

C Version

• We implement this simple routine in C using a declared vector length N and vector contents in the array A

• The C source, which includes the called function norm (whose result is stored in norm_sq), is given below


/******************************************************
Vector norm-squared routine in C
******************************************************/

#include <stdio.h>

short norm(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;

    norm_sq = norm(A, N);
    printf("Vector norm squared = %d", norm_sq);
    return 0;
}

short norm(short* V, int n)
{
    int i;
    short out = 0;

    for(i=0; i<n; i++)
    {
        out += V[i]*V[i];
    }
    return out;
}

• The expected answer is 1 + 4 + 9 + 36 + 49 = 99
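The routine above accumulates into a short, which is fine for this five-element test vector but limits the result to 16 bits. As a hedged aside, the sketch below shows the same loop with a 32-bit accumulator; the names norm_int and exceeds_short are hypothetical and do not appear in the course code.

```c
#include <limits.h>

/* Same loop as norm(), but the accumulator and return value are int.
   norm_int is a hypothetical variant used only for comparison. */
int norm_int(const short *V, int n)
{
    int out = 0;
    int i;
    for (i = 0; i < n; i++) {
        out += V[i] * V[i];     /* each product is formed as an int */
    }
    return out;
}

/* Nonzero when a result would not fit in the short returned by norm(). */
int exceeds_short(int value)
{
    return value > SHRT_MAX;
}
```

For A = {1, 2, 3, 6, 7} both versions give 99. A vector as short as {200, 200} already has a squared norm of 80000, which exceeds_short flags as too large for the short return type.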


Running in CCS 5.1: The C code is put into a project for running on the OMAP-L138 or the simulator as Norm_Squared and debugged and profiled

• From the watch window we obtain the following when we step the program to the last line


• Enable the clock under the run menu to profile

• The cycle count at the function call level for the norm_sq function call is 152 in the simulator; it was not tried on hardware

[Figure: CCS 5.1 screenshot. Callouts mark the starting address of the array in memory, other active windows in CCS 5.1, and the time from the 1st to the 2nd breakpoint]


Assembly Version

• The parent C routine is the following:

/******************************************************
Vector norm-squared routine in assembly
******************************************************/

#include <stdio.h>

short norm_asm(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;

    norm_sq = norm_asm(A, N);

    printf("Vector norm squared = %d", norm_sq);

    return 0;
}

• From just the C source it is not obvious that the function prototype for norm_asm actually refers to an assembly routine

• The assembly routine is the following:

; Vector norm in assembly

        .global _norm_asm       ;reference name from C

_norm_asm:
        mv   .l2  B4, B1        ;put loop ctr. in a proper reg.
        zero .l1  A2            ;initialize accumulator

loop:
        ldh  .d1  *A4++, A1     ;ld vals pointed to by A4 in A1
        nop  4                  ;required ldh delay
        mpy  .m1  A1, A1, A3    ;square each value
        nop                     ;required mpy delay
        add  .l1  A3, A2, A2    ;accumulate the squared values
        sub  .l2  B1, 1, B1     ;decrement the loop counter

 [B1]   b    .s2  loop          ;branch until B1 == 0
        nop  5                  ;required branch delay

        mv   .d1  A2, A4        ;move result to return reg. A4
        b    .s2  B3            ;branch back to address at B3
        nop  5                  ;required branch delay

• Note that each line of assembly code takes the following form:

label: || [cond] instruction .unit operand ;comment

– Labels must start in the first column, may be up to 200 characters long, and must begin with a letter or an underscore (as in _norm_asm); the colon is optional

• When accessing from C the register calling convention is observed, that is, when we enter the function norm_asm(arg1, arg2),

– arg1 is a pointer to (the address of) the first value of the array A, and is stored in register A4

– arg2 is an int value, i.e., a full 32-bit signed integer, and is stored in register B4

• Since arg2 is the array dimension, we will use it as the loop counter starting value


• B4 is not a suitable register for loop control (on the C62x/C67x only A1, A2, B0, B1, and B2 can be used as condition registers), so we move (mv) the value stored in B4, in this case to B1

• We initialize the accumulator register, A2, using the zero instruction; alternatively mvk .s1 0,A2 works as well

• Starting at the top of the loop section, we begin by loading (ldh, since we only have 16 bits) the values pointed to by A4 into working register A1

– The pointer A4 is post-incremented by just 2 bytes (one 16-bit address step) following the load operation

– The default increment size is controlled by the data type, here it is halfwords (16 bits)

– Various pre- and post-increment options are available, including the offset amount, and whether it modifies the original pointer or not (see the table below)


• To satisfy the pipeline delays, we follow the ldh with four NOPs (nop 4)

• Next, we perform a 16-bit multiply (MPY), actually a squaring; the result is stored in A3

• To satisfy the pipeline we follow the MPY with one NOP

• We accumulate the result into register A2 using ADD

• Next, we decrement the loop counter B1 using SUB and branch to loop subject to the state of B1

• The branch is followed by five NOPs to satisfy the pipeline delay

Table 3.1: Pointer incrementing methods; A1 shown (a)

Syntax          Pointer changed    Description
*A1             no                 Basic pointer
*+A1[disp]      no                 +Pre-offset
*-A1[disp]      no                 -Pre-offset
*++A1[disp]     yes                Pre-increment
*--A1[disp]     yes                Pre-decrement
*A1++[disp]     yes                Post-increment
*A1--[disp]     yes                Post-decrement

a. If [disp] is omitted the displacement is one unit of the data type; otherwise the displacement is by integer multiples of Word, Halfword, or Byte. If (disp) is used instead of [disp] the displacement is (disp) bytes.
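For readers more comfortable with C pointers, the forms in Table 3.1 have close C analogues. The sketch below shows three of them for halfword (short) data; the function names are hypothetical, and the pointer variable plays the role of register A1.

```c
/* *+A1[2]: read two elements ahead; the pointer itself is not changed. */
short demo_pre_offset(const short *a1)
{
    return a1[2];
}

/* *++A1[1]: advance the pointer by one element first, then read. */
short demo_pre_increment(const short **a1)
{
    *a1 += 1;
    return **a1;
}

/* *A1++[1]: read first, then advance the pointer by one element. */
short demo_post_increment(const short **a1)
{
    short v = **a1;
    *a1 += 1;
    return v;
}
```

As in the table, C pointer arithmetic counts in units of the data type, so adding 1 to a short pointer moves the address by 2 bytes. The ldh *A4++, A1 in the norm routine corresponds to the post-increment case.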


• Finally, the squared and accumulated value held in A2 is moved to the return register A4

• To return to the C module, we must branch to the address saved in B3

• If we had needed to use registers A10–A15 or B10–B15, we would have had to save and restore them accordingly

• The final numerical result is again 99

Running in CCS 2: The C code is put into a project for running on the 6711 DSK as norm_sq_asm.pjt, and debugged and profiled

• The profiling results of the new norm_sq function are:

• With the assembly routine the cycle count is reduced to 91, which as a ratio makes the C routine 152/91 = 1.67 times slower, assuming no optimization

• With optimization the tables are turned and the C is faster by the factor ?
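A rough cycle budget for the hand-written routine can be tallied directly from the listing. The sketch below assumes every instruction issues in one cycle, that nop N costs N cycles, and that there are no memory stalls; these are simplifying assumptions, and the function names are hypothetical.

```c
/* Estimated cycles for norm_asm with n vector elements, counted
   from the listing: one cycle per instruction plus explicit NOPs. */
int norm_asm_cycles(int n)
{
    int setup  = 1 + 1;         /* mv, zero            */
    int load   = 1 + 4;         /* ldh + nop 4         */
    int mult   = 1 + 1;         /* mpy + nop           */
    int accum  = 1;             /* add                 */
    int count  = 1;             /* sub                 */
    int branch = 1 + 5;         /* b loop + nop 5      */
    int finish = 1 + 1 + 5;     /* mv, b B3, nop 5     */
    return setup + n * (load + mult + accum + count + branch) + finish;
}

/* Ratio of two cycle counts, e.g. the C version over the assembly version. */
double slowdown(int c_cycles, int asm_cycles)
{
    return (double)c_cycles / (double)asm_cycles;
}
```

For n = 5 this gives 2 + 5(15) + 7 = 84 cycles, somewhat below the measured 91; the difference is not accounted for by this simple model. Note that 11 of the 15 cycles per iteration are NOPs, which is what software pipelining aims to reclaim. The ratio slowdown(152, 91) reproduces the quoted 1.67.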


The Linear Assembly Version

• The parent C calling routine is again of the form:

/******************************************************
Vector norm-squared routine in linear assembly
******************************************************/

#include <stdio.h>

short norm_sa(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;

    norm_sq = norm_sa(A, N);
    printf("Vector norm squared = %d", norm_sq);
    return 0;
}

• The linear assembly routine is the following:

; Vector norm in linear assembly

        .global _norm_sa        ;reference name from C

_norm_sa: .cproc A, N           ;input variables
        .reg    m, sum          ;working variables
        zero    sum             ;zero the accumulator

loop:
        ldh     *A++, m         ;load values pointed to by A
        mpy     m, m, m         ;square each value
        add     m, sum, sum     ;accumulate the squared values
        sub     N, 1, N         ;decrement the loop counter

 [N]    b       loop            ;branch until N == 0

        .return sum             ;return value

        .endproc                ;end linear assembly routine

• The function/subroutine is declared .global just as in the assembly case

• Following the assembly label _norm_sa, we begin the linear assembly routine with .cproc followed by the input variables (may be dummy names)

• Working variables are declared using .reg

• The accumulator is cleared using the assembler instruction zero

• A loop is then set up in a similar fashion to the pure assembly version, except now the precise management of the registers is left to the assembly optimizer

• There is also no need to include NOPs

• As before the final answer is 99

Running in CCS 2: The C code is put into a project for running on the 6711 DSK as norm_sq_sa.pjt, and debugged and profiled


• The profiling results of the new norm_sq function are:

• This result is very similar to the assembly result (on the 6713: 90 for .sa and 91 for .asm)

• With say -o3 optimization the linear assembly is faster by the ratio ?

• When debugging a linear assembly routine it is best to use the mixed mode to display assembly interlisted with C and/or linear assembly

• The registers window can then be used to watch what is happening when the code is stepped
