1Kurt Keutzer
Lecture 11: Interfaces, I/O and
Configurable Processors
Professor Kurt Keutzer
Computer Science 252
Spring 2000
With contributions from Prof. David Patterson
Niraj Shah, Scott Weber
Embedded Systems vs. General Purpose Computing - 1
Embedded System
• Runs a few applications often
known at design time
• Not end-user programmable
• Operates under fixed run-time
constraints; additional
performance may not be
useful/valuable
General purpose computing
• Intended to run a fully general
set of applications
• End-user programmable
• Faster is always better
Embedded Systems vs. General Purpose Computing - 2
Embedded System
Differentiating features:
power
cost
speed (must be predictable)
General purpose computing
Differentiating features:
speed (need not be fully predictable)
speed
did we mention speed?
cost (largest component power)
Configurability and Embedded Systems
Advantages of configuration:
• Pay (in power, design time, area) only for what you use
• Gain additional performance by adding features tailored to
your application
Particularly for embedded systems:
• Principally in embedded controller microprocessor applications
• Some use in DSP
What to Configure?
What parts of the microcontroller/microprocessor system to
configure?
Easy answers:
• Memory and cache sizes - get precisely the sizes your
application needs
• Register file sizes
• Interrupt handling and addresses
Harder answers:
• Peripherals
• Instructions
But first we need more context
I/O Interrupts
An I/O interrupt is handled just like an exception, except:
• An I/O interrupt is asynchronous
• Further information needs to be conveyed
An I/O interrupt is asynchronous with respect to instruction execution:
• It is not associated with any instruction
• It does not prevent any instruction from completing
• You can pick your own convenient point to take the interrupt
An I/O interrupt is more complicated than an exception:
• It needs to convey the identity of the device generating the interrupt
• Interrupt requests can have different urgencies, so they need to be prioritized
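The two extra requirements above (device identity and prioritization) can be sketched in a few lines. This is a minimal software model, not any particular interrupt controller: the class names, the device names, and the "lower number = more urgent" convention are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class InterruptRequest:
    priority: int                        # assumed convention: lower number = more urgent
    seq: int                             # arrival order, breaks ties at equal priority
    device: str = field(compare=False)   # identity of the requesting device


class InterruptController:
    """Hypothetical controller that hands out pending requests by urgency."""

    def __init__(self):
        self._pending = []
        self._seq = 0

    def raise_irq(self, device, priority):
        heapq.heappush(self._pending, InterruptRequest(priority, self._seq, device))
        self._seq += 1

    def next_irq(self):
        """Return the most urgent pending request, or None if nothing is pending."""
        return heapq.heappop(self._pending) if self._pending else None


ctrl = InterruptController()
ctrl.raise_irq("uart", priority=2)
ctrl.raise_irq("disk", priority=1)
ctrl.raise_irq("timer", priority=0)
ctrl.raise_irq("keyboard", priority=2)

order = []
while (irq := ctrl.next_irq()) is not None:
    order.append(irq.device)
print(order)
```

The request carries the device name so the handler knows whom to service; the heap keeps the most urgent request at the front regardless of arrival order.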
Example: Device Interrupt
User program:
  add  $r1,$r2,$r3
  subi $r4,$r1,#4
  slli $r4,$r4,#2
  Hiccup(!)   <- external interrupt arrives here
  lw   $r2,0($r4)
  lw   $r3,4($r4)
  add  $r2,$r2,$r3
  sw   8($r4),$r2
On the external interrupt: PC saved, all ints disabled, supervisor mode
“Interrupt Handler”:
  Raise priority
  Reenable All Ints
  Save registers
  lw   $r1,20($r0)
  lw   $r2,0($r1)
  addi $r3,$r0,#5
  sw   $r3,0($r1)
  Restore registers
  Clear current Int
  Disable All Ints
  Restore priority
  RTI
On RTI: restore PC, user mode
Advantage: user program progress is only halted during actual transfer
Disadvantage: special hardware is needed to:
• Cause an interrupt (I/O device)
• Detect an interrupt (processor)
• Save the proper states to resume after the interrupt (processor)
Interrupt Driven Data Transfer
[Figure: CPU, Memory, and IOC with its device. The user program in memory (add, sub, and, or, nop) runs until (1) I/O interrupt, (2) save PC, (3) interrupt service addr, (4) interrupt service routine (read, store, ..., rti).]
User program progress only halted during actual transfer
1000 transfers at 1 ms each:
  1000 interrupts @ 2 µsec per interrupt
  1000 interrupt service @ 98 µsec each
  = 0.1 CPU seconds
Device xfer rate = 10 MBytes/sec => 0.1 x 10^-6 sec/byte => 0.1 µsec/byte
  => 1000 bytes = 100 µsec
  1000 transfers x 100 µsecs = 100 ms = 0.1 CPU seconds
Still far from device transfer rate! 1/2 in interrupt overhead
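The arithmetic on this slide can be checked with a short script. The variable names are illustrative, and the last computation is only one possible reading of the "1/2 in interrupt overhead" remark (CPU handling time compared with device transfer time per 1000-byte block); the slide does not spell that out.

```python
transfers = 1000
interrupt_us = 2            # cost of taking one interrupt
service_us = 98             # cost of one interrupt service routine
cpu_us_per_transfer = interrupt_us + service_us
cpu_seconds = transfers * cpu_us_per_transfer / 1_000_000

device_bytes_per_us = 10    # 10 MBytes/sec is 10 bytes per µsec
bytes_per_transfer = 1000   # the slide's "1000 bytes = 100 µsec"
device_us_per_transfer = bytes_per_transfer / device_bytes_per_us

# Assumed interpretation: share of each block's total time spent in CPU handling
overhead_fraction = cpu_us_per_transfer / (cpu_us_per_transfer + device_us_per_transfer)

print(f"CPU time for all transfers: {cpu_seconds} s")
print(f"device time per block: {device_us_per_transfer} µsec")
print(f"fraction spent in interrupt handling: {overhead_fraction}")
```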
Better Way to Handle Interrupts?
Handling all interrupts with the CPU could bring it to a halt in a
real-time system
Isn’t there a better way?
Hint: remember the trickle-down theory of embedded
processor architecture.
Trickle Down Theory of Embedded Architectures
Mainframe/supercomputers
High-end servers/workstations
High-end personal computers
Personal computers
Laptops/palmtops
Gadgets
Watches
...
Features tend to trickle down:
• #bits: 4->8->16->32->64
• ISAs
• Floating point support
• Dynamic scheduling
• Caches
• I/O controllers/processors
• LIW/VLIW
• Superscalar
I/O Interface
Independent I/O Bus
[Figure: CPU connected to Memory over the memory bus and to peripheral Interfaces over an independent I/O bus]
Separate I/O instructions (in, out)
Common memory & I/O bus
[Figure: CPU, Memory, and peripheral Interfaces on one bus]
Lines distinguish between I/O and memory transfers
Examples: VME bus, Multibus-II, Nubus
40 Mbytes/sec optimistically
10 MIP processor completely saturates the bus!
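A back-of-the-envelope check of the saturation claim, as a sketch only: the slide gives the bus bandwidth and the processor speed, while the 4 bytes of bus traffic per instruction is an assumption introduced here (one 32-bit instruction fetch, no data traffic, no cache).

```python
# Assumption (not on the slide): each instruction costs one 4-byte fetch over the bus
BYTES_PER_INSTRUCTION = 4

instructions_per_second = 10_000_000   # the slide's "10 MIP" processor
bus_bytes_per_second = 40_000_000      # the slide's optimistic 40 Mbytes/sec

demand = instructions_per_second * BYTES_PER_INSTRUCTION
utilization = demand / bus_bytes_per_second

print(f"instruction-fetch demand: {demand / 1e6:.0f} Mbytes/sec")
print(f"bus utilization: {utilization:.0%}")
```

Under that assumption instruction fetch alone uses the whole bus, leaving nothing for data or I/O transfers.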
Delegating I/O Responsibility from the CPU: IOP
[Figure: CPU, IOP, and Mem on the main memory bus; devices D1, D2, ..., Dn on the I/O bus]
(1) CPU issues instruction to IOP
    Instruction fields: OP | Device | Address
    Device: target device; Address: where cmnds are
(2) IOP looks in memory for commands
    Command fields: OP | Addr | Cnt | Other
    OP: what to do; Addr: where to put data; Cnt: how much; Other: special requests
(3) Device to/from memory transfers are controlled by the IOP directly.
    IOP steals memory cycles.
(4) IOP interrupts CPU when done
Memory Mapped I/O
Single Memory & I/O Bus
No Separate I/O Instructions
[Figure: CPU, Memory, and peripheral Interfaces on one bus; address space labelled ROM, RAM, I/O]
[Figure: CPU with $ and L2 $ on the Memory Bus; a Memory Bus Adaptor leads to the I/O bus]
Delegating I/O Responsibility from the CPU: DMA
Direct Memory Access (DMA):
• External to the CPU
• Acts as a master on the bus
• Transfers blocks of data to or from memory without CPU intervention
[Figure: CPU, Memory, DMAC, IOC, device]
CPU sends a starting address, direction, and length count to the DMAC, then issues "start".
DMAC provides handshake signals for the peripheral controller, and memory addresses and handshake signals for memory.
Direct Memory Access
[Figure: CPU, Memory, DMAC, IOC, device]
Time to do 1000 xfers at 1 msec each:
  1 DMA set-up sequence @ 50 µsec
  1 interrupt @ 2 µsec
  1 interrupt service sequence @ 48 µsec
  = .0001 second of CPU time
[Figure: memory-mapped I/O address space from 0 to n: ROM, RAM, Peripherals, DMAC]
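The DMA figures can be set against the interrupt-driven figures from the earlier slide (1000 interrupts @ 2 µsec plus 1000 service routines @ 98 µsec). A small sketch of that comparison, with illustrative variable names:

```python
US_PER_SECOND = 1_000_000

# Interrupt-driven: every one of the 1000 transfers costs the CPU 2 + 98 µsec
interrupt_driven_us = 1000 * (2 + 98)

# DMA: one set-up, one completion interrupt, one service sequence for the whole batch
dma_us = 50 + 2 + 48

interrupt_driven_seconds = interrupt_driven_us / US_PER_SECOND
dma_seconds = dma_us / US_PER_SECOND
savings_factor = interrupt_driven_us // dma_us

print(f"interrupt-driven: {interrupt_driven_seconds} CPU seconds")
print(f"DMA:              {dma_seconds} CPU seconds")
print(f"DMA needs {savings_factor}x less CPU time")
```

The CPU cost of DMA is fixed per batch rather than per transfer, which is where the factor comes from.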
68332 Family
68K was the most successful embedded controller in
history
CISC instruction set - good code density
Table lookup for compressed tables
Time processing unit - breakthrough in modular peripheral
handling!
MC68332 - Top level
[Figure: CPU32, time processing unit (TPU) with I/O channels 0 to 15, serial I/O, RAM, and IMB control on the inter-module bus (IMB)]
Designed for automotive applications with a mixture of computation-intensive tasks and complex I/O functions
Idea: off-load frequent I/O interactions from the CPU so that its computation performance can be put to use
68332 CPU Block Diagram
Addressing Modes in 68332
Seven modes
• Register direct
• Register indirect
• Register indirect with index
• Program counter indirect with displacement
• Program counter indirect with Index
• Absolute
• Immediate
Why so many modes? Antiquated architectural feature?
Complex addressing modes allow for denser code … but …
MCore, Motorola’s embedded microcontroller rewrite, uses simple DLX-like
load/store instructions - code size impact?
MC68332 Time Processing Unit
[Figure: TPU block diagram. Labelled blocks: Host Interface, System Configuration, Channel Control, Parameter RAM, Development Support and Test, Scheduler, Service Requests, Microengine, Control Store, Execution Unit, Timer Channels (Channel 0, Channel 1, ..., Channel 15), time base, Pins. Control and Data paths run between the blocks and to the IMB.]
TPU: time processing unit: peripheral coprocessor
• Independent programmable timer channels: single-shot "capture & compare"
• Channel coupling and sequence control with control processor
Time Processing Unit
Semi-autonomous microcontroller
Operates concurrently with CPU
• Schedules tasks
• Processes ROM instructions
• Accesses shared data with CPU
• Performs Input/Output
Uses of Time Processing Unit
Programmable series of two operations
• Match
• Capture
Each operation is called an “event”
A pre-programmed series of events is called a “function”
Pre-programmed functions
• Input capture/input transition counter
• Output compare
• Period measurement with additional/missing transition detect
• Position-synchronized pulse generator
• Period/pulse-width accumulator
Time Bases
Two sixteen-bit counters
provide time bases for all
Pre-scalers controlled by CPU
via bit fields in the TPU module
configuration register
TPUMCR
Current values accessible via
TCR1 and TCR2 registers
TCR1, TCR2 can be read/written
by TPU microcode - not
available to CPU
TCR1 qualified by system clock
TCR2 qualified by system clock
or external clock
Timer Channels
Sixteen channels
- each one connected to an MCU
pin
Each channel has symmetric
hardware:
• Event register
  - 16-bit capture register
  - 16-bit compare/match register
  - 16-bit comparator
• Pin control logic - pin
direction determined by
TPU microengine
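A toy software model of one channel's event register may help fix the idea of capture and match against a 16-bit time base. This is a sketch only: the class and method names are hypothetical and do not correspond to the TPU's actual registers or microcode; the one real constraint it illustrates is that elapsed time must be computed modulo 2^16 because the counter wraps.

```python
MASK16 = 0xFFFF


class TimerChannel:
    """Hypothetical model: capture register, match register, comparator."""

    def __init__(self):
        self.capture = 0     # latched time-base value at the last pin transition
        self.match = None    # time-base value to compare against, if armed

    def on_pin_transition(self, tcr):
        """Capture event: latch the current 16-bit time-base value."""
        self.capture = tcr & MASK16

    def matches(self, tcr):
        """Match event: true when the time base equals the match register."""
        return self.match is not None and (tcr & MASK16) == self.match


def elapsed_ticks(first, second):
    """Ticks between two captures, correct across one counter wrap-around."""
    return (second - first) & MASK16


ch = TimerChannel()
ch.on_pin_transition(0xFFF0)
first = ch.capture
ch.on_pin_transition(0x0010)          # the counter has wrapped past 0xFFFF
period = elapsed_ticks(first, ch.capture)

ch.match = 0x0100
print(period, ch.matches(0x0100), ch.matches(0x0101))
```

Period measurement reduces to two captures and one masked subtraction, which is the kind of work the channel hardware takes off the CPU.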
Scheduler
Determines which of sixteen
channels is serviced by the
microengine
Channel can request service
for one of four reasons
• host service
• link to another channel
• match event
• capture event
Host system assigns to each
channel a priority
• high
• middle
• low
Microengine
Another Motorola Microprocessor
Concepts so far ...
• Interrupts
• Memory Mapping of I/O
• Time Processing Unit / Peripheral Processor
Other configurable elements:
• Peripherals
• Instructions
Configurability in ARM Processor
ARM allows for configurability via the AMBA bus
Offers “PrimeCell” peripherals which hook into the AMBA
Advanced Peripheral Bus (APB)
• UART
• Real Time Clock
• Audio Codec Interface
• Keyboard and mouse interface
• General purpose I/O
• Smart card interface
• Generic IR interface
http://www.arm.com/Pro+Peripherals/PrimeCell/index.html
ARM7 core
ARM’s AMBA open standard
Advanced System Bus (ASB) - high performance: CPU, DMA, external
Advanced Peripheral Bus (APB) - low speed, low power: parallel I/O, UARTs
External interface
http://www.arm.com/Documentation/Overviews/AMBA_Intro/#intro
Ex 1: ARM Infrared (IR) Interface
Ex 2: ARM Smart Card Interface
Ex 3: Audio Codec
Another Kind of Configurability
[Figure: synthesis flow: HDL -> RTL synthesis -> netlist -> logic optimization -> netlist -> physical design -> layout, with a Library as an additional input]
Synthesis of a processor core from an RTL description allows for:
• full range of other types of configurability
• additional degrees of freedom in quality of implementation
Examples:
• ARM7
• Motorola ColdFire
• Tensilica Xtensa
Quality of Results Tradeoffs
[Figure: tradeoff curve; axes: Delay, Area]
Synthesizable implementation allows for exploration of a wide
range of implementations
ARM Core7 Thumb Embedded
Ultimate configurability: the Tensilica solution
[Figure: uP Generator producing uP Cores, alongside DSP and peripheral blocks, a pre-verified function library, and a S/W development environment]
• Fast, safe tailoring of cores
• Extensibility with synchronization to the hardware
• Ultra small and efficient, new architectures
Tensilica Viterbi Implementation
Niraj Shah
Scott Weber
290A Final Presentation
Tensilica Flow
[Figure: tool flow; labels: .c sources, TIE, uArch Designer, Tensilica Processor Generator, gen, xt-gcc, .o, xt-run]
Xtensa Architecture
[Figure: Xtensa Core connected to a TIE unit through Rs, Rt, Rr, and I]
TIE Extensions:
• single cycle
• state free
• no new exceptions
• no stalls
• typeless data
Rs, Rt, Rr are 32-bit regs
I is the instruction controlling the TIE unit
Xtensa Core is a 32-bit configurable RISC processor
Viterbi Architecture
[Figure: blocks ADC, I/O Device, Init, ACS, TraceBack, RAM; annotation: Measured Performance Here]
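TIE Setup: BMreg (ACS)
[Figure: BMreg datapath. Labels: source registers Rs and Rt holding I and Q in bits 7:0; add/subtract units; constants 0x7F; result register Rr packing bm3 (31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0); control instruction]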
ACS TIE Extension (ACS)
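[Figure: ACS datapath. Labels: a register packing bm3 (31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0); path metrics (pm); two adders; a subtractor whose msb gives the decision bit; result register Rr holding pm and the decision bit; source registers Rs and Rt; instruction variants ACS03 || ACS12 || ACS30 || ACS21]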
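For readers who have not met add-compare-select (ACS), the operation the figure implements in hardware can be sketched in software. This is the textbook Viterbi ACS step, not a transcription of the TIE datapath: the function names are hypothetical, the "smaller metric survives" convention is assumed, and the only detail taken from the figure is the packing of four 8-bit branch metrics into one 32-bit word with bm0 in bits 7:0.

```python
def unpack_branch_metrics(word):
    """Split a 32-bit word into four 8-bit fields [bm0, bm1, bm2, bm3] (bm0 in bits 7:0)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def acs(pm_a, pm_b, bm_a, bm_b):
    """Add-compare-select: return (surviving path metric, decision bit).

    Add each path metric to its branch metric, compare the two candidates,
    and select the smaller one (assumed convention). The decision bit records
    which predecessor survived, for use during traceback.
    """
    cand_a = pm_a + bm_a
    cand_b = pm_b + bm_b
    if cand_b < cand_a:
        return cand_b, 1
    return cand_a, 0


bm = unpack_branch_metrics(0x04030201)
print(bm)
print(acs(10, 12, bm[0], bm[3]))
print(acs(10, 7, bm[3], bm[0]))
```

In software each ACS costs two adds, a compare, and a select per state, which is why it is a natural candidate for a single-cycle custom instruction.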
ACS TIE Extension with State (ACS)
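[Figure: ACS datapath with state. Labels: a register packing bm3 (31:24), bm2 (23:16), bm1 (15:8), bm0 (7:0); two ACS units, one fed from Rs and one from Rt, each with two adders, path metrics (pm), and a subtractor whose msb gives a decision bit; result register Rr holding two pm values and the decision bits; control instruction]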
TIE Zmask (TraceBack)
[Figure: Zmask datapath. Labels: source registers Rs and Rt; AND (&) and OR (|) units; masks 0x7F and 0x3F; shift left by 1 (<<1); result register Rr; control instruction]
Designs
All designs had a BER of 0.000095 after 10 million iterations
Design 1
100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
Design 1+
222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
Design 2-
100 MHz, 69 mW, 16K DCache, 16K ICache, TIE
Design 2
222 MHz, 191 mW, 16K DCache, 16K ICache, TIE
Design 3
222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state
Performance
[Bar chart: throughput in Kb/s per design]
Design       Cache    Perfect Cache
Design 1       118          409
Design 1+      263          909
Design 2-      357          409
Design 2       793          909
Design 3       966         1142
Energy Dissipation
[Bar chart: energy in uJ/bit per design]
Design       Cache    Perfect Cache
Design 1      0.4         0.12
Design 1+     0.54        0.16
Design 2-     0.19        0.17
Design 2      0.24        0.21
Design 3      0.2         0.17
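The energy-per-bit figures follow from the power numbers on the Designs slide and the throughput numbers on the Performance slide: power divided by throughput. A short sketch of that check follows; the pairing of chart values to designs is read off the charts and should be treated as an assumption, as should the reading of Kb/s as 1000 bits per second.

```python
# (power in mW, throughput in Kb/s as cached, throughput in Kb/s with a perfect cache)
designs = {
    "1":  (48, 118, 409),
    "1+": (144, 263, 909),
    "2-": (69, 357, 409),
    "2":  (191, 793, 909),
    "3":  (191, 966, 1142),
}


def energy_per_bit_uj(power_mw, throughput_kbps):
    """mW / (Kb/s) = (1e-3 J/s) / (1e3 bit/s) = 1e-6 J/bit, i.e. uJ/bit."""
    return power_mw / throughput_kbps


energy = {
    name: (energy_per_bit_uj(power, cached), energy_per_bit_uj(power, perfect))
    for name, (power, cached, perfect) in designs.items()
}

for name, (cached_uj, perfect_uj) in energy.items():
    print(f"Design {name:2}: {cached_uj:.2f} uJ/bit cached, "
          f"{perfect_uj:.2f} uJ/bit perfect cache")
```

Design 1+ runs faster than Design 1 but spends more energy per bit with the small cache, because power triples while cached throughput only roughly doubles.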
n(s*J)/Bit
[Bar chart: n(s*J)/Bit per design]
Design       Cache    Perfect Cache
Design 1      3.39        0.293
Design 1+     2.05        0.176
Design 2-     0.532       0.416
Design 2      0.315       0.231
Design 3      0.207       0.148
Die Area
[Bar chart: die area in mm^2 per design]
Design       Cache    Perfect Cache
Design 1      2.1         2.1
Design 1+     2.37        2.37
Design 2-     6.14        6.14
Design 2      6.7         6.7
Design 3      6.7         6.7
Summary: Levels of Configurability
Configurability is highly desirable in embedded
applications
There are many levels of configuration:
• Memory and cache sizes - get precisely the sizes your
application needs
• Register file sizes
• Interrupt handling and addresses
• Peripherals
• Instructions
• Physical implementation