
POLITECNICO DI TORINO

III Facoltà di Ingegneria

Degree Course in Electronic Engineering

Degree Thesis

LISARM: embedded ARM platform design and optimization

Supervisors: Prof. Guido Masera, Ing. Maurizio Martina, Ing. Fabrizio Vacca

Candidate: Carlo Ceriani

April 2007

To my mother, to my father and . . .

to those who have believed in me


Acknowledgements

My first and greatest thanks go to my mother, for the fundamental support she has given me throughout these long years of study, for never letting me lack her trust and for knowing how to give me the right encouragement, especially in the most difficult moments. In these lines I cannot fail to remember my father, in particular for teaching me that, by rolling up one's sleeves and trusting in one's own abilities, one can always push further and broaden one's horizons.

I thank my supervisor, Prof. Guido Masera, and my co-supervisors, Maurizio Martina and Fabrizio Vacca, for their essential advice, for guiding me at the crucial junctures of my work and for making available the resources I needed.

I thank the other members of the VLSILab, with whom I had the pleasure of sharing this experience, for always being willing to help solve the multitude of everyday problems that arose. Special thanks go to Federico Quaglio, for the help he gave me both in the research and development phase of the project and in the writing of this document.

Since this is the concluding act of a long course of study, but also and above all to mark an important stretch of my life, I thank all those who, along this path, have enriched my life with knowledge, with experience, and also simply with pleasant moments of leisure.


Summary

The spread of electronic devices into many aspects of everyday life has deeply changed not only the constraints of industrial production but also the technologies underlying the applications the market requires. Although System-on-Chip technology allows heterogeneous components to be placed on the same die, the development time of hardwired technologies and the tight constraints imposed by economic return have led to a search for new approaches. Hardware-software partitioning is one of the most widely applied techniques; it divides the complexity of the target application between two levels: the design of a powerful and flexible programmable system, and the implementation of complex algorithms that satisfy market demand. The development phase must be coordinated among the design groups, so that this approach can ensure shorter product implementation times. Other constraints, on power consumption and occupied area, are also important, particularly for mobile devices, which must offer long battery life together with the higher performance, relative to earlier applications, that customers require. In this branch of technology, microprocessor-based platforms are the most widespread, and the ARM7TDMI processor is a successful product thanks to its noteworthy performance and low-power characteristics. Embedded processors are not the only solution: although the architectures available on the market provide many of the characteristics manufacturers request, they are sometimes not tailor-made for critical applications, or their structure is too complex, with dramatic effects on power consumption and area occupation. A different solution is represented by ASIPs, i.e. processors specifically designed for target applications, which provide a dedicated instruction set built around the software algorithms that have to be executed on them.

Programmable architectures are designed with dedicated software environments that allow the instruction set to be described flexibly and enable code reuse, by writing it in an Architecture Description Language such as LISA 2.0. The LISATek Toolsuite and the Language for Instruction-Set Architecture allow the behavior of a processor to be described in all its aspects, including timing, integrating technologies such as pipelining and caching and making it possible to obtain a hardware description in HDL, a powerful simulator and all the dedicated tools for software development. The aim of this thesis is to explore the possibilities offered by this software environment in the development of a programmable platform based on the ARM7 processor, whose extensive documentation, a result of its many applications, allows an in-depth analysis of the characteristics to be transferred to the model.

Chapter 1 Contains a brief review of all the topics treated in this thesis and an extended summary in Italian, as required by the university rules for theses written in a foreign language.

Chapter 2 Introduces some concepts of computer architecture and gives a historical outline of the evolution of computers and microprocessors.

Chapter 3 Describes the ARM7TDMI processor, from its programmer's model to an in-depth analysis of the architecture, covering its instruction set and the interfacing of the core with external systems.

Chapter 4 Introduces the LISATek toolsuite, a powerful software environment for ASIP modeling and the principal instrument used for the development and verification of LISARM.

Chapter 5 Describes the LISARM processor model, reporting the guidelines followed in the development of its various parts and the architectural solutions adopted to obtain a coherent ARM7 model, in both behavior and internal structure.

Chapter 6 Describes the tools obtained from the model description by using the LISATek automatic generation tools, together with some external solutions for compatibility issues, such as memory wrapping and toolchain adaption.

Chapter 7 Contains some concluding remarks on the thesis work and outlines some hypotheses about future applications of the material produced.


Contents

Acknowledgements

Summary

1 Extended summary
    1.1 Introduction
    1.2 The RISC processor architecture
    1.3 Architecture of the ARM7 microprocessor
    1.4 The LISATek development environment
    1.5 The LISA model of the ARM7
    1.6 Development tools for ARM7
    1.7 Conclusions and future developments

2 The RISC microprocessor architecture
    2.1 The Von Neumann architecture
    2.2 Harvard architecture
    2.3 The increased processor complexity
    2.4 The RISC architecture
    2.5 Pipelining and cache technology
    2.6 RISC vs CISC architecture

3 The ARM microprocessor architecture
    3.1 The ARM processor family
    3.2 The Thumb concept
    3.3 The programmer model
        3.3.1 Operating states and state switching
        3.3.2 Memory formats and data types
        3.3.3 Operating modes
        3.3.4 Processor resources
        3.3.5 The Processor Status Registers (PSRs)
    3.4 The exception handling
        3.4.1 Processor reset
        3.4.2 Interrupt and fast interrupt requests
        3.4.3 Abort conditions
        3.4.4 Software interrupts and supervisor mode
        3.4.5 Undefined instruction
        3.4.6 Exception priorities
    3.5 ARM instruction set
        3.5.1 Conditional execution
        3.5.2 Branch and exchange (BX)
        3.5.3 Branch and branch with link (B-BL)
        3.5.4 Data processing instructions
        3.5.5 PSR transfer instructions
        3.5.6 Multiply and multiply and accumulate (MUL-MLA)
        3.5.7 Multiply and multiply and accumulate long (MULL-MLAL)
        3.5.8 Single data transfer operations (LDR-STR)
        3.5.9 Halfword and signed data transfer operations
        3.5.10 Block data transfer operations (LDM-STM)
        3.5.11 Single data swap (SWP)
        3.5.12 Software interrupt
        3.5.13 Coprocessor instructions
        3.5.14 Undefined instruction
    3.6 Thumb instruction set
    3.7 The memory interface
    3.8 The coprocessor interface
    3.9 The debugging system

4 LISATek toolsuite
    4.1 The ASIP design flow
    4.2 Architecture exploration
    4.3 The architecture description: the LISA language
        4.3.1 Memory model
        4.3.2 Resource model
        4.3.3 Instruction-set model
        4.3.4 Behavioral model
        4.3.5 Timing model
        4.3.6 Microarchitecture model
    4.4 The LISATek model development tools
        4.4.1 The Processor Designer
        4.4.2 The Instruction-set Designer
        4.4.3 The Syntax Debugger
    4.5 The architecture implementation
    4.6 The application software design
        4.6.1 Assembler and linker
        4.6.2 Disassembler
        4.6.3 Simulator: the “Processor Debugger”
        4.6.4 The C-Compiler
    4.7 The system integration and verification

5 The LISARM model
    5.1 The model structure
        5.1.1 Processor resources, interface, internal units
        5.1.2 The main LISA operation
        5.1.3 The coding tree and the decoding mechanism
    5.2 The processor datapath
        5.2.1 The barrel shifter unit
        5.2.2 The arithmetic logic unit
        5.2.3 The 32x8 bit multiplier
    5.3 Other LISA operations
    5.4 The branch instructions
    5.5 Data processing instructions
    5.6 PSR transfer instructions
    5.7 Multiplication instructions
    5.8 Single data transfer instructions
    5.9 Block data transfer instructions
    5.10 The data swap instruction
    5.11 Software interrupt and undefined instructions

6 LISARM support tools
    6.1 The ARM LISA simulator
    6.2 The memory wrapping
    6.3 ARM commercial toolchains
    6.4 ARM model toolchain adaption
    6.5 HDL generation and tests

7 Conclusions and possible future applications
    7.1 Conclusions
    7.2 Possible future applications

A Model LISA operations summary

Bibliography

Chapter 1

Extended summary

1.1 Introduction

The spread of electronic devices into many aspects of everyday life has profoundly changed the structure of industrial production and the technologies underlying the applications that the market demands. Although System-on-Chip technologies have made it possible to integrate even very heterogeneous electronic components on a single die, the development times of hardwired technologies and the stringent constraints imposed by the logic of economic return have led to the search for new approaches. The most widely used technique today is probably hardware-software partitioning, which consists of dividing the complexity of the application under development between two different levels: the design of a powerful and flexible programmable integrated system, and the production of complex algorithms capable of meeting the needs of the market. The study of the application must be carried out in a coordinated manner among the groups responsible for its design, and this approach is able to guarantee shorter product development times. Other specifications, no less important, concern low power consumption and small size for portable devices, which must guarantee long battery life in the face of the ever greater performance demanded by users. In this technological field, microprocessor-based programmable platforms are among the most widespread, and the ARM7TDMI is one of the most successful processors, thanks to its remarkable performance and its low-power characteristics. The use of embedded processors is not, however, the only technological solution currently in use; although the architectures available on the market provide many of the characteristics required by manufacturers in the sector, they are sometimes not sufficiently tailored to certain critical applications, or they have an overly complex structure that penalizes area occupation and power consumption. The alternative is represented by ASIPs, i.e. processors designed for specific applications that provide a dedicated instruction set, built around the needs of the software algorithms that must be executed.

Programmable architectures are designed using software development environments that allow the instruction set to be described in a flexible and reusable way, by writing code in an Architecture Description Language such as LISA 2.0. The LISATek Toolsuite and the Language for Instruction-Set Architecture make it possible to describe the operation of a processor in all its aspects, both behavioral and temporal, integrating current technologies such as pipelines and caches and making it possible to obtain a hardware description in HDL, a powerful simulator and the tools needed to develop the dedicated software. The purpose of this thesis is to explore the possibilities offered by this software tool for the design of a programmable platform based on the ARM7 processor, whose extensive documentation, a result of the many applications based on it, allows an in-depth analysis of the characteristics to be reproduced in the model.

1.2 The RISC processor architecture

The Von Neumann architecture has been one of the most widely used architectures since the first pioneering automatic computer projects. It uses a single structure to store both the data and the code to be executed and, despite its simplicity, it revolutionized the then-dominant conception of automatic computation by introducing instruction-set architectures (ISAs). Before it was formalized, computers operated on the basis of a predetermined program, performing an unmodifiable series of operations on the input data. With the introduction of instruction-based architectures, programming models and a set of methods for accessing system resources, the operations to be performed during a computation were formalized as a sequence of operation codes written in machine language. The main advantage of this approach is, evidently, that the logic implementing the computing system does not have to be redesigned whenever new computational needs arise.
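The stored-program idea described above can be illustrated with a toy accumulator machine in which instructions and data live in the same memory. This is only a sketch: the four opcodes, the two-cell instruction format and the name `run` are hypothetical and do not correspond to any real instruction set.

```python
# Toy stored-program machine: code and data share a single memory.
# Opcodes and the (opcode, operand) two-cell format are hypothetical.
LOAD, ADD, STORE, HALT = 1, 2, 3, 0


def run(memory):
    """Execute the program stored from address 0 and return the final memory."""
    mem = list(memory)   # one memory for both instructions and data
    pc = 0               # program counter
    acc = 0              # accumulator
    while True:
        opcode, operand = mem[pc], mem[pc + 1]   # fetch
        pc += 2
        if opcode == LOAD:                       # decode and execute
            acc = mem[operand]
        elif opcode == ADD:
            acc += mem[operand]
        elif opcode == STORE:
            mem[operand] = acc
        elif opcode == HALT:
            return mem
        else:
            raise ValueError(f"unknown opcode {opcode}")


# Program: mem[10] = mem[8] + mem[9]
program = [
    LOAD, 8,     # addresses 0-1
    ADD, 9,      # addresses 2-3
    STORE, 10,   # addresses 4-5
    HALT, 0,     # addresses 6-7
    20, 22,      # data at addresses 8 and 9
    0,           # result at address 10
]
result = run(program)
```

Changing the computation only requires a different list of opcodes; the interpreter loop, which plays the role of the hardware logic, stays the same.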

Although this architecture was of epochal importance in the world of computers, it has a drawback known as the Von Neumann bottleneck: the bottleneck effect caused by the difference between the instruction execution rate (throughput) of the processor and the transfer rate of the memory, a difference due to the different technologies used to build them. In some applications, when the processor executes a limited number of instructions on a large quantity of data, the memory access bottleneck seriously reduces the processing speed of the system.

In contrast to the Von Neumann approach, the Harvard architecture uses a storage system for code that is independent of the one for data, including the necessary control signals and communication buses. In a Harvard architecture the characteristics of the two types of memory need not be identical: they may differ in implementation technology, data width, access time, addressing method and logic, and structure. The salient feature of this approach is that the CPU can read an instruction from code memory while a read or write access is being performed on a location in data memory. This technique makes the Harvard structure faster on average than the one seen previously but, obviously, the complexity of the system grows, and with it the silicon area occupied. This architecture also suffers from the mismatch between CPU speed and memory speed: if a program must access memory on every clock cycle, the achievable instruction execution rate is closer to the memory access speed than to the performance offered by the CPU itself.
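The benefit of overlapping instruction fetch and data access can be estimated with a deliberately idealized cycle count. The sketch below assumes that every memory access takes one cycle, that each instruction performs at most one data access, and that nothing else stalls the processor; the function names are illustrative only.

```python
def von_neumann_cycles(n_instructions, n_data_accesses):
    """Single shared memory port: fetches and data accesses are serialized."""
    if n_data_accesses > n_instructions:
        raise ValueError("model assumes at most one data access per instruction")
    return n_instructions + n_data_accesses


def harvard_cycles(n_instructions, n_data_accesses):
    """Separate code and data ports: a data access overlaps the next fetch."""
    if n_data_accesses > n_instructions:
        raise ValueError("model assumes at most one data access per instruction")
    return n_instructions


# 100 instructions, 40 of which are loads or stores
shared = von_neumann_cycles(100, 40)
split = harvard_cycles(100, 40)
speedup = shared / split
```

In this model the advantage grows with the fraction of instructions that touch data memory, and it vanishes for code that never loads or stores.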

[Figure 1.1: block diagrams showing ALU, accumulator, control unit, main memory, input and output]

Figure 1.1. Examples of architectures: (a) Von Neumann, (b) Harvard

The problem of the speed difference between the processor and system memory was one of the first factors behind the considerable increase in the complexity of processor architectures, especially once CPU performance had become an order of magnitude higher than that of the fastest memory systems. To reduce the effects of this imbalance, some high-level procedures that used to be executed as subroutines were integrated into the instruction sets, and microprogramming also became widespread, as integration technologies allowed it to overtake architectures that used fully hardwired instruction execution logic.

Another factor behind the ever-increasing complexity of commercial microprocessors was the need to maintain compatibility with earlier members of the same processor family, sometimes satisfied through emulation microprograms. The growing popularity of high-level programming languages also demanded more performance from CPUs, in an attempt to bridge the semantic gap between the capabilities offered by existing instruction sets and those of increasingly powerful languages. Again with the aim of giving programmers greater flexibility and power, especially those who worked directly in assembly language, orthogonal instruction sets were introduced, i.e. sets able to accept any memory addressing mode for every type of instruction implemented. The considerable boost that high-level programming languages gave to the development of application software led to the need for memories of ever greater capacity, with dramatic effects on the cost of entire systems. To reduce these effects, some manufacturers invested considerable effort in achieving greater instruction code density, although instructions of this kind require more information to be encoded and therefore a larger number of bits, at the expense of the memory space required and of the complexity of the instruction decoding and execution logic. The use of complex instruction sets also increased the amount of information to be saved at every interrupt event, with a consequent increase in the number of registers (shadow registers) and in the microcode needed to manage them. This aspect is highly critical for applications such as Digital Signal Processors (DSPs) and automation microcontrollers, where a maximum latency must be guaranteed for the systems to function.

In contrast to the growing complexity of the microprocessors of the time, a new and simpler architectural approach called Reduced Instruction Set Computer (RISC) was introduced in the early 1980s; it implemented a reduced instruction set compared with classic products. Since these architectures use only two instructions to access data in memory, they are also called load-store architectures, and they perform all operations between registers rather than directly on operands held in memory. In opposition to this acronym, classic architectures came to be called Complex Instruction Set Computers (CISC). RISC processors are characterized by a uniform instruction encoding, with instructions all of the same length and operation codes always placed in the same position, to allow faster decoding. The addressing modes are also fewer and simpler, although in RISC processors other simple operations can be used to compute addresses. The number of native data types implemented is reduced to a minimum, but they are handled more flexibly: RISC processors have no string-handling operations or other complex instructions, and individual instructions are often executed in a single clock cycle. RISC processors generally provide a set of general-purpose registers usable in any context, and the bits saved by the simpler encoding can be used to embed constants in the instruction encoding itself, reducing the number of memory accesses.
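A fixed-length encoding with fields in fixed positions reduces decoding to a handful of shifts and masks. As an illustration, the sketch below splits a 32-bit word according to the ARM data-processing instruction layout (condition in bits 31-28, immediate flag in bit 25, opcode in bits 24-21, S flag in bit 20, Rn in bits 19-16, Rd in bits 15-12, second operand in bits 11-0). The function name and the dictionary keys are illustrative choices, not part of any tool described in this thesis.

```python
def decode_data_processing(word):
    """Split a 32-bit ARM data-processing instruction word into its fields."""
    return {
        "cond": (word >> 28) & 0xF,       # condition code
        "immediate": (word >> 25) & 0x1,  # 1 if operand2 encodes a constant
        "opcode": (word >> 21) & 0xF,     # operation (e.g. 4 = ADD, 13 = MOV)
        "set_flags": (word >> 20) & 0x1,  # S bit
        "rn": (word >> 16) & 0xF,         # first operand register
        "rd": (word >> 12) & 0xF,         # destination register
        "operand2": word & 0xFFF,         # register/shift or rotated constant
    }


add_fields = decode_data_processing(0xE0810002)  # ADD r0, r1, r2
mov_fields = decode_data_processing(0xE3A00001)  # MOV r0, #1
```

The second word shows a constant embedded directly in the instruction: the immediate flag is set and the value 1 sits in the low bits of the second-operand field, so no extra memory access is needed to obtain it.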


Around the 1980s it was believed that the theoretical limits of processor execution speed had been reached, with improvements in fabrication technology, through ever smaller transistors and interconnect lines, regarded as the only way to increase it. In that period the idea emerged of dividing the execution units into several stages and inserting registers between these stages, so that operations could be performed on several instructions at the same time. In this way, while one instruction is being fetched from memory, another can be decoded and a third executed, in three different units all working together. Apart from the latency needed to fill the chain completely, this technique, called pipelining, speeds up instruction execution by fully exploiting all parts of the architecture. The structure outlined is only an example; the trend has been towards ever longer pipelines in order to obtain ever better performance.

This technology also has some drawbacks due to the branch delay, i.e. the time cost of refilling the register chain when a conditional branch occurs, which makes it impossible to know in advance which operation will be executed next (branch hazard or control hazard). To overcome these drawbacks, various techniques are used, such as branch target prediction, speculative execution and out-of-order execution, so that the destination of a branch can be predicted, or other instructions can at least be executed while waiting to learn where program execution will continue, even at the cost of performing useless operations. Clearly this kind of approach can improve performance but is unsuitable for low-power applications.

Another very widespread technology is that of superscalar processors, in which several execution units are integrated and some consecutive instructions are executed in them in parallel. This solution also has a drawback, due to data hazards, i.e. the need for data that are currently being processed in parallel in another unit. Another problem of pipelined processors is structural hazards, i.e. events in which two or more instructions need to use the same hardware resources at the same time. Despite the issues discussed, all of the above techniques are used successfully in many processors, and the evolution of processors is based precisely on explicit superscalar technologies, i.e. those implemented in hardware but already supported by compilers, which select in advance the instructions that can be executed together and pack them into a multiple instruction.
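The gain from pipelining, and the cost of refilling the pipeline after a branch, can be estimated with a simple cycle count. The sketch below is an idealized model, not a description of any specific processor: it assumes one cycle per stage, no data or structural hazards, and that each taken branch discards the `n_stages - 1` instructions already in flight. The function names are illustrative.

```python
def sequential_cycles(n_instructions, n_stages=3):
    """Without pipelining, each instruction occupies all stages in turn."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=3, taken_branches=0):
    """Ideal pipeline: one instruction completes per cycle once it is full."""
    if n_instructions == 0:
        return 0
    fill = n_stages - 1   # cycles needed to fill (or refill) the pipeline
    return fill + n_instructions + taken_branches * fill


no_pipeline = sequential_cycles(100)                     # fetch, decode, execute in turn
ideal = pipelined_cycles(100)                            # three-stage pipeline, no branches
with_branches = pipelined_cycles(100, taken_branches=10)
```

The model also shows why longer pipelines are a mixed blessing: the refill term grows with the number of stages, so every mispredicted or unpredicted branch becomes more expensive.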

Another very widespread technique, which improves memory access performance, is the use of caches, i.e. small amounts of very fast memory that map regions of system memory. Processors often have two types of cache, one for instructions and one for data, in which they look for the required item: if it is present (cache hit) the access time is very short, whereas otherwise (cache miss) the item must first be loaded into the cache from main memory, with a consequent loss of performance and the usual memory access times.

Figure 1.2. Example of a pipelined Harvard architecture (DLX)

The RISC design approach makes it possible to exploit the above techniques to the full, allowing the occupied area to be devoted to large amounts of cache and pipeline registers, as well as other general-purpose registers and other integrated resources such as controllers and interfaces.
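The hit and miss behavior of a cache described above can be made concrete with a small simulation. The sketch below models a direct-mapped cache, chosen here only because it is the simplest organization; the text does not say which organization is meant. The number of lines, the line size and the access costs are arbitrary illustrative values.

```python
def simulate_direct_mapped(addresses, n_lines=4, line_size=4):
    """Count hits and misses for a sequence of byte addresses."""
    tags = [None] * n_lines          # tag stored in each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size    # memory block containing the address
        index = block % n_lines      # cache line the block maps to
        tag = block // n_lines       # identifies which block is resident
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag        # load the block, evicting the old one
    return hits, misses


def average_access_time(hits, misses, hit_time=1, miss_penalty=10):
    """Mean cycles per access; a miss pays the hit time plus the penalty."""
    total = hits + misses
    if total == 0:
        raise ValueError("no accesses recorded")
    return (hits * hit_time + misses * (hit_time + miss_penalty)) / total


first_pass = simulate_direct_mapped(range(16))
two_passes = simulate_direct_mapped(list(range(16)) * 2)
avg_first = average_access_time(*first_pass)
avg_two = average_access_time(*two_passes)
```

Scanning 16 consecutive bytes misses once per four-byte line, and a second pass over the same data hits every time because all four blocks are still resident, which lowers the average access time.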

In some circumstances RISC architectures offer significant advantages over CISC systems, and vice versa; in fact most current systems cannot be described as purely RISC or purely CISC, and over time the two approaches have evolved towards each other, blurring the features that originally distinguished them. CISC processors are based on the principle of reducing the time spent fetching instructions from memory by concentrating elementary operations inside complex instructions, even if these require several machine cycles to execute. A CISC processor has the following main characteristics:

• It uses microcode resident in internal ROMs to simplify the control unit, avoiding excessive hardware implementation.

• By using internal ROMs for instructions, which are faster to access than system memory, it improves execution performance.

• It allows richer instruction sets, with variable-length instructions, simple or more complex, and with numerous addressing modes when the set is not orthogonal.

• It has fewer internal registers, given that instructions can operate directly on memory without needing to store its addresses.

Among the properties of a RISC processor, the following stand out:

• It uses a simplified instruction set to improve performance with a simpler architecture.

• Most instructions are executed in a single machine cycle.

• It uses pipelining, pre-fetching and speculative execution techniques.

• Memory interfacing uses only two instructions and therefore simpler mechanisms.

• It guarantees higher performance in floating-point calculations.

• It has a small number of addressing modes.

• It has a large number of registers.

• It does not use microcode; instructions are executed by hardwired units.

• It relies on the compiler to minimize the complexity of software applications.

The number of instructions of RISC processors guarantees a faster and simpler decoding procedure than that of CISC processors; moreover, some statistics suggest that compilers use only about 20% of the instructions of the latter. Variable-length instructions that execute over several machine cycles are also often a problem, whereas many RISC instructions execute in a single machine cycle. Unfortunately RISC processors also have some shortcomings: programmers must pay considerable attention to instruction scheduling, to prevent the processor from wasting machine cycles waiting for instructions to execute; again because of instruction scheduling, debugging is more complex than on a CISC. In terms of area occupation, a CISC can clearly pose implementation problems compared with a simpler architecture, and development times can also favor RISC processors which, with shorter design cycles, can exploit more recent, more efficient and cheaper process technologies. A further major advantage of RISC processors is their low power consumption which, again a consequence of reduced internal complexity, has allowed these products to win a large share of the automotive and portable equipment markets.


1.3 Architecture of the ARM7 microprocessor

The ARM7TDMI microprocessor is a member of the ARM processor family that uses a 32-bit RISC architecture whose internal structure follows the Von Neumann model. It integrates a three-stage pipeline, which keeps all its internal units working continuously and optimally. Its instruction set and the related decoding mechanism are very simple compared with microprogrammed systems such as CISC processors, and this lower complexity translates into fast instruction execution and interrupt response, making the processor suitable also for real-time applications such as DSP and automatic control. Thanks to its performance and low-power characteristics, the ARM7TDMI has become widespread in embedded applications and is integrated in many commercial portable devices, where these aspects are highly critical. Entire families of microcontrollers exploit the capabilities of the ARM7TDMI processor, whose low internal complexity translates into lower manufacturing costs than equivalent platforms.

A distinctive feature of this version of the processor, although not implemented in the model built in this work, is the Thumb micro-architecture, i.e. a portion of the 32-bit architecture that implements a 16-bit instruction set whose behavior is equivalent to a subset of the full ARM instruction set. This solution achieves high code density while retaining most of the processor's performance, and it is a feature unique to the ARM processor family, starting from this model. The Thumb decoding logic translates each instruction on the fly into the full ARM instruction set, and both instruction sets can be used within the same source program, since the processor can execute either by changing its internal state. To switch from one state to the other, both instruction sets provide a dedicated instruction called branch and exchange (BX), which also jumps to another part of the running code.

The processor can operate on data of different sizes: in addition to the standard word (32 bits), it handles single bytes (8 bits) and halfwords (16 bits), and numeric data can be specified as signed (in two's complement notation) or unsigned. Memory can be organized as big endian or little endian, and the processor has a memory interface that handles both static and dynamic memories flexibly, even with mixed resources within the same system.

Besides the mode used for normal execution of the user program, the processor can operate in other modes that handle events such as interrupts and error conditions arising during operation. For this reason seven operating modes are provided, six of which are privileged, including a supervisor mode dedicated to the operating system.

The processor has thirty-seven registers: thirty-one are general-purpose registers, while the others store the internal state of the system (program status registers, PSR). These registers cannot all be used at once: in normal execution mode only sixteen of them (one of which is the program counter) are accessible, while a set of banked registers becomes visible only when the processor operates in the other modes discussed above. The ARM7 provides two different interrupt modes, one of which is expressly designed to allow fast switching of the running process: to avoid saving the contents of many registers to memory before entering Fast Interrupt Request (FIQ) mode, this mode has seven dedicated registers. Apart from interrupt and fast interrupt handling, there are modes for instruction and data aborts, which handle errors in memory operations, a mode for undefined instructions and coprocessor interfacing, and an interrupt mode that can be entered through a software instruction (SWI). Each of these modes has reserved registers for saving the address taken from the program counter, i.e. the instruction to return to when the exception handling routine ends, and for storing the processor state.
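The register counts quoted above can be checked with a small model of register banking. The sketch below is only illustrative: the mode abbreviations, the function name and the choice of which registers are banked in each mode follow the usual ARM7TDMI convention (R8-R14 for FIQ, R13-R14 for the other exception modes) and are assumptions, not part of the LISARM model.

```python
# Hypothetical model of ARM7TDMI register banking (all names are illustrative).
MODES = ["usr", "sys", "fiq", "irq", "svc", "abt", "und"]

# Registers private to each exception mode; every other register is shared
# with User mode. The banking layout is assumed from the ARM convention.
BANKED = {
    "fiq": range(8, 15),   # R8-R14: the seven dedicated FIQ registers
    "irq": range(13, 15),  # R13 (stack pointer) and R14 (link register)
    "svc": range(13, 15),
    "abt": range(13, 15),
    "und": range(13, 15),
}


def physical_register(mode, index):
    """Map (mode, R0..R15) to the name of the physical register accessed."""
    if mode not in MODES:
        raise ValueError("unknown mode")
    if not 0 <= index <= 15:
        raise ValueError("register index must be in 0..15")
    if index in BANKED.get(mode, ()):
        return f"R{index}_{mode}"
    return f"R{index}"


# Enumerate every physical register reachable from any mode.
general = {physical_register(m, i) for m in MODES for i in range(16)}
status = {"CPSR"} | {f"SPSR_{m}" for m in BANKED}
```

Under these assumptions `general` holds 31 names and `status` holds 6, which gives the 37 registers mentioned in the text, with only 16 general-purpose registers visible in any single mode.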

To speed up operations that the processor would carry out in suboptimal time, an external computing unit can be attached to the same data bus the ARM uses to communicate with main memory. In this configuration the coprocessor can access memory indirectly, although it is the processor that must generate the addresses. When the processor does not recognize an instruction read from memory, it forwards it to the coprocessors through a set of three handshaking signals; if one of them recognizes the instruction and can execute it, the computation starts in parallel with other operations the processor performs while waiting for the coprocessor to deliver its results. Otherwise, if no external unit recognizes the instruction, it is undefined and the processor enters the undefined mode and runs the corresponding exception handler.
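The dispatch rule described above can be sketched as follows. This is a behavioral illustration only: the class and function names are hypothetical, the three handshaking signals are reduced to a single `accepts` call, and the position of the coprocessor number (bits 11 to 8 of the instruction word) is assumed from the standard ARM coprocessor instruction format.

```python
class UndefinedInstruction(Exception):
    """Raised when no coprocessor accepts the instruction."""


class Coprocessor:
    def __init__(self, number):
        self.number = number

    def accepts(self, instr):
        # Assumption: the CP# field sits in bits 11..8 of the instruction.
        return (instr >> 8) & 0xF == self.number

    def execute(self, instr):
        return f"cp{self.number} executed {instr:#010x}"


def dispatch(instr, coprocessors):
    """Offer the instruction to each coprocessor; trap if none accepts it."""
    for cp in coprocessors:
        if cp.accepts(instr):
            return cp.execute(instr)
    # No unit recognized the instruction: take the undefined exception.
    raise UndefinedInstruction(hex(instr))
```

For example, `dispatch(0xEE010F10, [Coprocessor(15)])` is accepted by coprocessor 15, while the same word offered only to `Coprocessor(14)` raises `UndefinedInstruction`.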

When the processor operates in ARM state, it reads 32-bit instructions from memory and decodes them against the full instruction set. All ARM instructions are executed conditionally: each contains a four-bit field expressing a condition on the four flags held in the PSR, and the instruction is actually executed only if that condition holds. The PSR contains four flags commonly found in processors: the carry flag (C bit), the negative flag (N bit), the zero flag (Z bit) and a bit that signals overflow (V bit). All of these bits are set according to the last result produced by the ALU or, more precisely, according to the outcome of the last instruction that requested their update.
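As an illustration of conditional execution, the following sketch evaluates the four-bit condition field against the N, Z, C and V flags. The encoding of the sixteen conditions is assumed from the standard ARM instruction set and the function name is hypothetical; it is not taken from the LISARM condition checker.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if the 4-bit condition field holds for flags N, Z, C, V."""
    checks = [
        z,                    # 0000 EQ: equal
        not z,                # 0001 NE: not equal
        c,                    # 0010 CS: carry set
        not c,                # 0011 CC: carry clear
        n,                    # 0100 MI: negative
        not n,                # 0101 PL: positive or zero
        v,                    # 0110 VS: overflow
        not v,                # 0111 VC: no overflow
        c and not z,          # 1000 HI: unsigned higher
        (not c) or z,         # 1001 LS: unsigned lower or same
        n == v,               # 1010 GE: signed greater or equal
        n != v,               # 1011 LT: signed less than
        (not z) and n == v,   # 1100 GT: signed greater than
        z or n != v,          # 1101 LE: signed less or equal
        True,                 # 1110 AL: always
        False,                # 1111 NV: reserved, treated here as never
    ]
    return bool(checks[cond & 0xF])
```

An instruction whose condition evaluates to False behaves as a no-operation, which is how short conditional sequences avoid branches.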

Figure 1.3. Block diagram of the ARM7TDMI core

The instructions of the ARM set can be grouped as follows:

• Branch: perform conditional and unconditional jumps to other portions of code, such as subroutines; the BX instruction also switches the instruction set in use.

• Data processing: perform the usual ALU operations on two operands, such as signed and unsigned additions and subtractions, bitwise Boolean operations, masking and comparison operations.

• Multiply: compute the product of two 32-bit operands, taking their sign into account, and return the result on 32 or 64 bits.

• Single data transfer: move values between memory and the internal registers, with optional sign extension for data types narrower than the 32-bit word.

• Block data transfer: support stacks in memory by transferring a subset or the whole set of the processor's internal registers.

• PSR transfer: modify the status registers (PSR) and transfer them to and from the general-purpose registers; the ALU flags alone can also be modified.

• Coprocessor: provide communication between the processor and the coprocessors attached to it, and let the coprocessors perform memory accesses.

In addition to the instructions listed, there are the software interrupt instruction mentioned above and an operation that exchanges a value in memory with a register. The latter instruction (swap) is essential for managing special protected variables called semaphores, which are needed to implement an operating system. It consists of a memory read followed by a write, which are consecutive and indivisible so that no other device can access memory and change its contents in between. To guarantee this, the memory management system is notified that a swap is in progress through a dedicated signal (LOCK).
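The role of the swap instruction in semaphore handling can be sketched in software. In the example below a `threading.Lock` stands in for the LOCK signal that keeps the read and the write indivisible; the `Memory` class, `try_acquire` and `release` are hypothetical names, and the convention that 0 means "free" is an assumption made for the example.

```python
import threading


class Memory:
    def __init__(self, size):
        self.words = [0] * size
        # Stands in for the LOCK signal: no other access may interleave.
        self._lock = threading.Lock()

    def swap(self, address, value):
        """Read the old word and write the new one as one indivisible step."""
        with self._lock:
            old = self.words[address]
            self.words[address] = value
            return old


def try_acquire(memory, address):
    """Binary semaphore: 0 means free, 1 means taken (assumed convention)."""
    return memory.swap(address, 1) == 0


def release(memory, address):
    memory.swap(address, 0)
```

Because the old value is returned by the same indivisible operation that stores the new one, two contenders can never both observe the semaphore as free.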

Branch instructions can specify the destination address either absolutely, through a register that holds it, or relative to the program counter (PC-relative), through an immediate offset. Executing a branch always flushes the pipeline registers that precede the execute stage, to prevent unwanted instructions from executing before the jump actually takes effect.
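A PC-relative branch target can be computed as in the sketch below. It assumes the standard ARM-state encoding, which the text does not spell out: a signed 24-bit word offset, and a program counter that, because of prefetching, reads eight bytes ahead of the branch instruction itself. The function name is hypothetical.

```python
def branch_target(instr_address, offset24):
    """Target of a PC-relative branch located at instr_address.

    offset24 is the raw 24-bit field of the instruction (assumed encoding).
    """
    # Sign-extend the 24-bit offset.
    if offset24 & 0x800000:
        offset24 -= 1 << 24
    # The PC is two instructions (8 bytes) ahead of the branch.
    pc = instr_address + 8
    # The offset counts 32-bit words, so it is shifted left by two bits.
    return (pc + (offset24 << 2)) & 0xFFFFFFFF
```

With these assumptions an offset field of 0xFFFFFE (that is, -2 words) makes the branch jump to itself, the classic endless loop.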

The data processing instructions perform the operations listed in Table 1.1. Their syntax and number of operands depend on the operation, and a 4-bit field in the encoding selects the desired operation. They take a register as first operand and, as second operand, either another register or an immediate value, which is stored in a particular way within the instruction encoding. The ARM assembler accepts only those constants that can be obtained by rotating right an 8-bit immediate by twice the amount held in a 4-bit field. Only some 32-bit constants can be encoded this way, among them all the powers of two. The conversion is performed by the barrel shifter, which then supplies the operand value to the ALU for execution of the instruction. The second operand can also be specified as a barrel shifter operation on a register value, and the number of bits to shift or rotate by can be given as an immediate value or through the contents of a register.

Table 1.1. Data processing operations

Mnemonic  Operation
AND       operand1 AND operand2
EOR       operand1 EOR operand2
SUB       operand1 - operand2
RSB       operand2 - operand1
ADD       operand1 + operand2
ADC       operand1 + operand2 + carry
SBC       operand1 - operand2 + carry - 1
RSC       operand2 - operand1 + carry - 1
TST       as AND, but result is not written
TEQ       as EOR, but result is not written
CMP       as SUB, but result is not written
CMN       as ADD, but result is not written
ORR       operand1 OR operand2
MOV       operand2 (operand1 is ignored)
BIC       operand1 AND NOT operand2 (bit clear)
MVN       NOT operand2 (operand1 is ignored)

The desired shift operation is specified with a mnemonic in the assembly syntax; the allowed operations are:

• Logical or arithmetic shift left (LSL or ASL).

• Logical shift right (LSR).

• Arithmetic shift right (ASR).

• Rotate right (ROR).
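The shift operations listed above, together with the rotated-immediate rule for constants, can be sketched as plain functions on 32-bit values. The function names are hypothetical, and for LSL, LSR and ASR the sketch assumes a shift amount between 0 and 31; the special cases of the real barrel shifter are not modeled.

```python
MASK = 0xFFFFFFFF


def lsl(x, n):
    return (x << n) & MASK


def lsr(x, n):
    return (x & MASK) >> n


def asr(x, n):
    x &= MASK
    if x & 0x80000000:      # negative in two's complement: keep the sign
        x -= 1 << 32
    return (x >> n) & MASK


def ror(x, n):
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK


def encode_immediate(value):
    """Return (rotate4, imm8) with ror(imm8, 2 * rotate4) == value, or None."""
    value &= MASK
    for rotate4 in range(16):
        # Rotating left by 2 * rotate4 undoes the rotate right of the encoding.
        imm8 = ror(value, 32 - 2 * rotate4)
        if imm8 <= 0xFF:
            return rotate4, imm8
    return None  # constant not representable as a rotated 8-bit immediate
```

For instance, `encode_immediate(0xFF000000)` yields `(4, 0xFF)`, whereas `encode_immediate(0x101)` returns `None` because its set bits span nine positions. The check `all(encode_immediate(1 << k) for k in range(32))` confirms that every power of two is encodable.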

When the shift amount is given in a register, some special barrel shifter operations can be performed. They exploit redundant encodings, such as all those with a zero amount, and are selected according to the value in the 8 least significant bits of that register. Data processing instructions may or may not request an update of the PSR flags, and whenever the destination register is the program counter (R15) the pipeline is flushed, as if a branch had been executed.

The multiply instructions use a fast 32x8-bit multiplier integrated in the architecture, which computes the product of two 32-bit operands, signed (in two's complement) or unsigned, and delivers a 32- or 64-bit result after at most four machine cycles. The product is computed with the 8-bit Booth algorithm, and the partial sums are accumulated by the ALU in one or two registers. If the most significant bits of the multiplier operand are all ones or all zeros, the instruction takes fewer machine cycles and control passes to the next instruction as soon as the result has been computed.
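The early-termination behavior can be illustrated with a function that estimates how many multiplier cycles are needed for a given multiplier operand. The byte-wise thresholds below follow the rule commonly documented for the ARM7TDMI and are an assumption with respect to the text, which only states the principle; the function name and the `signed` flag are hypothetical.

```python
def multiplier_cycles(rs, signed=True):
    """Number of 8-bit multiplier steps needed for multiplier operand rs.

    Assumed rule: the multiplier stops as soon as the remaining upper bits
    are all zero (or all one, for signed operands).
    """
    rs &= 0xFFFFFFFF
    for cycles, shift in ((1, 8), (2, 16), (3, 24)):
        upper = rs >> shift
        all_ones = (1 << (32 - shift)) - 1
        if upper == 0 or (signed and upper == all_ones):
            return cycles
    return 4
```

Small positive or negative multipliers therefore complete in a single step, while a multiplier that uses all four bytes needs the full four steps.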

Memory access operations transfer data of 32, 16 and 8 bits, with optional sign extension for data types narrower than a word. They support pre-indexed and post-indexed addressing, with an offset, either immediate or held in a register, that is added to or subtracted from the base address held in a register. The base address can be updated through writeback, and the offset can be computed with the barrel shifter operations already seen for data processing, except those in which the shift amount is held in a register. The operations that transfer several registers to or from memory (block data transfer) make it possible to implement stacks in memory and support all the addressing modes just discussed, so that the stack can grow upward or downward in the address space. These operations take a list of registers to transfer, which the assembly syntax expresses very flexibly, and they also allow writeback of the address to the base register.
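The difference between pre-indexed and post-indexed addressing, with and without writeback, can be sketched as follows. The function and parameter names are hypothetical; the sketch assumes, as in the standard ARM single data transfer, that post-indexed accesses always update the base register.

```python
def transfer_address(base, offset, pre_index, up=True, writeback=False):
    """Return (access_address, new_base) for a single data transfer."""
    delta = offset if up else -offset
    indexed = (base + delta) & 0xFFFFFFFF
    if pre_index:
        # The offset is applied before the access; the base register
        # changes only when writeback is requested.
        return indexed, (indexed if writeback else base)
    # Post-indexing: access at the old base, then update the base.
    return base, indexed
```

For example, with `base = 0x100` and `offset = 4`, pre-indexing accesses 0x104 and leaves the base unchanged unless writeback is set, while post-indexing accesses 0x100 and leaves 0x104 in the base register.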

ARM instructions take different amounts of time depending on their type and on how the operands are expressed, while for block register transfers the number of machine cycles depends on the number of registers to transfer.

The ARM processor has a debug interface based on the Boundary Scan standard (IEEE Std. 1149.1-1990), consisting of a chain of multifunction registers (scan cells) placed before the inputs and after the outputs of the core. These registers make it possible to force logic levels on the inputs and to sample the outputs at times programmed through the TAP controller and a set of externally accessible control signals, including a serial input and a serial output. The cells of the chain can also be made transparent to allow normal operation of the system, as if the debug network did not exist. The debug network provides advanced features for monitoring and correcting errors during the development of applications, operating systems and integrated systems that include the ARM7 core. The hardware extensions can halt program execution when a specific instruction is fetched or a particular data item in memory is accessed, and also asynchronously through a debug request signal. Once in debug mode, the internal state of the core can be examined in depth through the JTAG interface, after which normal program execution can be resumed. In addition to the system described, the ARM7TDMI is equipped with the EmbeddedICE (or ICEBreaker) module, a further hardware debug extension consisting of a pair of real-time observation units, which can access the various resources of the processor to check that it operates correctly. This module communicates with the outside world over the same buses used by ordinary coprocessors, using an identifier (CP#) reserved for it.

1.4 The LISATek development environment

The spread of electronic devices into many aspects of everyday life has deeply changed many constraints of industrial production; in particular, the development time of a new product must be as short as possible to guarantee the expected economic return. On the other hand, semiconductor technology has opened new horizons, and the demand for new and more powerful applications has led to ever greater complexity of integrated digital systems. System-on-Chip (SoC) technology makes it possible to build systems composed of several intercommunicating cores on the same chip and, since these techniques require intensive development cycles, designer productivity has become a vital factor for successful products. For these reasons, the idea of implementing through powerful algorithms functions typically performed by integrated systems and DSPs, thus reducing the intrinsic complexity of hardware development, has led to the move from purely hardwired systems to the inclusion of programmable cores within SoCs. This strategy is an innovative approach with respect to existing technologies, known as Application Specific Instruction-set Processor (ASIP) design. Among the software tools used for this purpose, the LISATek toolsuite brings considerable savings in time and resources by automating a number of procedures that until now were carried out essentially by hand. The tools offered by LISATek produce both a hardware implementation of the processor and software development tools such as the simulator and a complete toolchain with a C compiler.

The ASIP development flow consists of the following main phases:

• Architecture exploration.

• Architecture implementation.

• Toolchain creation and production of the software applications.

• System integration and verification.

In the first phase, the algorithms to be run on the processor are analyzed to establish which features the architecture and its execution units must have. This phase requires a software tool that can simulate the behavior of the processor, as well as applications that help define the right profile for both the software and the hardware needed to implement the desired application. LISATek does not provide support tools for exploring the features of a new architecture, but a good basis is often a simple starting architecture with a minimal instruction set, to be modified and improved step by step until it approaches the desired result.

Once the required behavior of the architecture is known, the implementation phase can begin, describing the processor's functionality in hardware description languages such as Verilog or VHDL. In the classical approach this phase is often carried out by hand, and a significant problem is checking consistency between the behavior of the simulator and that of the described hardware.

The creation of the software development tools essentially consists of providing programmers with compilers for high-level languages, assemblers and linkers that generate programs executable on the implemented architecture; in some respects these tools are analogous to those used in the initial exploration phase.

In the system integration and co-simulation phase, the system described in an HDL and the simulator are analyzed and tested together, to verify that they behave identically in every case, i.e. that the model is consistent.

Because of the continuous refinements that may become necessary in any of the development phases, the simulator, the software development tools and the HDL description must be revised repeatedly, while respecting the compatibility constraints between the various levels of abstraction, with all the consequences in terms of time and effort.

The LISATek toolsuite considerably reduces the amount of work needed to develop a new architecture, through an approach that allows it to be redesigned and refined quickly and efficiently by describing both the platform and the instruction set in a single description. The purpose of the LISA description language, i.e. Language for Instruction-Set Architectures, is the automatic generation of synthesizable HDL code and of all the software development tools. It allows a detailed description of the instruction set and of the functional and behavioral model of the processor, including all the sequential aspects of the logic that implements it. The collection of generated tools includes a C compiler, an assembler, a linker and a powerful simulator. As for the hardware description, LISATek can produce both VHDL and Verilog code, as well as configuration files for testing with the best-known HDL simulation and synthesis tools.

An architecture model described in the LISA language consists of the following parts:

• Memory model

• Resource model

• Instruction-set model

• Behavioral model

• Timing model

• Micro-architecture model

The memory model essentially describes the storage elements of the processor, i.e. general-purpose and pipeline registers, RAM, caches, internal signals, flags, buses and all the parameters of these entities. The resource model describes the availability of resources in relation to the use that LISA operations make of the elements of the memory model. It is built by evaluating access times and modes against the scheduling of the LISA operations, and it is also used when generating the HDL description to resolve conflicts that arise in assignments to the internal signals.

The assembler, disassembler and instruction decoder can be generated from the instruction-set model; these aspects of the model are described through two specific sections of the LISA operations, which tie the assembly syntax to its encoding. These sections define the operation codes (opcodes), the operands, the notation for immediate values, and so on.
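The idea that one description of syntax and encoding can drive both the assembler and the disassembler can be sketched outside LISA. The Python example below is not LISA code and does not reproduce the output of the LISATek generators: it uses a single opcode table, ordered as in the standard ARM data processing encoding, and restricts itself to three-register operands, condition "always" and no flag update. All function names are hypothetical.

```python
# Single source of truth shared by the assembler and the disassembler.
# The list index is the 4-bit opcode of the data processing instruction.
OPCODES = ["AND", "EOR", "SUB", "RSB", "ADD", "ADC", "SBC", "RSC",
           "TST", "TEQ", "CMP", "CMN", "ORR", "MOV", "BIC", "MVN"]


def assemble(text):
    """Assemble e.g. 'ADD R0, R1, R2' (register operands, cond AL, no S)."""
    mnemonic, operands = text.split(None, 1)
    rd, rn, rm = (int(r.strip()[1:]) for r in operands.split(","))
    opcode = OPCODES.index(mnemonic.upper())
    # cond[31:28] | opcode[24:21] | Rn[19:16] | Rd[15:12] | Rm[3:0]
    return (0xE << 28) | (opcode << 21) | (rn << 16) | (rd << 12) | rm


def disassemble(word):
    """Inverse of assemble() for the same restricted instruction form."""
    opcode = (word >> 21) & 0xF
    rn, rd, rm = (word >> 16) & 0xF, (word >> 12) & 0xF, word & 0xF
    return f"{OPCODES[opcode]} R{rd}, R{rn}, R{rm}"
```

Because both directions read the same table, adding or renaming an operation cannot leave the two tools inconsistent, which is the benefit the instruction-set model provides.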

The behavioral model describes the behavior of the architecture through the microcode written inside the LISA operations; the code is written as standard C statements, with additional extensions specific to the LISA language.

The timing model describes the behavior of the architecture over time, especially the operation of the pipeline and the scheduling of the LISA operations that execute the assembly instructions. The LISA language provides specific functions that make handling pipeline events simple and effective. Finally, the micro-architecture model describes a hierarchical structure for the HDL hardware description, through a user-defined grouping of the LISA operations that results in separate files being generated for the various parts of the architecture.

To support the description phase in the LISA language, the toolsuite provides the following applications:

• Processor Designer

• Instruction-set Designer

• Syntax Debugger

The Processor Designer is essentially an editor for writing LISA code, although it also manages the various parts of the project, i.e. the files that compose it. Its graphical interface sets the options for the tools that generate the software development tools and the HDL hardware description. The Instruction-set Designer, instead, is a graphical tool dedicated to describing the instruction set and is an alternative to hand-writing the code in the corresponding sections of the LISA files. It is very useful for visualizing the fields of the instruction encoding and can help bring model inconsistencies to light. Finally, the Syntax Debugger runs small portions of assembly code to test the described instruction set, following step by step the decoding process performed by the LISA operations written in the various files.

The hardware description of the architecture is generated by the Processor Generator, which analyzes the LISA files to determine the objects to instantiate, such as processor resources, internal data and control signals, pipeline, memories, and input and output ports. By analyzing how operations are grouped within the individual units, it generates the instruction decoder and the combinational logic that implements the individual execution stages, such as ALU, barrel shifter, multipliers, and so on.

LISATek automatically generates the following tools for software application development:

• Assembler

• Linker

• Disassembler

• C Compiler

• Simulator

The assembler generated by LISATek processes assembly code and turns it into object code to be passed to the linker. The tool is generated from the LISA description of the instruction set given in the SYNTAX and CODING sections. Besides the operations specific to the modeled processor, the generated assembler accepts a set of pseudo-instructions or directives that control the assembly procedure and the initialization of the data and code the architecture will process. The tool also integrates some features typical of macro-assemblers. After assembly, the object code contains a number of symbols, i.e. references to global and/or local routines stored non-contiguously in memory. To obtain a single executable file from the assembled files, these references must be resolved by locating the lines of code to be linked to the main part of the application. This is done by the linker, which uses a specific configuration file in which the user must provide some information about the processor's memory organization. The disassembler does the reverse of the assembler and the linker: it takes an executable file as input and produces an assembly file, which may differ from the source file in its addresses and symbols, since the disassembled file contains absolute rather than relative references. To generate the disassembler, LISATek uses the same information used to create the assembler discussed above.

The Processor Debugger is the principal tool for examining the behavior of the processor: it is a simulator automatically generated in C++ which, through a GUI, makes it possible to monitor various aspects of the internal state of the system, such as registers, pipeline, internal signals, events and also the memories connected to it. The simulator loads the architecture's memory with an object program selected by the user and displays its assembly code, the disassembly and also the LISA microcode that describes the processor. It supports step-by-step monitoring of program execution, through debug commands similar to those of the most common development environments.

The C compiler uses the CoSy Compiler Development System technology, which follows a modular approach based on a compilation engine for parsing and semantic analysis of the input C files, for optimizing the intermediate representation of the compiled code and for generating executable code for the processor. The compiler accepts standard C code and automatically adapts the generated executable code to the characteristics and resources of the target architecture. For this reason, generating the C compiler requires an accurate definition of the processor resources, such as usable registers, data and stack layout specifications, directives for operation scheduling and other definitions and conventions of the programming language. This configuration phase can be carried out with the Compiler Designer included in the toolsuite.

The system integration and verification phase includes the important task of evaluating characteristics of the architecture such as execution timing, silicon area and power consumption, in order to determine which parts of the application must be implemented in hardware and which in software (hardware-software partitioning). The quality of the resulting hardware description must also be verified, so co-simulation techniques and interfaces are very useful for these tests too: they integrate the hardware model and the software simulator in a single test system in which both models are stimulated with the same patterns or the same executable files. The LISATek toolsuite includes the System Integrator Platform, which supports verification and integration of embedded processors, memories and other components in a single system, up to an entire SoC. To integrate both hardware and software components, the application accepts various languages such as VHDL, Verilog and C/C++, and different description formalisms, including those provided by other well-known tools.
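The principle of stimulating both models with the same patterns and comparing their responses can be sketched generically. The function below is a hypothetical harness, unrelated to the actual interface of the System Integrator Platform: `reference_step` and `dut_step` stand for the software simulator and the HDL model, each reduced to a callable that maps one stimulus to one observable response.

```python
def cosimulate(reference_step, dut_step, stimuli):
    """Feed the same stimuli to both models.

    Returns None if every response matches, otherwise a tuple
    (cycle, stimulus, expected, actual) describing the first mismatch.
    """
    for cycle, stimulus in enumerate(stimuli):
        expected = reference_step(stimulus)
        actual = dut_step(stimulus)
        if expected != actual:
            return cycle, stimulus, expected, actual
    return None
```

Reporting the first diverging cycle, rather than a simple pass or fail, is what makes such a harness useful for locating inconsistencies between the two descriptions.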

1.5 The LISA model of the ARM7

The LISA model of the ARM7TDMI was built by writing the LISA files essentially by hand, without relying on graphical support tools such as the Instruction-set Designer, although this tool proved useful, in the first phase of describing the assembly syntax and instruction encoding, for an overall view of the individual fields of the binary encoding. The Syntax Debugger was also used to verify correct instruction decoding, but the Processor Designer and the Processor Debugger were the main tools employed in building the model.

The LISARM model is structured by collecting the LISA operations that describe the behavior of the processor in several files, according to the specific function they implement. A reference file shared by all operations contains a set of definitions of data, constants, masks for arithmetic and logic operations, exception handler addresses and enumerated types in standard C, which make the use of values characteristic of the modeled processor clear and simple.

The functional core of the model is the file main.lisa, which contains the description of all the processor resources and the basic operations executed at every machine cycle. The file defines main memory, the general-purpose registers, the status registers (including the banked versions), the pipeline stages and their registers, the input and output ports, and the various signals and internal variables. The most suitable data type was chosen for each element defined here, with extensive use of CXBit data types, i.e. types predefined by the LISA language that support extraction and assignment of single bits or bit ranges. CXBit data are particularly suited to conversions to other data types and make handling the values they hold flexible.
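The kind of bit-range extraction and assignment that such data types provide can be illustrated with two helper functions on plain integers. These helpers are hypothetical and are not the CXBit interface; they only show the operation being described, here applied to fields of an ARM instruction word.

```python
def get_bits(value, high, low):
    """Extract bits high..low (inclusive) of value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def set_bits(value, high, low, field):
    """Return value with bits high..low replaced by field."""
    width = high - low + 1
    mask = ((1 << width) - 1) << low
    return (value & ~mask) | ((field << low) & mask)
```

For the instruction word 0xE0810002, `get_bits(word, 31, 28)` returns the condition field 0xE and `get_bits(word, 24, 21)` returns the opcode 4.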

The structure of the pipeline was defined by naming its stages, namely PF (prefetch), FE (fetch), DC (decode), EX (execute) and ED (execute-dummy); the first stage is implicit, given that the ARM performs the prefetch operation, while the last stage exists only to allow the use of polling operations. Polling operations are LISA operations that can reactivate themselves in the next machine cycle and are vital for describing instructions that execute over several consecutive clock cycles. For an operation to reactivate itself, however, the pipeline stage following the one the operation belongs to must be stalled, hence the need for this dummy stage, which is in any case not generated in hardware. The pipeline registers are defined once for all stages, although synthesis of the generated HDL automatically removes all elements whose outputs are not connected to anything. They are used essentially to carry the binary encoding of the instruction read from memory to the decode stage and to pass to the execute stage a set of settings that let the datapath carry out the operations required by the instruction being executed. A set of signals for controlling the pipeline and the decode and execute units is then added to the model automatically by the LISATek tools.
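The effect of a multi-cycle operation that keeps reactivating itself while the pipeline is stalled can be illustrated with a toy three-stage simulation. This is a simplified sketch, not LISA semantics: the stage names, the `simulate` function and the representation of an instruction as a `(name, execute_cycles)` pair are all assumptions made for the example, and the dummy stage is not modeled explicitly.

```python
def simulate(program):
    """program: list of (name, execute_cycles) pairs.

    Returns, cycle by cycle, the name of the instruction occupying the
    execute stage (None while the pipeline is still filling).
    """
    fetch = decode = execute = None
    remaining = 0                 # cycles still needed by the EX occupant
    pending = list(program)
    trace = []
    while True:
        if remaining > 1:
            # The operation reactivates itself: EX stays busy and the
            # earlier stages are stalled for this cycle.
            remaining -= 1
        else:
            # Normal cycle: shift every stage forward by one position.
            execute, decode = decode, fetch
            fetch = pending.pop(0) if pending else None
            remaining = execute[1] if execute else 0
            if execute is None and decode is None and fetch is None:
                break             # pipeline drained
        trace.append(execute[0] if execute else None)
    return trace
```

Running `simulate([("ADD", 1), ("MUL", 3), ("SUB", 1)])` shows MUL occupying the execute stage for three consecutive cycles while SUB waits in decode.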

Within the main file, the individual units are then defined; they group the operations to be mapped to hardware, so that the system keeps a certain hierarchy. Each group of instructions created here corresponds to a separate HDL file containing a module (Verilog) or an entity (VHDL) able to execute all the LISA operations defined for it. In the generated model, besides the fetch and prefetch units, there are the instruction decoder, the condition checker and three separate units for the ALU, the barrel shifter and the multiplier.

All instruction-execution operations are grouped according to their function, so that the branch, data processing, memory access and multiply units implement the corresponding functionality in hardware. To initialize the processor correctly, at start-up and whenever an event occurs on the predefined asynchronous reset input, a reset operation clears all registers and sets the status register as required by the ARM7 specifications.

The LISA operation named main performs a set of operations that recur at every machine cycle and handles all pipeline-related events, such as scheduling the prefetch, fetch and decode operations. It controls the execution of the scheduled LISA operations and the shifting of the pipeline register contents, but only when no event requires a stall. This operation also checks the PSR flags for conditional execution and directly drives some output signals for interfacing with the pipeline. Exception handling is a fundamental part of this LISA operation: it monitors the ABORT, IRQ and FIQ inputs and every other event that may require a change of operating mode, consistently with the pipeline state and with any instruction that is executing over several machine cycles.

The prefetch operation, instead, loads instructions from memory and handles branches, using specific functions that the language provides both for memory access and for stalling the pipeline.

The decode operation is the starting point of instruction decoding: it examines the binary encoding of the instruction and triggers the exploration of a tree of possibilities which, step by step, activates further LISA operations whose code (BEHAVIOR section) schedules and sets up the execution of the desired instruction. A portion of the model's decoding tree is shown in Figure 1.4. The same tree is also traversed to generate the assembler, disassembler and C compiler tools. To distinguish the operations that decode instructions from those that describe their behavior within the architecture, the former are named <instruction>_dc. They assign to the pipeline registers information such as the indices of the registers named in the instruction syntax, some flags, and other values needed to configure the ALU, barrel shifter and multiplier correctly. These operations also activate the execution operations (<instruction>_ex) which, subject to pipeline events and to any condition expressed in the syntax, are executed in the following machine cycle.

[Figure: tree rooted at CODING ROOT with the groups multiply_grp (MUL, MLA, MULL, MLAL), data_proc_grp (arith_logic_grp: AND, ORR, EOR, BIC, ADD, SUB, ADC, SBC; mov_grp: MOV, MVN; cmp_grp: CMP, CMN, TST, TEQ), mem_access_grp (std_data_grp: LDR, STR; su_data_grp: LDRH, STRH, LDRSH, LDRSB; block_data_grp: LDM, STM), PSR_access_grp (MRS, MSR), branch_grp (B, BL, BX) and other_grp (SWP, SWI)]

Figure 1.4. Instruction decoding tree
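As a rough illustration of the first levels of such a decoding tree, the C sketch below classifies a 32-bit ARM instruction word into a few of the groups of Figure 1.4 by masking fixed bit fields. It is not the LISA CODING description of the model: the function name decode_group and the enum are hypothetical, the condition field (bits 31..28) is ignored, and any pattern not tested here simply falls into GRP_UNCLASSIFIED.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical group identifiers, loosely following Figure 1.4. */
typedef enum {
    GRP_BRANCH,        /* B, BL, BX            */
    GRP_MULTIPLY,      /* MUL, MLA, MULL, MLAL */
    GRP_OTHER,         /* SWP, SWI             */
    GRP_UNCLASSIFIED   /* everything not handled by this sketch */
} group_t;

/* Classify an ARM instruction word by testing fixed bit patterns. */
group_t decode_group(uint32_t insn)
{
    group_t group = GRP_UNCLASSIFIED;

    if ((insn & 0x0F000000u) == 0x0F000000u) {
        group = GRP_OTHER;                       /* SWI: bits 27..24 = 1111 */
    } else if ((insn & 0x0E000000u) == 0x0A000000u) {
        group = GRP_BRANCH;                      /* B, BL: bits 27..25 = 101 */
    } else if ((insn & 0x0FFFFFF0u) == 0x012FFF10u) {
        group = GRP_BRANCH;                      /* BX */
    } else if ((insn & 0x0FC000F0u) == 0x00000090u) {
        group = GRP_MULTIPLY;                    /* MUL, MLA */
    } else if ((insn & 0x0F8000F0u) == 0x00800090u) {
        group = GRP_MULTIPLY;                    /* long multiplies */
    } else if ((insn & 0x0FB00FF0u) == 0x01000090u) {
        group = GRP_OTHER;                       /* SWP, SWPB */
    }

    assert(group <= GRP_UNCLASSIFIED);
    return group;
}
```

For example, decode_group(0xE12FFF1E), the encoding of BX LR, falls in the branch group, while a data processing word such as 0xE3A00000 is left unclassified by this reduced tree.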

For data processing the model uses a structure similar to that of the original processor, and the description style aims at obtaining the same stand-alone hardware components, namely the ALU, the barrel shifter and the 32x8-bit multiplier. They have their own control signals, and the network connecting them to the data buses mirrors that of the ARM7. The overall structure is shown in Figure 1.5.

[Figure: block diagram; labels: Register File, 32 x 8 Multiplier, Barrel Shifter, A_bus, B_bus, C_flag, carry_out, bs_carry_out]

Figure 1.5. Structure of the execution unit

The barrel shifter is described by a pair of main LISA operations. The first (barrel_shifter_op_dc) decodes the operations requested by the assembly syntax, i.e. arithmetic or logical shift right, shift left, rotate right, and the related amount in bits. The second (barrel_shifter_op_ex) carries out the requested operations using C statements. Some workarounds were needed to adapt the operations implemented by the ARM to those that standard C provides; moreover, since a "for" loop in LISA cannot be repeated a variable number of times (it could not be mapped to hardware), a redundant functional description of the ROR operation is necessary. The barrel shifter also performs the operations with special encodings and those whose amount is held in a register; dedicated flags in the pipeline were provided to configure their execution.
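One possible way to express a rotate right whose amount is known only at run time, while keeping a loop with a constant trip count, is sketched below in plain C. This is an assumption about how such a description could look, not the code of the model: the helper names ror32 and ror33 are hypothetical, each of the five iterations conditionally rotates by a power of two, and the 33-bit result places the shifter carry in bit 32. The carry convention (unchanged when the amount is zero) is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by a run-time amount (0..31) using exactly five stages,
 * so that the loop can be unrolled into fixed multiplexer levels. */
uint32_t ror32(uint32_t value, unsigned amount)
{
    assert(amount < 32u);
    for (unsigned stage = 0u; stage < 5u; ++stage) {
        unsigned step = 1u << stage;              /* 1, 2, 4, 8, 16 */
        if (amount & step) {
            value = (value >> step) | (value << (32u - step));
        }
    }
    return value;
}

/* Same rotation with a 33-bit result: bit 32 carries the shifter carry
 * out, taken from bit 31 of the result when the amount is not zero. */
uint64_t ror33(uint32_t value, unsigned amount, unsigned carry_in)
{
    uint32_t result = ror32(value, amount);
    unsigned carry = (amount != 0u) ? (unsigned)(result >> 31)
                                    : (carry_in & 1u);
    return ((uint64_t)carry << 32) | result;
}
```

Calling ror32(0x12345678, 8) moves the lowest byte to the top of the word; ror33 returns the same 32 bits plus the carry in bit 32.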

The arithmetic-logic unit (ALU) works on two 32-bit operands, one coming from the register file and the other from the barrel shifter, and performs essentially the operations required by the data processing instructions listed in Table 1.1. The operation is selected through a dedicated pipeline register, and the result can be written back to the register file or to the memory address register, if an address is being computed for a data transfer instruction. A set of LISA operations updates the PSR flags according to the kind of operation performed and the result obtained, provided the assembly instruction requests the update. Both the ALU and the barrel shifter use an internal 33-bit format that handles shifts and rotations as well as additions and subtractions without losing important information such as carries and overflow conditions, which could not otherwise be obtained.
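The benefit of a result wider than 32 bits can be seen in the following C sketch, which computes an addition on a 64-bit intermediate (standing in for the 33-bit format) and derives the N, Z, C and V flags from it. The type alu_out_t and the functions alu_add and alu_sub are hypothetical names introduced for illustration; subtraction is modeled as A + NOT(B) + 1, so the carry flag means "no borrow", as on the ARM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for an ALU result and its condition flags. */
typedef struct {
    uint32_t result;
    unsigned n, z, c, v;
} alu_out_t;

/* Add with carry-in; bit 32 of the wide sum is the carry out. */
alu_out_t alu_add(uint32_t a, uint32_t b, unsigned carry_in)
{
    assert(carry_in <= 1u);
    uint64_t wide = (uint64_t)a + (uint64_t)b + carry_in;
    alu_out_t out;
    out.result = (uint32_t)wide;
    out.n = out.result >> 31;
    out.z = (out.result == 0u);
    out.c = (unsigned)((wide >> 32) & 1u);
    /* Signed overflow: operands share a sign that the result does not. */
    out.v = ((~(a ^ b) & (a ^ out.result)) >> 31) & 1u;
    return out;
}

/* Subtraction as A + NOT(B) + 1. */
alu_out_t alu_sub(uint32_t a, uint32_t b)
{
    return alu_add(a, ~b, 1u);
}
```

With a plain 32-bit sum, the carry of 0xFFFFFFFF + 1 would be lost; here it is recovered from bit 32 of the wide value.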

To multiply two 32-bit operands the processor uses a method similar to the ARM's, delegating to the fast 32x8-bit multiplier the product of 8-bit blocks of the multiplier and the whole multiplicand. The LISA language does not allow the work of the multiplier to be described in detail, so the C statements used only express the product of two operands of the above sizes, with the result on a 32-bit bus. How the hardware synthesizer will implement the multiplier is not known in advance; such aspects belong to a development phase that follows the construction of the functional model. The multiplier output is connected to the barrel shifter which, at every cycle, introduces 8-bit shifts so that the ALU can compute the partial sums. After four cycles these operations yield the result of the multiplication, whether signed or unsigned.
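The iteration described above can be sketched in C as four passes through a 32x8 product, a shift by a multiple of 8 bits and an accumulation. The function names mul32x8 and mul_4cycles are hypothetical, and only the low 32 bits of the product are modeled, which are the same for signed and unsigned operands; the long multiplies of the real instruction set are outside this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* 32x8 product truncated to 32 bits, as a stand-in for the multiplier. */
uint32_t mul32x8(uint32_t multiplicand, uint32_t byte)
{
    assert(byte <= 0xFFu);
    return multiplicand * byte;
}

/* One byte of the multiplier per cycle: multiply, shift, accumulate. */
uint32_t mul_4cycles(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t acc = 0u;
    for (unsigned cycle = 0u; cycle < 4u; ++cycle) {
        uint32_t byte = (multiplier >> (8u * cycle)) & 0xFFu;
        uint32_t partial = mul32x8(multiplicand, byte);
        acc += partial << (8u * cycle);   /* barrel shifter, then ALU add */
    }
    return acc;
}
```

The result of mul_4cycles(a, b) matches the low word of a * b computed directly.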

To make the code easier to write and understand, a set of simple operations was used to convert immediate values of different bit widths, register indices for addressing the register file, values held in registers, and the register list for the stack management operations. All these operations favor reuse of the LISA code and therefore of the hardware resources. They are grouped in a single unit so that their functions can be identified in the generated HDL. Among them there is also an important operation that sets up the datapath for converting operands expressed in the immed8_r format, as discussed in section 1.3, and the write_result function, which writes the ALU result to the register file or to the memory access register, depending on the type of instruction being executed.

The remaining operations essentially describe the behavior of the instructions, in compliance with the ARM7TDMI data sheet [16] as regards accepted syntax, encoding and execution timing. As mentioned, for each group of operations a first operation decodes the binary encoding, saves in the pipeline the information that configures the execution units, and schedules the operations for the following clock cycle.

The branch operations (BX, B, BL) use the datapath differently, to compute the branch target address and, if required, to save the subroutine return address. They flush the pipeline to avoid unwanted system behavior and, if the link is requested, use a polling operation to correct the saved memory address and write it back to the dedicated register.

Each data processing operation performs a specific ALU operation, configuring the ALU through an <opcode>_dc operation, activated to decode it, and converting the second operand as required. If the second operand requires a barrel shifter operation whose amount is held in a register, a polling operation is used so that the register is accessed in a first machine cycle and the ALU operation takes place in the next one. Since the program counter (R15) is also accepted as destination register, when the writeback targets it the operations of the group schedule a pipeline flush.

As mentioned, the processor can multiply signed or unsigned operands and write the result to one or two designated registers. The operation uses the 32x8 multiplier for a number of machine cycles that depends on how many groups of 8 bits, all zeros or all ones, are present in the most significant part of the multiplier operand. The reason why cycles can be saved is clear: if one, two or three such groups of 8 identical bits are present, at most a simple addition is needed to complete the computation. The LISA operations that describe the multiplications simply mask the most significant bit groups, compare them to check whether they all have the same value, and decide which operation to perform; they rely on the usual polling features and stall the pipeline to extend the number of execution cycles.
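The masking and comparison just described can be sketched in C as follows. The function name mul_cycles is hypothetical and the returned value is only the number of passes through the 32x8 multiplier (1 to 4) implied by the rule in the text; any additional cycles required by accumulate or long variants are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Number of multiplier passes needed for a given multiplier operand:
 * m passes suffice when bits 31..8*m are all zeros or all ones. */
unsigned mul_cycles(uint32_t multiplier)
{
    unsigned cycles = 4u;                          /* worst case */
    for (unsigned m = 3u; m >= 1u; --m) {          /* constant trip count */
        uint32_t mask = 0xFFFFFFFFu << (8u * m);   /* upper bit groups */
        uint32_t top = multiplier & mask;
        if (top == 0u || top == mask) {
            cycles = m;                            /* keep the smallest m */
        }
    }
    assert(cycles >= 1u && cycles <= 4u);
    return cycles;
}
```

A small positive or negative multiplier such as 0x000000FF or 0xFFFFFF80 therefore needs a single pass, while 0x12345678 needs all four.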

The operations that access and modify the PSRs act on the current or the saved processor state, according to the current execution mode, and allow even a partial change of their content through particularly flexible LISA functions. The dedicated LISA operations convert the immediate values expressed in the encoding or in the source registers and apply the required masks so that the individual bits of the status registers are assigned correctly.

The memory access instructions that transfer single data items use several LISA operations to decode and set up the execution units, which in the first machine cycle must compute the address for accessing the external resource. For this purpose, operations were defined to decode the PC-relative, pre-indexed and post-indexed addressing modes provided by the ARM assembly. Immediate offset values expressed in the immed8_r format are converted with the same operations used by the data processing operations, thereby reusing them. These instructions necessarily take several consecutive machine cycles and use the polling mechanisms, scheduling a number of stall cycles suited to the operation: one for a write and two for a read. If writeback of the computed address to the base register is requested, it takes place during the second cycle for both writes and reads, as the ARM specifications require. For writes, the necessary signals are set up in the first execution cycle and the memory access occurs in the machine cycle after the one in which the address is computed. For a register load, the address of the location is issued in the second cycle, and the data supplied by the memory is sampled and written to the register file in the third machine cycle. If the destination register of an LDR instruction is the program counter, a pipeline flush is performed as for the branch operations.

The operations of this group drive the memory interface signals directly and request memory access through a LISA function similar to the one used for instruction prefetch. For instructions that transfer byte or half-word data, a signal is added to indicate which byte or half-word must be transferred, so that the model interface can be adapted to the ARM specifications by an external unit described in the next section (1.6).
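The difference between the pre-indexed and post-indexed addressing modes, and the effect of base register writeback, can be summarized by the small C sketch below. The type transfer_addr_t and the function transfer_address are hypothetical; the three flags loosely correspond to the pre/post, up/down and writeback choices of the ARM single data transfer instructions, and post-indexed accesses are assumed to always update the base register.

```c
#include <assert.h>
#include <stdint.h>

/* Address sent to memory and value left in the base register. */
typedef struct {
    uint32_t access_address;
    uint32_t new_base;
} transfer_addr_t;

transfer_addr_t transfer_address(uint32_t base, uint32_t offset,
                                 int pre_indexed, int add_offset,
                                 int writeback)
{
    assert(pre_indexed == 0 || pre_indexed == 1);
    uint32_t indexed = add_offset ? base + offset : base - offset;
    transfer_addr_t out;
    if (pre_indexed) {
        out.access_address = indexed;              /* offset applied first */
        out.new_base = writeback ? indexed : base;
    } else {
        out.access_address = base;                 /* offset applied after */
        out.new_base = indexed;
    }
    return out;
}
```

For a base of 0x1000 and an offset of 4, the pre-indexed form accesses 0x1004, whereas the post-indexed form accesses 0x1000 and leaves 0x1004 in the base register.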

The stack management instructions perform many of the operations already discussed for the memory access instructions, but use a special 16-bit internal global register holding one flag for each general register of the processor. The LISA decode operation assigns to this register the register list contained in the encoding, and during execution as many stall cycles are performed as there are bits set to one, so that the transfer proceeds step by step. After a first cycle in which the memory access address is loaded from the base register, a single data item is transferred, the next register to transfer is looked up in the flag register, and a further stall cycle is scheduled if the list is not empty.
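The step-by-step consumption of the register list can be sketched in C as below. The names next_register and stm_ia are hypothetical, the memory is reduced to a word array, and the example models only a store in increment-after order, in which the lowest-numbered register goes to the lowest address; each loop iteration stands for one transfer cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit of a non-empty 16-bit register list,
 * found with a scan of constant length. */
unsigned next_register(uint16_t list)
{
    assert(list != 0u);
    unsigned index = 0u;
    for (unsigned r = 16u; r-- > 0u; ) {
        if (list & (1u << r)) {
            index = r;              /* last hit is the lowest register */
        }
    }
    return index;
}

/* Store the listed registers to consecutive words; returns the number
 * of transfer cycles, i.e. the number of bits set in the list. */
unsigned stm_ia(uint16_t list, const uint32_t regs[16], uint32_t *mem)
{
    unsigned transfers = 0u;
    while (list != 0u) {                        /* one register per cycle */
        unsigned r = next_register(list);
        mem[transfers++] = regs[r];
        list &= (uint16_t)~(1u << r);           /* clear the served flag */
    }
    return transfers;
}
```

With the list 0x8005 (R0, R2 and R15), three transfers are performed and the flag register is empty afterwards.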

The data swap instruction is described by a polling operation that essentially reuses the operations of a memory write followed by a read; the only difference is that it drives the LOCK signal to tell the memory management system that access to the resource must be denied to other peripherals until the processor has finished executing the instruction.

The undefined instruction starts the handshaking procedure with the attached coprocessors; its LISA code directly checks the dedicated signals to decide whether to change the processor operating mode, save the current fetch address and load the address of the corresponding handler into the program counter. This too is an operation executed over several cycles, since it must wait for a possible reply from a coprocessor that recognizes the instruction as its own; the undefined trap is taken only if no coprocessor recognizes the instruction.

The software interrupt instruction, instead, enters supervisor mode directly and, like the previous one, saves the program counter and immediately assigns it the address of the exception handler.

Whenever an instruction requires a pipeline flush, two machine cycles elapse in which the processor performs no operation, so that the registers are refilled automatically and a new instruction flows first into the decode stage and then into the execute stage.

1.6 Development tools for ARM7

The ARM7TDMI model was built in two main phases: the creation of an instruction-accurate model and its extension into a second, cycle-accurate model. The first phase essentially described the syntax and encoding of the instructions, together with the operations needed to control and execute the computations, without reference to the timing of execution in the pipeline or over several machine cycles. Only in the second phase was the code distributed over the pipeline stages and the processor behavior made to comply with the execution timing given in the ARM specifications. In both phases the simulator (Processor Debugger) proved vital for checking the consistency and correct operation of every part of the processor. Thanks to the automatically generated assembler, linker and disassembler, and by writing suitable source files, the resources and functions of the model could be analyzed in depth, from the first steps up to the complete processor.


Because of a limitation of the LISA language and of the LISATek toolsuite, the memory interface does not exactly match the ARM specifications, especially as regards the addressing of single bytes. Although the memory model was described so that data and instructions are loaded as 32-bit words made of 8-bit sub-blocks, for the processor to load instructions correctly the program counter can only be incremented by one unit at a time, and this does not allow 8- and 16-bit data to be accessed. To work around the problem, an external wrapper interfaces with the signals of the LISA model; in particular, a signal was added to select the byte or half-word to transfer (BS, byte select), which simply exposes the two least significant bits of the internal program counter. Using this information, the wrapper takes the address of the memory location to access from the address bus and shifts it left by 2 bits to insert the bits of the BS signal; the result is an address that lets the memory management unit point to the single byte. So that in byte and half-word writes the data is correctly replicated on the other slots of the same size, as the ARM7 requires, the wrapper connects the individual lines of the data bus selectively, depending on the values of the BS and MAS signals. In this way the instructions accessing byte or half-word data also operate correctly, complying with the ARM memory interface specifications. The memory wrapper can also provide support for the big-endian memory organization, otherwise not handled by the LISA model, by suitably swapping the individual bytes read from memory or on the processor data bus.

[Figure: block diagram; labels: RAM, Memory Wrapper, data_bus, address_bus, MCLK, nMREQ, SEQ, nRW, MAS (2 bits), BS (2 bits)]

Figure 1.6. Block diagram of the memory wrapper
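The two tasks of the wrapper described above, building a byte address and replicating narrow write data, can be sketched in C as follows. The function names are hypothetical, and the MAS encoding used here (0 for byte, 1 for half-word, 2 for word) is assumed to follow the ARM7TDMI convention; the selective lane wiring of the real wrapper is reduced to a plain replication.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed MAS encoding of the transfer size. */
enum { MAS_BYTE = 0, MAS_HALFWORD = 1, MAS_WORD = 2 };

/* Word address from the model, shifted left by 2, with BS inserted. */
uint32_t wrapper_byte_address(uint32_t word_address, unsigned bs)
{
    assert(bs <= 3u);
    return (word_address << 2) | bs;
}

/* Replicate a byte on all four lanes, or a half-word on both halves. */
uint32_t wrapper_write_data(uint32_t data, unsigned mas)
{
    switch (mas) {
    case MAS_BYTE:
        return (data & 0xFFu) * 0x01010101u;
    case MAS_HALFWORD:
        return (data & 0xFFFFu) * 0x00010001u;
    default:
        return data;                 /* word transfers pass through */
    }
}
```

For instance, word address 0x10 with BS equal to 3 becomes byte address 0x43, and a byte write of 0xAB drives 0xABABABAB on the data bus.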

The spread of the ARM7TDMI processor in the market of portable devices, and above all of microcontrollers, has made various software development tools available, such as C/C++ compilers and assemblers, including ones not tied to ARM Ltd. Since the model built is compatible with the original core, these tools can be used to generate executable files, thanks to the identical instruction encoding. Assembly files written for ARM, on the other hand, must be modified before they can be used with the generated processor, because of some differences in the particular methods the ARM assembler uses to encode 32-bit constants and the register lists of the stack management instructions. To overcome these problems a tool was written in C that analyzes the assembly file generated for ARM, finds the instructions that use immediate values and the block data transfer instructions, and substitutes the syntax specific to LISARM. The pre-assembly tool reads from the source file the value of the constant to convert and loads it into a 32-bit data type, on which it applies a series of masks over groups of 8 contiguous bits to determine whether the constant can be expressed as an 8-bit immediate value rotated right by a certain amount. If so, the number of bits by which the value must be rotated is deduced from the mask position, and the two data replace the constant in the original assembly file as two immediate values, one expressible on eight bits and the other on four. Some operations, namely ADD and SUB, ADC and SBC, AND and BIC, MOV and MVN, CMP and CMN, operate on one's- or two's-complemented data; if the opposite of the given constant can be expressed in the immed8_r format, the conversion is performed and the operations are swapped, as the original ARM assembler does [19]. If the conversion cannot be performed, the tool reports an error.

The LDM and STM instructions accept a list of registers to transfer expressed explicitly, register by register, or through groups of registers delimited by two register specifiers separated by a hyphen "-". Since such a feature cannot be implemented directly in the assembler generated by LISATek, the tool turns a list containing register groups into an explicit list, with the names of the registers involved separated by commas.
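The constant conversion performed by the pre-assembly tool can be approximated by the C sketch below. It is not the code of the tool: the function names encode_immed8_r and encode_or_swap are hypothetical, and the search rotates the constant left by even amounts instead of sliding a mask, which is an equivalent way of checking whether the value equals an 8-bit immediate rotated right by twice a 4-bit amount. The use_negation flag selects the two's complement (as for ADD/SUB and CMP/CMN) instead of the one's complement (as for MOV/MVN and AND/BIC) when the opposite constant is tried.

```c
#include <assert.h>
#include <stdint.h>

/* Try to write value as imm8 rotated right by 2*rot4. Returns 1 and
 * fills the outputs on success, 0 if no encoding exists. */
int encode_immed8_r(uint32_t value, unsigned *imm8, unsigned *rot4)
{
    assert(imm8 && rot4);
    for (unsigned rot = 0u; rot < 16u; ++rot) {
        unsigned left = 2u * rot;           /* undo a right rotation */
        uint32_t candidate = left
            ? ((value << left) | (value >> (32u - left)))
            : value;
        if (candidate <= 0xFFu) {
            *imm8 = candidate;
            *rot4 = rot;
            return 1;
        }
    }
    return 0;
}

/* Try the constant itself, then its opposite; *swapped tells the caller
 * to replace the mnemonic with its counterpart (e.g. MOV with MVN). */
int encode_or_swap(uint32_t value, int use_negation,
                   unsigned *imm8, unsigned *rot4, int *swapped)
{
    assert(swapped);
    *swapped = 0;
    if (encode_immed8_r(value, imm8, rot4)) {
        return 1;
    }
    uint32_t opposite = use_negation ? (0u - value) : ~value;
    if (encode_immed8_r(opposite, imm8, rot4)) {
        *swapped = 1;
        return 1;
    }
    return 0;                               /* the tool reports an error */
}
```

As an example, 0xFF000000 is found as the immediate 0xFF with a rotation field of 4, while 0xFFFFFF00 is only expressible through its complement 0xFF, so a MOV would be turned into an MVN.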

The VHDL produced by the HDL Generator was partially verified with the ModelSim simulator, for which the tool itself generates the configuration files and the memory dump file containing the test program to load. Some preliminary tests uncovered malfunctions in the described model and led to the necessary revisions. The HDL generation tool, in turn, through numerous compilations of the LISA description and thanks to the errors and warnings it reported, helped adapt the description style to a view closer to the hardware architecture; such inconsistencies stem from the different abstraction levels that coexist within a model described in the LISA language.


[Figure: flow diagram; labels: C files, C libraries, C-compiler, assembly files, ARM assembler, LISARM pre-assembler, LISARM assembler, binary file, LISARM disassembler, LISARM post-disassembler, disassembly file]

Figure 1.7. Diagram of the complete LISARM toolchain

1.7 Conclusions and future developments

The LISATek toolsuite and the LISA language showed their great potential throughout the development of the model, above all because they made it possible to concentrate on different sets of aspects in the first phase, the description of the instruction-accurate model, and in the following one, for the cycle-accurate model. The mechanisms for describing the instruction set are remarkable: they allowed the complexity of the problem to be divided and reduced through a number of sub-phases, also from the point of view of timing and pipelined execution. Although the description style has to be adapted to aspects specific to the hardware level and to HDL description languages, the development tools produced a hierarchy of the generated hardware that allows further optimization by the synthesizer, replacing parts of the datapath with efficient library architectures.

Keeping the structure of the model built, an architecture suited to other specific applications can be obtained simply by modifying the existing instruction set. Thanks to the flexibility of the LISA description and to the opportunities offered by LISATek, the synthesizable HDL description, an efficient simulator and the various software development tools for a new architecture can be produced with considerable savings of time and effort. In the described model the coprocessor interface was ignored, since the intention was not so much to create an ARM7 clone as to obtain an extensible processor, able to execute specific instructions internally by incorporating new functionality. To improve code density, instead, the Thumb instruction set could also be implemented, exploiting both the capabilities that the LISA language already offers and the idea used by the ARM7 processor of dynamically decoding 16-bit instructions into instructions of the full set. Some limits of the current model, such as memory interfacing and addressing or the flexible handling of communication buses, could be overcome by future versions of the toolsuite, making the model fully compatible with the ARM specifications. Another valid idea would be to turn the current Von Neumann architecture into a Harvard architecture, obtaining better execution performance without re-engineering the whole internal structure. With the LISA language, in fact, a stage can easily be added to the pipeline and the memory access operations moved into it with great simplicity. An intensive test phase must be carried out on the resulting model to guarantee that every part works in any operating condition; to this end, the LISATek co-simulation tool allows the simulator and the HDL description to be verified at the same time, also reusing the patterns produced while developing the model.


Chapter 2

The RISC microprocessor architecture

This chapter introduces some computer architecture concepts and outlines the historical evolution of computers. Starting from fundamental approaches such as the Von Neumann and Harvard architectures, it examines the growing complexity of processors, underlining the reasons behind some manufacturing trends and marketing strategies, up to the revolution introduced by RISC architectures. Particular attention is given to the latter design approach, in order to understand some of the reasons that have made this architecture the leader in the embedded core and mobile device markets. Some important improvements brought by pipelining, cache technologies and parallel execution are also discussed. At the end of the chapter, some of the most recent projects are presented and the various modern design approaches are compared.

2.1 The Von Neumann architecture

Since the first steps in computer modeling, the simplest structure used has been the Von Neumann architecture [1], also called the stored-program architecture; it is a very simple design that uses a single storage structure to hold both data and code. The Von Neumann architecture revolutionized the concept of the computer by introducing the idea of the instruction set architecture (ISA). Before this formalization, the available computers worked with a fixed program, executing a sequence of unchangeable operations on the data, and these operations were not specified by opcodes¹. With the introduction of instruction set architectures and of the processor programming model, opcodes, native data types, processor resource references and addressing modes were formalized via instructions written in machine language. The sequence of instructions describes step by step the operations the processor has to perform to produce the desired computation; these instructions represent the program to be executed. The Von Neumann architecture is more flexible than its predecessors: the program is stored in memory and can be modified to meet the user's needs without redesigning or restructuring the processor architecture.

¹ An opcode is the portion of a machine language instruction that specifies the operation to be performed; the term stands for "operation code".

Since instructions can be treated as data, a program running on a Von Neumann machine can also modify itself. This characteristic was useful on early platforms that did not support memory addressing via index registers or indirect addressing techniques. Self-modifying programs are deprecated today, because such applications are very difficult to debug and perform poorly on pipelined and cached processors (the Lisp² high-level language is an important exception, however). Self-modifying programs, moreover, can have harmful effects on the whole system, because in case of errors the program can damage itself, other programs stored in memory and even some operating system procedures. Another risk comes from malware, which tries to break software structures in order to modify other programs and data, compromise files stored on the system or crash it. To avoid these effects, several memory protection techniques are implemented, so that access to memory locations assigned to programs or operating system procedures is strictly controlled and unauthorized attempts to modify data or code are blocked.

Another drawback of this architecture is the Von Neumann bottleneck, a term that refers to the gap between the throughput of the CPU and the transfer rate of the memory system, due to the different technologies used to implement them. The separation between the CPU and the memory system is implicit in the Von Neumann architecture, and progress in integration technologies allows CPUs to reach very high computation speeds. As a consequence, loads and stores from and to memory would have to be performed at an equally high transfer rate. On the other hand, since the memory system has to store a huge amount of data and code, improving its timing performance would increase the cost of the whole system. In certain applications, when the processor executes a limited number of instructions on a large amount of data, the memory access bottleneck seriously reduces processing speed. Increasing CPU speeds and the demand for large memories for programs and data have historically made the Von Neumann bottleneck a substantial problem. To alleviate it, the most advanced architectures use improved caching techniques, as described in the following sections.

² LISP stands for LISt Processor; it is a high-level language often used in artificial intelligence projects.



[Figure: block diagram; labels: INPUT, OUTPUT, MAIN MEMORY, CONTROL UNIT, ALU, accumulator]

Figure 2.1. Von Neumann architecture model

2.2 Harvard architecture

In contrast with the Von Neumann architecture, the Harvard architecture uses separate storage and signal pathways for data and code. This type of architecture was introduced in the Harvard Mark I relay-based computer, which stored the instructions to be executed on punched tape and the data in relay latches. In a Harvard architecture the data and code memories are not shared, so they can differ in word width, timing, implementation technology, addressing logic and structure. This feature suits systems in which code and data differ considerably in word width, or in which the sizes of the two memories influence the addressing modes. Moreover, while the data memory is usually a random access memory, the code memory can be read-only. In a Harvard architecture the CPU can read an instruction from the code memory while reading from or writing to the data memory, and can thus fetch the next instruction while another operation completes. This behavior makes the Harvard architecture faster than the Von Neumann structure but, obviously, the cost in terms of area and architectural complexity increases.

This architecture also suffers from the gap between CPU speed and main memory timings: if a program needs to access the memory at every clock cycle, the achievable throughput approaches the memory speed, so the CPU performance cannot be fully exploited. The use of caching techniques improves the overall performance in this case too, as reported in section 2.5.



Figure 2.2. Harvard architecture model

2.3 The increased processor complexity

Microprocessor architecture complexity increased steadily for many years; this allowed manufacturers to conquer the market by offering high-performance machines while maintaining, as far as possible, a cost-effective tradeoff. One of the first reasons for the increased architecture complexity was the noteworthy speed difference between CPUs and the available memories [2]. When the throughput of CPUs became ten times higher than that of the main memory, the memory access issue was tackled by increasing CPU capabilities. Many “higher-level” operations, such as floating point subroutines, were included in the instruction sets of many commercial processors, so some primitives previously implemented as subroutines became instructions, with dramatic gains in computation speed.

Another aspect that made the increase in CPU complexity more cost-effective is the evolution of integration technologies, which gave microprogramming an advantage over hardwired control logic. Microprogramming techniques are an alternative solution to reduce the complexity of the integrated logic. They make use of internal instructions, called micro-instructions, which are stored directly in the control unit, so that a machine instruction is translated into a set of simpler operations executed by the architecture over several micro-steps. Small integrated memories made it possible to store the microcode directly in the control unit and, because of some technology process aspects, this set of micro-instructions could often be expanded with zero or very little overhead and cost. This trend also caused the integration of some capabilities previously delegated to external subroutines, such as string editing, integer-to-floating point conversions and other data conversions.

Due to the high cost of memories, one of the main design constraints was to have very compact programs, and this is another reason for the growth in instruction set complexity. In fact, very complex instruction sets were considered the optimal solution to obtain a high code density. Attempting to obtain code density by increasing the complexity of the instruction set is an arduous task, because supporting a large number of instructions and addressing modes means more bits to represent them and so more memory to store programs. For these reasons code compaction could only be obtained by cleaning up the instruction set, instead of increasing the information to be encoded. In any case, the cost of increasing the available memory is often far lower than that of introducing architectural innovations in CPUs, and the use of larger PLAs³ also reduces performance, due to decoding delays.

In terms of marketing strategies, the upward compatibility needed to guarantee software compatibility with newer machines has led to further complexity. A new architecture has to be completely compatible with its predecessors’ machine language, so the older instructions and addressing modes must not be removed from the instruction set. The concept of computer family was introduced as a method to guarantee the common feature of running the same software; this was implemented by using hardware or microprogram emulation on some processors. Another way to improve the design characteristics was to add new features, which increased both the number of instructions and their complexity.

The increasing popularity of high level languages (HLL) led to new complex instructions, in an attempt to bridge the semantic gap between the capabilities of the processors and the computation requirements of single HLL instructions.

Due to the introduction of multiprogramming techniques and time sharing between many processes, processors need not only to implement interrupts, halting the execution of processes and resuming it at a later time, but also to provide different operating modes to protect process execution. Memory management and paging require particular functionalities to be added to processors; in fact, an instruction can be halted before its completion and then restarted after some memory operations have been executed.

The use of complex instruction sets and addressing modes increased the amount of information to be saved at every interrupt; this enlarged both the number of shadow registers and the microcode to be executed for interrupt management. When building real-time computing systems, the worst-case response time must be guaranteed. This requirement can be fulfilled by reducing the CPU interrupt latency. This issue is critical not only for Digital Signal Processors (DSP), but also for microcontrollers. In fact, automatic control applications need to run real-time routines, so the maximum response times must be ensured. This is accomplished by freezing the internal state of the processor and serving interrupt requests as soon as possible. However, as detailed in the previous paragraph, with this approach the number of registers increases according to the operation complexity, particularly to store the intermediate states of microcode execution.

³ Programmable Logic Array: a programmable device used to implement combinational logic circuits, in particular the instruction decoder within most processors.

Other reasons for the growth in instruction set complexity come from computer programmers who use assembly language: they want a CPU to support a full-featured instruction set (e.g. an orthogonal⁴ instruction set).

Some further considerations can be made about technology aspects. Improving the performance of a complex processor by introducing architectural innovations takes a long time, and market trends have to take into account the design time of a new product, especially the time-to-market and time-to-volume parameters, but also the possibilities offered by the most recent semiconductor technology. Long design times could lead to a product whose target technology is months or years old, instead of one that pioneers a new technology.

2.4 The RISC architecture

RISC stands for Reduced Instruction Set Computer and refers to a simpler platform with respect to Complex Instruction Set Computer (CISC) processors. The term RISC is in antithesis with the term CISC, used since then to refer to more complex platforms implementing instruction sets with many operations and several addressing modes, as seen in paragraph 2.3. The fundamental feature of this approach is to execute all operations only between the processor registers, accessing the memory exclusively through a pair of operations, load and store, which move data between memory and registers. For this reason the equivalent name of load-store architecture is frequently used to refer to RISC processors [4]. Other features common to RISC architectures are:

• Uniform instruction encoding: all the instructions are expressed with the same number of bits and the opcode (the bit field representing a unique operation) is always in the same bit position in each instruction; this approach allows fast decoding.

• A homogeneous register set, allowing any register to be used in any context and simplifying compiler design (although there are almost always separate integer and floating point register files).

• Simple addressing modes (complex addressing modes are replaced by sequences of simple arithmetic instructions for the address calculation).

• Few data types supported in hardware (for example, some CISC machines had instructions for dealing with byte strings, others had support for polynomials and complex numbers); a RISC machine reduces the number of native data types as much as possible.

⁴ A computer’s instruction set is said to be orthogonal if any instruction can use data of any type via any addressing mode; programming is simpler, but complexity increases, as every addressing mode must be supported by every instruction.
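The fast decoding allowed by uniform instruction encoding can be illustrated with a minimal sketch. The 32-bit layout used here (a 6-bit opcode followed by three 5-bit register fields) is a hypothetical format chosen only for illustration, not the encoding of any specific processor; the function names `encode` and `decode` are likewise assumptions.

```python
# Hypothetical 32-bit fixed-width format (an assumption for illustration):
#   bits 31..26 opcode | 25..21 rd | 20..16 rs1 | 15..11 rs2 | 10..0 unused
OPCODE_SHIFT, OPCODE_MASK = 26, 0x3F
RD_SHIFT, RS1_SHIFT, RS2_SHIFT, REG_MASK = 21, 16, 11, 0x1F


def encode(opcode, rd, rs1, rs2):
    """Pack the four fields into one 32-bit instruction word."""
    return (((opcode & OPCODE_MASK) << OPCODE_SHIFT)
            | ((rd & REG_MASK) << RD_SHIFT)
            | ((rs1 & REG_MASK) << RS1_SHIFT)
            | ((rs2 & REG_MASK) << RS2_SHIFT))


def decode(word):
    """Extract every field with a fixed shift and mask.

    The fields never move, so no field position depends on the value of
    another field; hardware could extract all of them in parallel.
    """
    return {
        "opcode": (word >> OPCODE_SHIFT) & OPCODE_MASK,
        "rd": (word >> RD_SHIFT) & REG_MASK,
        "rs1": (word >> RS1_SHIFT) & REG_MASK,
        "rs2": (word >> RS2_SHIFT) & REG_MASK,
    }


word = encode(opcode=0x01, rd=3, rs1=4, rs2=5)
fields = decode(word)
print(hex(word), fields)
```

With a variable-length encoding, by contrast, the decoder would first have to inspect the opcode to learn where the remaining fields start.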

The RISC CPU design philosophy was inspired by some observations about the real usefulness of many features included in previous processors, which appeared to be quite overdesigned and so less cost-effective than simpler approaches. While some of the initial assumptions about the throughput difference between the memory and the CPU have changed, due to the introduction of faster semiconductor memories and cache memory technology, the increasing complexity introduced other side effects.

On particular processors, the execution of some complex instructions required more time than the equivalent sequence of simple instructions performing the same final set of operations, which revealed several architectural limits. Moreover, complex architectures have the problem of debugging, particularly for the microprogrammed control. To allow corrections to parts of the microcode, some manufacturers implemented rewritable microprogram areas within the processors; this allows maintenance even when the processors are already in the field, by distributing firmware updates to customers. Another solution adopted by processor producers was to place, beside the rewritable microprogram storage, an FPGA⁵ to patch parts of the architecture as well, and not only the microcode.

In the late ’70s some researchers, from various computer companies, demonstrated that the majority of the addressing modes implemented in orthogonal instruction sets were unused by most programs. This was a side effect of the increasing use of HLLs and compilers to generate programs, as opposed to writing them in assembly. In fact, the compilers used at the time had only a limited ability to take advantage of the features provided by complex instruction sets. The market was clearly moving to an even wider use of HLLs and compilers for software development, diluting the usefulness of orthogonal instruction sets even more. Moreover, since these operations were rarely used, they tended to be slower than simpler operations.

Another part of RISC design came from practical measurements on real-world programs. Some researchers showed that 98% of all the constants in a program would fit in just 13 bits, yet almost every CPU design dedicated some multiple of 8 bits to store them, typically 8, 16 or an entire 32-bit word. This fact suggests that a machine should allow constants to be stored in unused bits of the instruction itself, decreasing the number of memory accesses.
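The idea of storing constants in spare instruction bits can be sketched as a simple range check. The sketch assumes that the 13-bit field is a signed two's complement value, which the text does not state; the function name `fits_in_immediate` and the list of constants are made up for illustration and are not measured data.

```python
def fits_in_immediate(value, bits=13):
    """Return True if value fits a signed two's complement field of `bits` bits."""
    low = -(1 << (bits - 1))       # -4096 for 13 bits
    high = (1 << (bits - 1)) - 1   # +4095 for 13 bits
    return low <= value <= high


# Hypothetical constants, chosen only to exercise both cases.
constants = [0, 1, -1, 255, 4095, 4096, -4096, 100000]

# Constants that fit travel inside the instruction word; the others would
# need a separate load from memory or a multi-instruction sequence.
inline = [c for c in constants if fits_in_immediate(c)]
needs_memory = [c for c in constants if not fits_in_immediate(c)]
print("inline:", inline)
print("needs memory:", needs_memory)
```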

Since real-world programs spend most of their time executing very simple operations, some researchers decided to focus on making those common operations as simple and fast as possible. Since the clock rate of the CPU is limited by the time it takes to execute the slowest instruction, speeding up that instruction (e.g. by reducing the number of addressing modes it supports) also speeds up the execution of every other instruction. The goal of RISC architectures was to make instructions so simple that each one could be executed in a single clock cycle. Code was implemented as a sequence of these simple instructions, instead of single complex instructions. This led to the possibility of inserting data within an instruction, reducing the need to use registers or memory. However, since a series of instructions is needed to complete even simple tasks, the total number of instructions read from memory is larger, and therefore fetching them takes longer.

⁵ Field Programmable Gate Array

Most of the designs that made the history of RISC were the results of university research programs on VLSI technologies started in the early ’80s. The first noteworthy study was Berkeley’s RISC project, directed by David Patterson, followed by the MIPS project, started at Stanford University and directed by John L. Hennessy. The RISC project was based on gaining performance through the use of pipelining and an extensive use of registers known as register windows. A normal CPU has a small number of registers and a program can use any register at any time. In a CPU with register windows there is a huge number of registers, but programs can only access a small number of them, according to certain rules. A program that employs few registers for each procedure can make very fast procedure calls: the call and the return simply move the window to the subset of registers used by that procedure. In a normal CPU, most calls push the content of the registers to RAM to clear enough working space for the subroutine, and the return restores those values. The RISC project delivered the RISC-I processor in 1982. Made of only 44,420 transistors (compared to about 100,000 in newer CISC designs), RISC-I had only 32 instructions, and yet completely outperformed any other single-chip design. The RISC-II, with 40,760 transistors and 39 instructions, was delivered in 1983: it ran over three times faster than RISC-I.
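The register window mechanism described above can be modelled with a minimal sketch. It is a deliberate simplification and not a model of the actual RISC-I hardware: the windows here do not overlap, the window count and size are arbitrary assumptions, and running out of windows raises an error where real hardware would spill registers to memory. The class and method names are hypothetical.

```python
class RegisterWindowFile:
    """Simplified register-window model (non-overlapping windows, no spilling)."""

    def __init__(self, num_windows=8, window_size=16):
        self.num_windows = num_windows
        self.window_size = window_size
        self.regs = [0] * (num_windows * window_size)  # large physical file
        self.current = 0  # index of the window visible to the program

    def _index(self, reg):
        if not 0 <= reg < self.window_size:
            raise IndexError("register outside the visible window")
        return self.current * self.window_size + reg

    def read(self, reg):
        return self.regs[self._index(reg)]

    def write(self, reg, value):
        self.regs[self._index(reg)] = value

    def call(self):
        """Enter a procedure: only the window pointer moves, nothing is saved."""
        if self.current + 1 >= self.num_windows:
            raise OverflowError("out of windows (real hardware would spill to memory)")
        self.current += 1

    def ret(self):
        """Leave a procedure: the caller's registers reappear untouched."""
        if self.current == 0:
            raise RuntimeError("return without a matching call")
        self.current -= 1


rf = RegisterWindowFile()
rf.write(0, 111)        # caller's r0
rf.call()
rf.write(0, 222)        # callee's r0 is a different physical register
callee_r0 = rf.read(0)
rf.ret()
caller_r0 = rf.read(0)  # still the caller's value, with no memory traffic
print(callee_r0, caller_r0)
```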

The MIPS project focused almost entirely on the pipeline; although pipelining was already in use in other designs, several features of the MIPS chip made its pipeline far faster. The most important feature was the requirement that all instructions complete in one cycle. This requirement allowed the pipeline to run at a higher speed and is responsible for much of the processor’s performance. However, it also had the negative side effect of eliminating many potentially useful instructions, like multiply or divide, which necessarily require more clock cycles to execute.

The earliest attempt to create and manufacture a CPU based on the RISC philosophy was a new project at IBM, which started in 1975. The work led to the IBM 801 CPU family, which was widely used inside IBM hardware. In 1981 the 801 was eventually produced as a single chip, the ROMP, Research (Office Products Division) Mini Processor. Nevertheless, the 801 inspired several research projects, including new ones at IBM that would eventually lead to their POWER system.



Berkeley’s research was not directly commercialized, but some years later the RISC-II design was used by Sun Microsystems to develop the SPARC architecture, by others to develop mid-range multi-processor machines and by almost every other company a few years later. It was Sun’s use of a RISC architecture in their new chips that proved RISC’s benefits were real, and their architectures quickly outpaced the competition and essentially took over the entire workstation market. On the other hand, MIPS went on to become one of the most used RISC architectures when it was included in the PlayStation and Nintendo 64 game consoles. In the same years IBM went on to design new machines based on their new POWER architecture and also moved their well-known AS/400 systems to POWER chips, discovering that the modified system ran considerably faster than the highly complex instruction set system used before. The POWER architecture was also fundamental to the development of the PowerPC design, which eliminated many of the “IBM only” instructions and created a single-chip implementation. Today the PowerPC is one of the most commonly used CPUs for embedded and automotive applications. It was also the CPU used in most Apple Macintosh machines sold until 2006, before Apple switched their PowerPC products to Intel x86 processors.

In the late ’80s Intel released the i860 and i960, whereas Motorola built a new design called the 88000 in homage to its famed CISC 68000, but eventually abandoned it and joined IBM to produce the PowerPC. AMD released their 29000, which would go on to become the most popular RISC design of the early ’90s. Today the vast majority of all CPUs in use, and particularly microcontrollers, are RISC CPUs. RISC architectures offer power even in small sizes, and thus have come to completely dominate the market for low-power embedded CPUs, which is by far the largest market for processors. In fact, while a family may own one or two PCs, their cars, cell phones and other devices contain a number of embedded processors. RISC architectures also completely took over the market for larger workstations for much of the ’90s. After the release of the Sun SPARCstation other vendors rushed to compete with RISC based solutions of their own. Even the mainframe world is now completely RISC based. However, despite many successes, RISC has missed the desktop PC and commodity server markets, where Intel’s x86 platform remains the dominant processor architecture (it must be considered that AMD’s processors implement the x86 platform, or a 64-bit superset known as x86-64). The primary reason for this incomplete revolution is that the large base of proprietary PC applications is written for x86, whereas no RISC platform has a similar installed base. The second reason is that, although RISC was indeed able to scale up in performance quite quickly and cheaply, Intel took advantage of its large market by spending vast amounts of money on processor development. The first x86 CPU to deploy RISC techniques was the NexGen Nx586, released in 1994, which expanded the majority of the CISC instructions into multiple simpler RISC operations. Internally the Nx586, Intel P6, AMD K5 and Cyrix 6x86 are RISC machines that emulate a CISC architecture.

In 2004 x86 chips were the fastest CPUs in SPECint, displacing all RISC CPUs, but the fastest CPU in SPECfp⁶ is the IBM Power 5 processor.

Still, RISC designs have led to a number of successful platforms and architectures, some of the larger ones being:

• MIPS line, found in most SGI computers and in the PlayStation, PlayStation 2, PlayStation Portable and Nintendo 64 game consoles.

• IBM’s and Freescale’s (formerly Motorola SPS) Power Architecture, used in all of IBM’s supercomputers, midrange servers and workstations, in Apple’s Power Macintosh computers, in Nintendo’s Gamecube and Wii, Microsoft’s Xbox 360 and Sony’s PlayStation 3 game consoles, and in many embedded applications like printers and automotive applications.

• Sun’s SPARC and UltraSPARC, found in all of their later machines.

• Hewlett-Packard’s PA-RISC, also known as HP/PA.

• DEC Alpha, still used in some of HP’s workstations and servers.

• XAP processor, used in many wireless chips, e.g. Bluetooth.

• ARM: Palm, Inc. originally used the (CISC) Motorola 680x0 processors in its early PDAs, but now uses ARM processors in its latest PDAs; Apple Computer uses the ARM7TDMI in its iPod products; Nintendo uses an ARM7 CPU in the Game Boy Advance and both an ARM7 and an ARM9 in the Nintendo DS handheld game systems; the small Korean company Game Park also markets the GP32, which uses the ARM9 CPU; many cell phones, like Nokia products, are based on ARM designs.

• Hitachi’s SuperH, originally in wide use in the Sega Super 32X, Saturn and Dreamcast, now at the heart of many consumer electronics devices; the SuperH is the base platform for the Mitsubishi - Hitachi joint semiconductor group.

⁶ SPECint and SPECfp are computer benchmark specifications for the CPU’s integer and floating point performance; they are maintained by the Standard Performance Evaluation Corporation (SPEC).



2.5 Pipelining and cache technology

In the early ’80s it was thought that existing CPUs were reaching their theoretical limits. Future improvements in speed would be primarily achieved through improved semiconductor “process”, that is, smaller technology process features (transistors and wires). The complexity of the chip would remain nearly the same, but the smaller size would allow it to run at higher clock rates.

A crucial structure introduced in CPU design was the pipeline, i.e. a chain of stages separated by registers, which breaks instructions down into steps and works on one step of several different instructions at the same time. A very simple processor might read an instruction, decode it, fetch from memory the data it asks for, perform the operation, and then write the results to registers or memory. The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the previous one, meaning that there are now two instructions being worked on (one is being decoded, the next is being read), and after another cycle there will be three, because the previously decoded instruction can be executed in a third stage. While no single instruction is completed any faster, each instruction completes right after the previous one. The result is a much more efficient utilization of processor resources, and these techniques are ordinarily used on both RISC and CISC processors. The goal of a pipelined architecture is to keep the pipeline full of instructions at all times, and recent processors have very complex pipeline structures, with many stages.
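The overlap described above can be made concrete with a short sketch that lists which instruction occupies each stage at every clock cycle. It assumes an ideal five-stage pipeline matching the steps named in the text (read, decode, operand fetch, execute, write), with one instruction entering per cycle and no hazards or stalls; the stage names and function names are illustrative assumptions.

```python
STAGES = ["fetch", "decode", "operand", "execute", "write"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per clock cycle; row[s] is the index of the
    instruction in stage s during that cycle, or None if the stage is empty."""
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for s in range(len(stages)):
            instr = cycle - s  # instruction i reaches stage s at cycle i + s
            row.append(instr if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


def unpipelined_cycles(num_instructions, stages=STAGES):
    """Cycles needed if each instruction finishes all stages before the next starts."""
    return num_instructions * len(stages)


sched = pipeline_schedule(4)
for cycle, row in enumerate(sched):
    print(cycle, ["-" if i is None else f"I{i}" for i in row])
print("pipelined:", len(sched), "cycles; unpipelined:", unpipelined_cycles(4))
```

Each instruction still needs five cycles from fetch to write, but once the pipeline is full one instruction completes per cycle, so N instructions take N + 4 cycles instead of 5N under these idealized assumptions.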

The use of the pipeline was primarily a characteristic of RISC designs, which shared a not-so-nice feature referred to as the branch delay slot. The branch delay slot is a side effect of pipelined architectures due to the branch hazard⁷, i.e. the fact that a branch is not resolved until the instruction has crossed several stages of the pipeline, reaching the execution stage to update the program counter. The pipeline location immediately following such an instruction is called a branch delay slot. Since the fetch and decode operations on the following instructions depend on the outcome of the branch instruction, the pipeline needs to be flushed if program execution continues at a new memory location. A branch hazard can cause the processor to perform unwanted actions, so appropriate measures must be taken to avoid this behavior. A simple design would insert stalls into the pipeline after a branch instruction until the new branch target address is computed and loaded into the program counter; in this case each cycle in which a stall is inserted is considered one branch delay slot.

A more sophisticated design would execute the instructions which do not depend on the result of the branch instruction. This optimization, however, must be performed in software at compile time and consists in moving branch-independent instructions into the branch delay slots. Modern techniques resolve this problem by using branch prediction algorithms and speculative execution, so that many branch delay slots are efficiently exploited, reducing the performance penalty.

⁷ Branch hazards are also known as control hazards.

Branch prediction is based on complex algorithms which allow the processor to establish whether a conditional branch in the instruction flow of a program is likely to be taken or not. It is thus possible to fetch and decode the right instructions without waiting for the branch to be resolved. Another prediction method is implemented by the branch target predictor, which attempts to guess the target of a branch or unconditional jump before it is computed, by parsing the instruction itself. Branch predictors are crucial in today’s processors for achieving high performance. When a conditional branch instruction is encountered, the processor guesses which way the branch is most likely to go (this is called branch prediction), and immediately starts executing instructions from that point. If the guess later proves to be incorrect, all computation past the branch point is discarded. The early execution is relatively cheap because the pipeline stages involved would otherwise be frozen until the next instruction is known. However, wasted instructions consume CPU cycles that could otherwise have delivered performance and, on a mobile device, consume battery charge, representing the penalty for a mispredicted branch. The execution of code whose results may turn out to be useless is called speculative execution; it is often combined with out-of-order execution and, as discussed above, these techniques can improve processor performance.
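As an illustration of how a branch predictor can guess the direction of a conditional branch, the sketch below implements a 2-bit saturating counter, a common textbook scheme. It is not claimed to be the algorithm of any processor discussed here; the class name, the initial state and the sequence of branch outcomes are assumptions made for the example.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken,
    states 2 and 3 predict taken."""

    def __init__(self, state=1):
        self.state = state  # assumed start: weakly not taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor=None):
    """Replay a sequence of actual branch outcomes and count wrong guesses."""
    if predictor is None:
        predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1  # the speculatively executed work would be discarded
        predictor.update(taken)
    return misses


# Hypothetical loop-closing branch: taken 9 times, then falls through once.
loop_outcomes = [True] * 9 + [False]
misses = count_mispredictions(loop_outcomes)
print("mispredicted", misses, "of", len(loop_outcomes), "branches")
```

In this run the predictor is wrong only on the first iteration and on the loop exit; every other iteration proceeds without waiting for the branch to be resolved.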

Another solution to improve instruction throughput was to use several processing elements inside the processor and run them in parallel. Instead of working on one instruction at a time to perform an ALU operation, these superscalar processors look at the next instruction in the pipeline and attempt to run it at the same time in an identical unit. However, this can be difficult to do, as many instructions depend on the results of previous instructions. In any case, most modern processors have more than one execution unit for integer numbers, a separate unit for floating point numbers and sometimes circuitry for independent memory address calculations. All these units can be used at the same time if there are no data hazards, i.e. operations executed in one unit which depend on a previous instruction that has not completed yet. Both of these techniques rely on increasing speed by adding complexity to the basic CPU architecture, as opposed to the instructions running on it.
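The data hazard condition mentioned above can be sketched as a dependency check between two consecutive instructions. The sketch is a simplification: it considers only read-after-write dependencies through registers and ignores memory dependencies and execution unit availability. The `Instr` record, the function names and the three sample instructions are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register and a tuple of sources.
Instr = namedtuple("Instr", ["text", "dest", "sources"])


def has_raw_hazard(first, second):
    """True if `second` reads a register that `first` has yet to write."""
    return first.dest is not None and first.dest in second.sources


def can_issue_together(first, second):
    """Two instructions may share a cycle only if the second one does not
    need the result of the first (register read-after-write check only)."""
    return not has_raw_hazard(first, second)


add = Instr("add r1, r2, r3", dest="r1", sources=("r2", "r3"))
sub = Instr("sub r4, r1, r5", dest="r4", sources=("r1", "r5"))
mul = Instr("mul r6, r7, r8", dest="r6", sources=("r7", "r8"))

print(can_issue_together(add, sub))  # sub needs r1 produced by add
print(can_issue_together(add, mul))  # independent registers
```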

Another problem with pipelined processors occurs when a part of the processor’s hardware is needed by two or more instructions at the same time. A situation like this is called a structural hazard and might occur, for instance, if a program executes a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow (requiring a comparison, program counter-related computation, and writing to registers), it is quite possible (depending on the architecture) that the computation instruction and the branch instruction will both require the ALU at the same time. The simplest solution to a structural hazard is the insertion of one or more pipeline stall cycles; alternatively, a sophisticated algorithm may allow a different scheduling of the instructions, without losses in instruction throughput.

Figure 2.3. Pipelined processor example

Using a small amount of fast memory between the CPU and the main memory to store a copy of the most frequently used data, it is possible to reduce the average time needed to access the required resources. This kind of memory is called cache memory and can be defined as “a temporary storage area where frequently accessed data can be stored for rapid access”. In modern computers the main memory needs to have plenty of room for data and code and so must have a low cost per megabyte, with dramatic effects on latency and bandwidth. While the main memory is implemented with dynamic memories, cache memories, which are very small (a few megabytes), must be as efficient as possible, with high performance and very low latency, so they are manufactured using static memories. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of the main memory. The structure of the access to the caches is often different. In most modern architectures, in fact, the CPU accesses two different portions of cache, one for instructions and one for data (as in the Harvard architecture), but in case of a cache miss (a required datum which is not available in the portion of memory copied into the cache), the data is retrieved from the main memory, which is shared by code and data. For this reason, the off-chip memory resources are managed as in the Von Neumann architecture, so the global platform inherits the structure of both the models described in the previous paragraphs. RISC designs are also more likely to feature a Harvard memory model, where the instruction stream and the data stream are conceptually separated; this means that modifying the memory locations where code is held might not have any effect on the instructions executed by the processor (because the CPU has separate instruction and data caches), at least until a special synchronization instruction is issued. On the upside, this allows both caches to be accessed simultaneously, which can often improve performance.
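The claim that the average access latency stays close to the cache latency as long as most accesses hit the cache can be checked with the usual average memory access time relation (hit time plus miss rate times miss penalty). The latencies below are hypothetical values in clock cycles, chosen only for illustration, and the function name is an assumption.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical latencies in clock cycles (assumptions, not measured values).
CACHE_HIT = 1             # cost of an access served by the cache
MAIN_MEMORY_PENALTY = 50  # extra cost paid on a cache miss

for miss_rate in (0.0, 0.02, 0.10, 0.50):
    amat = average_access_time(CACHE_HIT, miss_rate, MAIN_MEMORY_PENALTY)
    print(f"miss rate {miss_rate:.2f}: {amat:.1f} cycles on average")
```

With these numbers a miss rate of a few percent keeps the average within a few cycles of the cache latency, while a high miss rate pushes it toward the main memory latency.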

Figure 2.4. Cached processor example (blocks shown: CPU, instruction cache, data cache, MMU, main memory)

RISC was tailor-made to take advantage of pipelining and caching techniques, because the core logic of a RISC CPU was considerably simpler than in CISC designs. Although the first RISC designs had marginal performance, they were able to quickly add these new design features and by the late 1980s they were significantly outperforming their CISC counterparts. In time, this came to be regarded as an optimal structure, and the improvements in technology processes led to the point where all of this could be added to a CISC design and still fit on a single chip, but this took most of the late ’80s and early ’90s. In general, for any given level of performance, a RISC chip will typically have far fewer transistors dedicated to the core logic than a CISC chip. This gives the designers considerable flexibility, as they can:

• Increase the size of the register set.

• Implement measures to increase internal parallelism.

• Increase the size of caches.

• Add other functionality, like I/O and timers for microcontrollers.

• Build the chips on older fabrication lines, which would otherwise go unused.

• Offer the chip itself for battery-constrained or size-limited applications.



2.6 RISC vs CISC architecture

In certain circumstances RISC architectures offer significant advantages over CISC platforms and vice versa, so that the majority of today’s processors cannot rightfully be identified as completely RISC or CISC [5]. The two architectures have evolved towards each other so that there is no longer a clear distinction between their respective approaches to increasing performance and efficiency. As already discussed in the previous paragraphs, CISC architectures are based on reducing the amount of time spent retrieving instructions from memory by concentrating elementary machine operations into more complex instructions, although these instructions need multiple clock cycles to execute.

A typical CISC processor has most of the following properties:

• Uses microcode to simplify the architecture of the control unit: the microcode is read from a resident ROM instead of implementing everything in hardware.

• Has improved performance, since instructions can be retrieved up to ten times faster from ROM than from main memory.

• Instructions of variable size.

• Rich instruction set, including simple and complex instructions.

• Has a large number of addressing modes.

• Has a small number of general-purpose registers, typically about 8; this is a result of having instructions which can operate directly on memory (which means no address storing).

• Instructions interface with memory in multiple ways, with complex addressing modes.

• Instructions generally take more than one clock cycle to execute.

• Orthogonal instruction set.

A RISC architecture has most of the characteristics listed here:

• Makes use of a small, simplified instruction set in an attempt to improve performance via a simpler architecture.

• Instructions execute in only one clock cycle.

• Uses pre-fetching techniques coupled with speculative execution (out-of-order execution).

• Pipelining.

• Instructions interface with memory via fixed mechanisms (load/store).

• Fast floating point performance.

• Few addressing modes.

• Large number of registers.

• Hardwired design (no microcode).

• Relies heavily on the compiler.

CISC machines have a variety of instruction formats for a large number of instructions and instruction groups; this makes decoding more difficult and more time-consuming. RISC greatly simplifies the instruction format for easy and fast decoding.

Although the CISC architecture improves computer performance, it still has some drawbacks:

• The instruction set and the chip hardware became more complex with each generation of computers, since earlier generations of a processor family were contained as a subset in every new version.

• Different instructions take different amounts of time to execute, due to their variable length.

• Many instructions are not used frequently; approximately 20% of the available instructions are used in a typical program.

On the other hand, RISC architectures suffer from some drawbacks:

• Programmers must pay close attention to instruction scheduling, so that the processor does not spend a large amount of time waiting for an instruction to execute.

• Debugging can be difficult due to the instruction scheduling.

• Require very fast memory systems to feed instructions.

The primary reason for the rise of RISC is essentially the fact that, when the RISC philosophy was introduced, CISC processors were manufactured using more than one SSI⁸ chip. Although VLSI technologies made the above problems even more critical, several factors indicated RISC architectures as a reasonable design alternative. The first factor is implementation feasibility: a great deal depends on being able to fit an entire CPU design on a single chip. A complex architecture is exposed to more realization problems in a given technology than a simple one, so a single-chip version of a complex design may remain unfeasible until VLSI technology improves further. RISC computers, therefore, benefit from a shorter time-to-market than CISC ones. Design complexity is a crucial factor in the growth of RISC architectures: if VLSI technology continues to almost double chip density roughly every two years, a design that takes only two years can potentially use a much superior technology, and hence be more effective, than a design that takes four years. RISC architectures also proved to make better use of chip area: the area gained back by designing a RISC architecture rather than a CISC one can be used to improve the RISC capabilities. For example, the entire system performance might improve if silicon area is used for on-chip caches, registers or even pipelining.

The CISC also suffers from the fact that its intrinsic complexity often makes advanced techniques even harder to implement. The ultimate test for cost-effectiveness and efficiency of a processor is the speed at which an implementation executes a given algorithm.

⁸ Small-Scale Integration

Better use of chip area and the availability of newer technology, thanks to reduced debugging time, contribute to the speed of the chip. A RISC potentially gains in speed merely from a simpler design. Many of today’s RISC cores support just as many instructions as yesterday’s CISC chips; the PowerPC 601, for instance, supports more instructions than the Pentium. Furthermore, today’s CISC CPUs use many techniques formerly associated with RISC chips. In conclusion, the difference between the RISC and CISC approaches is getting smaller and smaller.

At present, RISC processors have advantages over CISC for reasons other than silicon area, such as power consumption, environmental requirements, interrupt structure and cost, particularly in the automotive and portable devices markets. A RISC-based system can consume far less power than a CISC-based one. Environmental aspects are influenced by this parameter, because a RISC-based device can work in high-temperature places and often does not require the low EM emissions certification that a CISC one does. The latter needs fans and efficient cooling systems in particular environments. Cost per chip is also in favour of RISC, ranging from a few dollars against hundreds of dollars for CISC implementations; in addition, a system which has to mount a more complex processor is itself more expensive, since it must meet many strict requirements. Nevertheless, owing to competition on x86 processor prices, and even though RISC prices are dropping, a workstation using the CISC x86 Pentium II architecture is less expensive than a Sun UltraSPARC machine of equal performance.

The biggest threat for CISC and RISC might not be each other, but a new technology called EPIC, which stands for Explicitly Parallel Instruction Computing. As the acronym says, EPIC aims to execute many instructions in parallel; it is a 64-bit architecture project by Intel and Hewlett-Packard (also identified as IA-64) which led to the development of the Itanium processors. The root of EPIC can be found in the Instruction Level Parallelism (ILP) philosophy, which uses the compiler to identify and leverage opportunities for parallel execution of instructions. By exploiting ILP techniques it is possible to eliminate complex on-die scheduling circuitry in the CPU, freeing up space and power for other functions, including additional parallel execution resources. This approach can also be exploited in a more explicit manner: VLIW architectures encode multiple operations in every instruction, and then process these operations on the multiple execution units mentioned above. The goal of the EPIC philosophy is to produce a “post-RISC era” architecture that would address some of the key challenges faced by older RISC and CISC architectures, enabling more efficient performance scaling in future processor designs.

Chapter 3

The ARM microprocessor architecture

The ARM architecture is a 32-bit RISC architecture widely used in embedded designs and, thanks to its power-saving features, ARM CPUs are dominant in the mobile electronics market, where low power consumption is a critical design goal. ARM is the acronym of Advanced RISC Machine and, prior to that, Acorn RISC Machine; ARM Ltd. is a company spun off by Acorn Computers in 1990, with the aim of developing this family of cores in collaboration with Apple Computer Inc.

The processor modelled in this thesis work is an ARM7TDMI, a member of the ARM general-purpose microprocessor family. This processor offers high performance with low power consumption, so it is one of the most widespread embedded microprocessors for mobile products such as PDAs, mobile phones, media players, handheld gaming units and calculators. The ARM processor family is based on RISC principles and the ARM7TDMI is implemented with a Von Neumann architecture; the instruction set and the related decoding mechanism are much simpler than those of microprogrammed complex instruction set machines. This simplicity results in a high instruction throughput and an impressive real-time interrupt response for a small and cost-effective chip. Pipelining is exploited so that all parts of the processing and memory systems can operate continuously. The ARM7TDMI has a three-stage pipeline: while one instruction is being executed, its successor is being decoded and a third instruction is being fetched from memory. To ensure that instructions and data are fed to the processor units, a prefetch mechanism is also provided, so that the pipeline stages are nominally four instead of three. The ARM memory interface has been designed to allow good performance while keeping the cost of the memory system low. Speed-critical control signals are pipelined to allow system control functions to be implemented in standard low-power logic.
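
The overlap of the three pipeline stages can be pictured with a short simulation. The following is only an illustrative Python sketch, not part of the modelled processor: the function name `pipeline_trace` and the instruction labels are assumptions, and stalls, branches and memory wait states are deliberately ignored.

```python
def pipeline_trace(instructions):
    """Return, cycle by cycle, which instruction occupies each stage.

    Idealized three-stage pipeline (fetch, decode, execute): no stalls,
    no branches, one new instruction entering the pipeline per cycle.
    """
    stages = ("fetch", "decode", "execute")
    n = len(instructions)
    trace = []
    for cycle in range(n + len(stages) - 1):
        row = {}
        for depth, stage in enumerate(stages):
            idx = cycle - depth  # instruction that entered 'depth' cycles ago
            row[stage] = instructions[idx] if 0 <= idx < n else None
        trace.append(row)
    return trace


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_trace(["I1", "I2", "I3", "I4"])):
        print(cycle, row)
```

In cycle 2 of the printed trace, I1 is in the execute stage while I2 is being decoded and I3 fetched, which is the steady-state situation described above.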

Figure 3.1. ARM7TDMI core

3.1 The ARM processor family

The ARM design was started in 1983 as a development project at Acorn Computers Ltd. and the first samples, called ARM1, were available in 1985 [6]. The first noteworthy production of the ARM family processors reached the market in the following year with the ARM2. The ARM2 featured a 32-bit data bus, a 26-bit memory addressing space (64 Mbyte addressing range) and sixteen 32-bit wide registers. One of these registers served as the (word-aligned) program counter, with its top 6 bits and lowest 2 bits holding the processor status flags. The ARM2 was possibly the simplest useful 32-bit microprocessor designed, with only 30,000 transistors (compared with Motorola’s six-year-older 68000, with around 70,000). Much of its simplicity comes from not having microcode (which represents about one-fourth to one-third of the 68000 area occupation) and, like most CPUs of the day, from not including a cache. This simplicity led to its low power usage, while it performed better than the Intel 80286 processor. Its successor, the ARM3, was produced with a 4 KB cache, which further improved on the ARM2 performance.

In the late ’80s the ARM architecture was deeply revised, when Apple Computer Inc. started working with Acorn on newer versions of the core. The work was so important that in 1990 Acorn spun off the design team into a new company called Advanced RISC Machines Ltd. Advanced RISC Machines became ARM Ltd. when the parent holding company was listed on the London Stock Exchange and NASDAQ. Meanwhile, the first models of the ARM6 were realized (1991); Apple used the ARM6-based ARM 610 as the basis for its Apple Newton PDA and, in 1994, Acorn used the ARM 610 as the main CPU in its Risc PC. Through these evolutions the core has remained largely the same size: the ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. The idea is that the Original Design Manufacturer combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built in old semiconductor fabs and still deliver good performance at a low cost. While ARM’s business has always been to sell IP cores, some of the licensees produced microcontrollers based on this core. The most successful implementation has been the ARM7TDMI, with hundreds of millions of units sold in mobile phones and handheld video game systems. DEC licensed the architecture and produced the StrongARM, a 233 MHz CPU which drew only 1 watt of power (more recent versions draw far less). This work was later passed to Intel as part of a lawsuit settlement. Intel later developed its own high-performance implementation, known as XScale. ARM is also the common architecture supported on Windows Mobile smartphones, Personal Digital Assistants and other handheld devices.

3.2 The Thumb concept

The ARM7TDMI processor employs a unique architectural strategy called Thumb, which makes it ideally suited to high-volume applications with memory restrictions, or applications where code density is an issue. Besides the 32-bit instruction set (usually called the ARM instruction set), the ARM7 processor also has a reduced 16-bit instruction set, named the Thumb instruction set. The 16-bit instructions of the Thumb instruction set improve on the density of standard ARM code while retaining most of the ARM performance advantage over a traditional 16-bit processor using 16-bit registers. Thumb code, in fact, operates on 32-bit registers like ARM code, but is typically 65% of the size of the equivalent ARM code and provides 160% of the performance of ARM code when the processor is connected to a 16-bit memory system [16]. Thumb may be viewed as a compressed form of a subset of the ARM instruction set: Thumb instructions map onto ARM instructions, and the Thumb programmer model maps onto the ARM programmer model.

Thumb is not a complete architecture: the Thumb instruction set supports only common application functions, allowing recourse to the full ARM instruction set where necessary (for instance, all exceptions automatically enter ARM state). An application can mix ARM and Thumb subroutines in a flexible manner to optimize both performance and code density. In some applications the use of the Thumb instruction set can improve power efficiency, save cost and enhance performance all at once.

Thumb instructions operate on a subset of the standard ARM register configuration, allowing excellent interoperability between ARM and THUMB states, even within the same program. Each 16-bit Thumb instruction has a corresponding 32-bit ARM instruction with the same effect on the processor model; implementations of the Thumb architecture decompress the instructions dynamically, so that they execute as standard ARM instructions within the processor. The major advantage of a 32-bit (ARM) architecture over a 16-bit architecture is its ability to manipulate 32-bit integers with single instructions; when processing 32-bit data, a 16-bit architecture takes at least two instructions to perform the same task as a single ARM instruction. If a 16-bit architecture only has 16-bit instructions, and a 32-bit architecture only has 32-bit instructions, then overall the 16-bit architecture will have better code density, and better than one half of the performance of the 32-bit architecture. Clearly, 32-bit performance comes at the cost of code density; the Thumb instruction set available on the ARM7 breaks this constraint by implementing a 16-bit instruction length on a 32-bit architecture, making the processing of 32-bit data efficient with a compact instruction coding. This provides far better performance than a 16-bit architecture, with better code density than a 32-bit architecture. A further advantage is the ability to switch back to full ARM code and execute at full speed. Thus critical loops for applications such as fast interrupts and DSP algorithms can be coded using the full ARM instruction set and linked with Thumb code.

The Thumb architecture, using 32-bit registers, can also address a large memory space efficiently. In this thesis work, however, the Thumb architecture is not treated, so the modelled architecture can only execute code which does not change the processor state.

3.3 The programmer model

3.3.1 Operating states and state switching

The ARM processor can operate in one of two possible states: ARM state or THUMB state. When it operates in ARM state, it executes 32-bit word-aligned instructions, decoding them according to the ARM instruction set. If the processor is in THUMB state, instead, the instructions are halfword-aligned and bit 1 of the program counter indicates which of the two halfwords is selected for fetching; decoding is done according to the 16-bit Thumb instruction set. Switching from one state to the other is possible by using the branch¹ and exchange (BX) instruction; the target state is defined by the least significant bit (bit 0) of the value contained in the operand register. Entry into THUMB state is achieved by executing a BX instruction with the state bit set in the operand register. The transition to THUMB state also occurs automatically on return from an exception, if the exception was entered with the processor in THUMB state. Entry into ARM state happens on execution of a BX instruction with the state bit clear in the operand register, or automatically every time the processor takes an exception. In this second case, the PC is placed in the link register of the exception mode, and execution commences at the relevant exception vector address. Exception handling is described in more detail in paragraph 3.4.
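
The role of bit 0 of the BX operand can be summarized with a small model. This is a minimal Python sketch under stated assumptions: the function name `branch_exchange` is hypothetical, and for the ARM case the operand is assumed to be already word-aligned, since the text does not say what happens otherwise.

```python
def branch_exchange(operand):
    """Model the effect of BX on the processor state and fetch address.

    Bit 0 of the operand register selects the target state and is not
    part of the fetch address.
    """
    operand &= 0xFFFFFFFF
    if operand & 1:
        # State bit set: THUMB state, halfword-aligned fetch address
        return "THUMB", operand & 0xFFFFFFFE
    # State bit clear: ARM state (operand assumed word-aligned by the caller)
    return "ARM", operand


if __name__ == "__main__":
    print(branch_exchange(0x00008001))
    print(branch_exchange(0x00008000))
```

Calling the function with an odd value therefore models entry into THUMB state, while an even, word-aligned value models entry into ARM state.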

3.3.2 Memory formats and data types

The processor views memory as a linear collection of bytes, numbered upwards starting from zero, so bytes 0 to 3 hold the first stored word (32-bit word), bytes 4 to 7 the second and so on. The processor supports words stored in memory in both big endian and little endian format.

In big endian format, the most significant byte of a word is stored at the lowest numbered byte and the least significant byte at the highest numbered byte. Byte 0 of the memory system is therefore connected to data lines 31 through 24, byte 1 is connected to data lines 23 through 16 and so on. The memory scheme for this organization is shown in fig. 3.2.

Figure 3.2. Big endian memory organization

In little endian format, the lowest numbered byte in a word is considered the least significant byte, and the highest numbered byte the most significant. Byte 0 of the memory system is therefore connected to data lines 7 through 0, byte 1 to data lines 15 through 8 and so on. The memory scheme for this organization is shown in fig. 3.3.

Figure 3.3. Little endian memory organization

¹ A branch instruction performs conditional or unconditional jumps to a subroutine, i.e. to a portion of code within a larger program which performs a specific task and is relatively independent of the remaining code; for these reasons the subroutine code is stored at a different address with respect to the main program code.

The processor supports byte (8-bit), halfword (16-bit) and word (32-bit) data types. To guarantee correct access to the stored data, words must be aligned to four-byte boundaries and halfwords to two-byte boundaries; there is no restriction on the alignment of single bytes.
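
The two byte orderings and the alignment rule can be checked with a few lines of Python. The sketch below is only an illustration: the helper names `word_bytes` and `is_aligned` are assumptions, and the standard `struct` module is used merely to serialize a word in the chosen byte order.

```python
import struct


def word_bytes(value, big_endian):
    """Return the four bytes of a 32-bit word in increasing address order."""
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, value & 0xFFFFFFFF)


def is_aligned(address, size):
    """True if an access of 'size' bytes (1, 2 or 4) is correctly aligned."""
    return address % size == 0


if __name__ == "__main__":
    print(word_bytes(0x11223344, big_endian=True).hex())
    print(word_bytes(0x11223344, big_endian=False).hex())
    print(is_aligned(6, 4), is_aligned(6, 2))
```

For the word 0x11223344, the big endian layout places 0x11 at the lowest address, whereas the little endian layout places 0x44 there.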

The memory interface and other details about the processor data management are reported in section 3.7. Some specific instructions are also provided to handle signed and unsigned data types flexibly, in particular to sign-extend byte and halfword sized data.

3.3.3 Operating modes

In order to support the normal program flow, to handle events such as unrecoverable application errors and software or hardware interrupts, to provide the privileged functions of an operating system and to access some reserved resources, the processor can work in several operating modes:

user (usr): the normal application program execution state;
FIQ (fiq): designed to support a data transfer or debug process;
IRQ (irq): used for general-purpose interrupt handling;
supervisor (svc): protected mode for the operating system;
abort (abt): entered after a data or instruction prefetch abort;
system (sys): a privileged user mode for the operating system;
undefined (und): entered when an undefined instruction is executed.

Application programs normally execute in user mode. The other (non-user) modes, called privileged modes, are entered in order to service interrupts or exceptions or to access protected resources. Mode changes can be performed via explicit program instructions or through dedicated input signals driven by peripherals connected to the processor. A detailed description of these operating modes is given in paragraph 3.4.

3.3.4 Processor resources

The processor has a total of thirty-seven 32-bit wide registers; thirty-one of these are general-purpose registers and the other six are status registers. Not all the registers can be seen at once: five of the status registers and fifteen of the general-purpose registers are banked registers and cannot be seen in user mode. The processor state and operating mode dictate which registers are available to the programmer. Up to sixteen general-purpose registers and one or two status registers are visible at once. In privileged modes, mode-specific banked registers are activated and so can be accessed by the programmer.

The ARM state register set contains sixteen directly accessible registers, named R0 to R15. All of these, except R15, are general-purpose and may be used to hold either data or address values. In addition to these, there is a seventeenth register used to store processor status information (see section 3.3.5).

Register R15 holds the Program Counter (PC). In ARM state, bits [1:0] of R15 are zero and bits [31:2] contain the instruction fetch address; this is because instructions are 32 bits wide and word-aligned in memory. In THUMB state, bit [0] is zero and bits [31:1] contain the fetch address; instructions, in that state, are 16 bits wide and halfword-aligned, so each word holds two of them.

Register R14 is used as the subroutine link register (LR), so it receives a copy of the program counter when a branch with link (BL) instruction is executed. At all other times it may be treated as a general-purpose register. The corresponding banked registers R14_svc, R14_irq, R14_fiq, R14_abt and R14_und are similarly used to hold the return values of R15 when interrupts and exceptions arise. These registers are also used when branch with link instructions are executed within interrupt or exception routines.

Non-user modes have some other banked registers which can be accessed; these registers and their use are described in paragraph 3.4, where exception handling is discussed.
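
The register totals quoted in this section can be reproduced by enumerating the banked registers. The Python sketch below is an illustration only; the table `BANKED` and the helper names are hypothetical, and the assignment of R13 and R14 as the banked pair of the supervisor, abort and undefined modes is an assumption taken from the ARM7TDMI documentation rather than from the text above.

```python
# Banked general-purpose registers per exception mode (assumed layout).
# System mode shares the user-mode registers, so it has no entry here.
BANKED = {
    "fiq": ["R8", "R9", "R10", "R11", "R12", "R13", "R14"],
    "irq": ["R13", "R14"],
    "svc": ["R13", "R14"],
    "abt": ["R13", "R14"],
    "und": ["R13", "R14"],
}


def physical_register(name, mode):
    """Map an architectural register name to the physical one used in 'mode'."""
    if name in BANKED.get(mode, []):
        return f"{name}_{mode}"
    return name


def register_counts():
    """Return (general-purpose registers, status registers)."""
    general = 16 + sum(len(regs) for regs in BANKED.values())  # R0-R15 + banked
    status = 1 + len(BANKED)  # CPSR + one SPSR per exception mode
    return general, status


if __name__ == "__main__":
    print(register_counts())
    print(physical_register("R14", "irq"), physical_register("R8", "irq"))
```

Under these assumptions the enumeration gives thirty-one general-purpose and six status registers, i.e. the thirty-seven registers mentioned above.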

THUMB state has a reduced register set with respect to ARM state. It can essentially be considered a subset of the ARM state register set: eight general-purpose registers can be accessed (numbered from R0 to R7 and referred to as low registers in the programmer’s model) and they map onto the respective registers in ARM state. It must be underlined that these registers are all 32 bits wide, being the same in both ARM and THUMB state. Moreover, in THUMB state, a stack pointer (SP), the link register (LR) and the program counter (PC) are available. These last three registers map onto the ARM state registers R13, R14 and R15 respectively. In some particular conditions, the other registers (numbered from R8 to R15 and referred to as high registers) can also be accessed; this is useful after a branch and exchange operation, for example. These resources can be managed by using the high register operations belonging to the Thumb instruction set.

3.3.5 The Processor Status Registers (PSRs)

To store its internal state the processor is provided with a 32-bit register named the Current Processor Status Register (CPSR). The bits of the CPSR can be grouped according to their function: the least significant bits are called control bits and store the processor operating mode, the state bit and two flags for interrupt enabling and disabling. The most significant bits hold information about the most recent ALU operation performed. Not all 32 bits of a PSR are used to store information; some of them are reserved for future upgrades and new functionalities of the processor family, and must not be used. The arrangement of bits is shown in Figure 3.4. Besides the CPSR, five other banked registers, one for each privileged mode except system, are available; they are called Saved Processor Status Registers (SPSRs).

Five of the control bits define in which operating mode (section 3.3.3) the processor is working; they are indicated with M0 to M4 and named mode bits. The mode bits can assume the values reported in table 3.1.

Table 3.1. Mode bits possible values

M[4:0]    mode
10000     user
10001     FIQ
10010     IRQ
10011     supervisor
10111     abort
11011     undefined
11111     system

The subsequent bit is called the T-bit and is a flag which indicates in which state, ARM or THUMB, the processor is. An external signal (TBIT) reflects the T-bit: if the bit is set the processor is in THUMB state, otherwise it works in ARM state. The I-bit and F-bit, when set, disable the standard interrupt and the fast interrupt respectively.

The most significant bits of the CPSR are occupied by the condition code flags, used for the ALU operations and also for the evaluation of the condition which accompanies every processor instruction, determining whether the instruction must be executed or not. These flags are:

N bit: negative or less than flag;
Z bit: zero result flag;
C bit: carry or borrow or extend flag;
V bit: overflow flag.

All these flags can be modified by an arithmetic or logic operation performed by the ALU, and the C bit is also used in arithmetic operations which take into account a previous carry or borrow.

The CPSR and the banked SPSRs are identical in both ARM and THUMB state.

Figure 3.4. Program status register format
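
The fields of a status register can be extracted with simple shifts and masks. The following Python sketch is illustrative: the name `decode_cpsr` is hypothetical, and the bit positions used (N, Z, C, V in bits 31 to 28, I in bit 7, F in bit 6, T in bit 5, mode in bits 4 to 0) are an assumption taken from the ARM documentation, since the figure is not reproduced in the text.

```python
# Mode encodings as listed in table 3.1
MODES = {
    0b10000: "user",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "supervisor",
    0b10111: "abort",
    0b11011: "undefined",
    0b11111: "system",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its fields (assumed bit positions)."""
    return {
        "N": bool(cpsr >> 31 & 1),
        "Z": bool(cpsr >> 30 & 1),
        "C": bool(cpsr >> 29 & 1),
        "V": bool(cpsr >> 28 & 1),
        "irq_disabled": bool(cpsr >> 7 & 1),            # I-bit
        "fiq_disabled": bool(cpsr >> 6 & 1),            # F-bit
        "state": "THUMB" if cpsr >> 5 & 1 else "ARM",   # T-bit
        "mode": MODES.get(cpsr & 0x1F, "invalid"),
    }


if __name__ == "__main__":
    print(decode_cpsr(0x600000D3))
```

With the value 0x600000D3 the sketch reports the Z and C flags set, both interrupts disabled, ARM state and supervisor mode.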

3.4 The exception handling

Exception handling is a mechanism designed to manage the occurrence of operating conditions that change the normal flow of execution; a condition which causes this behavior is properly called an exception. Sometimes this term is used only to designate error conditions, and not conditions that could be considered part of the normal flow of execution, such as interrupt management. Following the ARM7 architecture development guidelines, here interrupts are also considered exceptions, and not only problematic situations. In the presence of an exception the execution flow must be halted temporarily and the exception must be recognized and managed via a routine called a handler; usually the processor state needs to be frozen by saving all the values stored within the resources in memory or in the banked registers. Only in this way can the normal execution flow be resumed later, at the end of the exception handling.

Figure 3.5. ARM7TDMI register set in ARM/THUMB state

The processor, when entering an exception, performs the following operations:

• Preserves the address of the next instruction to be executed in the dedicated Link Register (LR); the address stored in the register, with respect to the program counter, depends on the processor state (ARM or THUMB) and on the type of exception, so that the program resumes from the right place when the processor ends the exception handling with a standard assembly instruction (MOVS PC, R14_<processor mode>), without storing further information.

• Copies the CPSR into SPSR_<processor mode>, depending on the operating mode the processor is entering.

• Forces the CPSR mode bits to a value which depends on the exception entered.

• Modifies the Program Counter (PC) to fetch the instruction indicated by the exception vector.

Entering an exception from THUMB state implies that the processor switches into ARM state when the exception vector address is loaded into the program counter. Each exception vector is the entry point of the corresponding handler routine and typically holds a branch to it; the first eight memory words are reserved for the vectors, according to the scheme reported in table 3.2.

Table 3.2. Exception vectors

address       exception                mode on entry
0x00000000    reset                    supervisor
0x00000004    undefined instruction    undefined
0x00000008    software interrupt       supervisor
0x0000000C    abort (prefetch)         abort
0x00000010    abort (data)             abort
0x00000014    reserved                 reserved
0x00000018    IRQ                      IRQ
0x0000001C    FIQ                      FIQ
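
The entry sequence and the vector table can be combined in a small behavioural model. This is a hedged Python sketch, not the LISA model developed in this thesis: the dictionary-based CPU, the names `VECTORS`, `MODE_BITS` and `enter_exception`, and the CPSR bit positions are all assumptions, and the rule that the I-bit is set on every entry while the F-bit is set only by reset and FIQ is taken from the ARM documentation.

```python
# kind -> (vector address, mode entered), following table 3.2
VECTORS = {
    "reset":          (0x00, "svc"),
    "undefined":      (0x04, "und"),
    "swi":            (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort":     (0x10, "abt"),
    "irq":            (0x18, "irq"),
    "fiq":            (0x1C, "fiq"),
}
MODE_BITS = {"fiq": 0b10001, "irq": 0b10010, "svc": 0b10011,
             "abt": 0b10111, "und": 0b11011}
I_BIT, F_BIT, T_BIT = 1 << 7, 1 << 6, 1 << 5  # assumed CPSR bit positions


def enter_exception(cpu, kind, return_address):
    """Apply the exception-entry steps to a dict-based CPU model."""
    vector, mode = VECTORS[kind]
    cpu[f"R14_{mode}"] = return_address              # preserve the return address
    cpu[f"SPSR_{mode}"] = cpu["CPSR"]                # save the status register
    cpsr = (cpu["CPSR"] & ~0x1F) | MODE_BITS[mode]   # force the new mode bits
    cpsr &= ~T_BIT                                   # handlers start in ARM state
    cpsr |= I_BIT                                    # IRQ disabled on entry (assumed)
    if kind in ("reset", "fiq"):
        cpsr |= F_BIT                                # FIQ disabled by reset and FIQ
    cpu["CPSR"] = cpsr & 0xFFFFFFFF
    cpu["PC"] = vector                               # fetch from the exception vector
    return cpu


if __name__ == "__main__":
    cpu = {"CPSR": 0x60000030, "PC": 0x8004}         # user mode, THUMB state
    print({k: hex(v) for k, v in enter_exception(cpu, "irq", 0x8008).items()})
```

In the example an IRQ taken in THUMB state leaves the old CPSR in SPSR_irq, switches to ARM state and IRQ mode, and restarts fetching at address 0x18.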

The exception handler, after the execution of a set of proper instructions, must return control to the program which was executing, and has to:

• Move the address contained in the link register (suitably corrected by an offset which depends on the exception occurred) into the program counter.

• Move the SPSR content into the CPSR to restore the original processor state; this operation also restores the ARM or THUMB state, so no explicit branch and exchange instruction is needed.

• Clear the interrupt disable flags, if they were set on exception entry.

3.4.1 Processor reset

An asynchronous transition from high to low on the nRESET signal forces the processor to abandon the execution of the program or exception handler; when the signal returns to a high level the processor is reinitialized to a well-defined state, i.e.:

• The mode bits of the CPSR (M[4:0]) are forced to “10011”, so that execution resumes in supervisor mode.

• R14_svc and SPSR_svc are overwritten by copying the current values of the program counter (PC) and status register (CPSR) into them; the values of the saved PC and SPSR are not defined, since they depend on the previous operating conditions.

• The I-bit and F-bit in the CPSR are set, so that both interrupt requests are disabled.

• The T-bit within the CPSR is cleared, hence execution resumes in ARM state.

• The program counter (PC) is reset, so the next instruction is fetched from address “0x00000000”.
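
The state reached after reset can be written down compactly. The Python sketch below is only illustrative: the function name `reset_state` and the CPSR bit positions (I in bit 7, F in bit 6, T in bit 5) are assumptions, and setting the I-bit and F-bit is read here as disabling the interrupts, consistently with the description of these flags in section 3.3.5.

```python
def reset_state():
    """Return the architecturally defined part of the state after reset."""
    mode_svc = 0b10011
    i_bit, f_bit = 1 << 7, 1 << 6  # assumed CPSR bit positions
    return {
        "mode": mode_svc,
        "irq_disabled": True,
        "fiq_disabled": True,
        "thumb": False,
        # CPSR[7:0]: I and F set, T clear, supervisor mode
        "control_bits": i_bit | f_bit | mode_svc,
        "PC": 0x00000000,
    }


if __name__ == "__main__":
    state = reset_state()
    print(hex(state["control_bits"]), hex(state["PC"]))
```

The condition code flags and the saved R14_svc and SPSR_svc are left out of the model on purpose, since their values after reset are not defined.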

3.4.2 Interrupt and fast interrupt requests

The Fast Interrupt Request (FIQ) is a particular exception of the ARM processor designed to support a data transfer or debug process with very low latency; in ARM state this is ensured by a group of seven banked registers reserved for the FIQ operating mode (numbered from R8_fiq to R14_fiq), so that register saving is not necessary.

FIQ is not the only way to forward an interrupt request to the processor: a standard interrupt request (IRQ) can also be used, but its priority is lower, so the FIQ mode can mask a concurrent IRQ. IRQ mode has only two banked registers, R13_irq and R14_irq; the latter is used as the link register for the exception return. The FIQ exception is entered by taking the nFIQ input low; the standard interrupt request is forwarded by taking the nIRQ input low. These inputs can accept either synchronous or asynchronous transitions, depending on the state of the ISYNC input signal. When ISYNC is low, interrupt requests on nFIQ and nIRQ are considered asynchronous and a one-cycle delay for synchronization is incurred before the interrupt can be analyzed by the processor. An interrupt handler should leave the interrupt by executing:

SUBS PC, R14_fiq, #4 or SUBS PC, R14_irq, #4

The interrupt requests can be ignored by setting the F-bit and I-bit in the CPSR, but these operations can be executed only in a privileged mode and not during the normal execution flow (user mode). If the corresponding flag is clear, the ARM7TDMI checks for a low level on the output of the FIQ or IRQ synchronizer at the end of each instruction.
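
The masking rule and the relative priority of the two requests can be expressed as a small decision function. This is a minimal Python sketch under stated assumptions: the name `pending_interrupt` is hypothetical, the inputs are taken as already synchronized logic levels, and the I-bit and F-bit are assumed to be bits 7 and 6 of the CPSR.

```python
I_BIT, F_BIT = 1 << 7, 1 << 6  # assumed CPSR bit positions


def pending_interrupt(cpsr, nfiq, nirq):
    """Return 'fiq', 'irq' or None for the active-low request inputs.

    nfiq and nirq are the synchronized logic levels: 0 means "request".
    """
    if nfiq == 0 and not cpsr & F_BIT:
        return "fiq"   # FIQ has priority over a concurrent IRQ
    if nirq == 0 and not cpsr & I_BIT:
        return "irq"
    return None


if __name__ == "__main__":
    print(pending_interrupt(0x10, 0, 0))   # both enabled, both requested
    print(pending_interrupt(0x50, 0, 0))   # FIQ masked by the F-bit
    print(pending_interrupt(0xD0, 0, 0))   # both masked
```

The check would be evaluated at the end of each instruction; when both requests are pending and enabled, the fast interrupt is served first.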

3.4.3 Abort conditions

The abort condition occurs when a memory access cannot be completed for any reason; this condition must be reported by the memory management unit (MMU), which drives the ABORT input to a high level. Two different cases can occur:

• Prefetch abort: occurs during an instruction fetch, e.g. in the presence of an invalid fetch address.

• Data abort: occurs during a memory access for a data load or store.

If a prefetch abort occurs, the processor does not enter the exception immediately, but marks the instruction as invalid and switches to abort mode only when that instruction reaches the execution stage. If the invalid instruction is never executed, for example because a preceding branch is taken while it is in the pipeline, the processor does not take the abort exception.

The management of data abort events depends on the instruction which is being executed. A data swap² instruction (SWP) which generates an abort exception has no effect, as if the instruction had not been executed. The execution of a load (LDR) or store (STR) instruction, in the presence of the exception, writes back³ the modified base register, and this effect must be taken into account by the handler.

If the execution of a block data transfer⁴ generates an abort exception, the instruction runs to completion, but in a load operation the remaining registers are not updated. If the base register is not in the register list and writeback is required, its value is modified. Once the problem which caused the abort exception has been solved, control should be returned to the normal flow by using:

SUBS PC, R14_abt, #4 for a prefetch abort, and
SUBS PC, R14_abt, #8 for a data abort.

² A data swap instruction is a particular operation which swaps the values contained in a processor register and in a memory location; it is usually used for the implementation of software semaphores and so is executed in locked mode.

³ In the absence of an exception, the writeback operation is performed only if expected by the addressing mode or explicitly required by the instruction syntax.

⁴ A block data transfer operation loads or stores a number of processor general-purpose registers from/to memory.

The difference in the offset is due to the processor prefetch mechanism.

In abort mode the processor uses the registers normally accessible in ARM and THUMB state, except the stack pointer and the link register, which are masked by the banked registers R13_abt and R14_abt respectively; these registers are the same in both operating states.

Through the abort mechanism it is possible to implement a paged virtual memory system: in the presence of a request for unavailable data, the memory management unit flags the abort to the processor, which activates the abort handling procedure. The MMU, subsequently, tries to find the required data in other memory pages in order to supply it to the processor. The handler must be written so that the CPU is put in a wait state until the MMU provides the right data or instruction, and then resumes the interrupted instruction.
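
The two return offsets can be verified with simple address arithmetic. The Python sketch below relies on an assumption drawn from the ARM documentation and valid for ARM state only: on a prefetch abort the link register holds the address of the aborted instruction plus 4, on a data abort the same address plus 8. The names `LR_OFFSET`, `saved_link_register` and `return_address` are hypothetical.

```python
# Distance between the saved link register and the aborted instruction
# (ARM state, assumed from the ARM documentation)
LR_OFFSET = {"prefetch_abort": 4, "data_abort": 8}


def saved_link_register(aborted_address, kind):
    """Value placed in R14_abt when the abort exception is entered."""
    return (aborted_address + LR_OFFSET[kind]) & 0xFFFFFFFF


def return_address(r14_abt, kind):
    """Effect of SUBS PC, R14_abt, #4 or #8: retry the aborted instruction."""
    return (r14_abt - LR_OFFSET[kind]) & 0xFFFFFFFF


if __name__ == "__main__":
    for kind in LR_OFFSET:
        lr = saved_link_register(0x8000, kind)
        print(kind, hex(lr), hex(return_address(lr, kind)))
```

In both cases the subtraction brings the program counter back to the aborted instruction, so that it can be executed again once the cause of the abort has been removed.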

3.4.4 Software interrupts and supervisor mode

The processor also allows the handling of software interrupts through a dedicated instruction (SWI). This feature is used for entering supervisor mode, usually to request a particular supervisor function, i.e. an operating system service. The instruction behavior is described in section 3.5.12, and the handler should return by using:

MOVS PC, R14_svc

which, irrespective of the processor state at exception entry, restores the CPSR and returns to the instruction following the software interrupt instruction. Banked versions of the stack pointer and link register (R13_svc and R14_svc) are also provided for this operating mode.

3.4.5 Undefined instruction

When the processor decodes an instruction which can not be handled, it takes theundefined instruction trap. This exception allows to avoid the system to enter anunrecoverable state, but it is also a useful mechanism that can be used to extend theprocessor instruction set by software emulation. After the instruction emulation andirrespective of the state, the trap handler should execute the following instruction:

MOVS PC, R14_und

This restores the CPSR and returns to the instruction following the undefined instruction. The undefined mode benefits only from the banked versions of the link register and of the program counter, referred to as R14_und and R15_und.



3.4.6 Exception priorities

Exception handling must follow some rules to establish whether a request or an event has priority over another. In the ARM7TDMI processor the exception priority follows a fixed scheme; starting from the highest priority, the order is:

1 reset
2 data abort
3 FIQ
4 IRQ
5 prefetch abort

whereas undefined instruction and software interrupt share the same lowest priority but are, obviously, mutually exclusive. A particular case arises if a data abort occurs at the same time as a FIQ (with fast interrupt service enabled); in this situation the processor enters the data abort handler and then proceeds to the FIQ vector immediately. A normal return from FIQ will cause the data abort handler to resume execution. Placing data abort at a higher priority than FIQ is necessary to ensure that the transfer error does not escape detection.

3.5 ARM instruction set

The following sections describe the most important characteristics of the ARM instruction set, grouping the instructions by their functional and coding features. A first coding scheme is given in figure 3.6, where the instruction classes are evident.

3.5.1 Conditional execution

When the processor works in ARM state, all the instructions are conditionally executed, by checking the condition field reported within the instruction coding against the CPSR’s condition flags. Every instruction has a 4-bit field which expresses one of fifteen possible conditions; one of these conditions bypasses the condition check, so that the instruction is “always” executed. There is a reserved condition (“1111”) which must not be used. The condition is evaluated by inspecting the ALU flags described in paragraph 3.3.5, which report information about the result of the last ALU operation executed, indicating whether it was negative or zero, whether a carry or borrow arose or whether an overflow condition occurred. For each condition a two-character mnemonic suffix is defined, which must be appended to the instruction mnemonic to enable the optional conditional execution. The mnemonics and their meaning are reported in table 3.3. A conditioned instruction is executed only if the condition is true and, if no condition suffix is expressed in the assembly code, the “always” coding is inserted by default. Every time an instruction remains unexecuted because its condition is not satisfied, it consumes a clock cycle and control passes to the next instruction, entering a new processor cycle.

Figure 3.6. 32-bit instruction set format summary
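The condition check described above can be summarised with a small behavioural sketch. It is written in Python purely for illustration and is not part of the LISARM model; the names `CONDITION_TESTS` and `condition_passed` are hypothetical, and the flags are passed as booleans.

```python
# Hypothetical helper (not part of LISARM): evaluates the 4-bit condition
# field of table 3.3 against the CPSR flags N, Z, C, V given as booleans.
CONDITION_TESTS = {
    0b0000: lambda n, z, c, v: z,                    # EQ
    0b0001: lambda n, z, c, v: not z,                # NE
    0b0010: lambda n, z, c, v: c,                    # CS
    0b0011: lambda n, z, c, v: not c,                # CC
    0b0100: lambda n, z, c, v: n,                    # MI
    0b0101: lambda n, z, c, v: not n,                # PL
    0b0110: lambda n, z, c, v: v,                    # VS
    0b0111: lambda n, z, c, v: not v,                # VC
    0b1000: lambda n, z, c, v: c and not z,          # HI
    0b1001: lambda n, z, c, v: (not c) or z,         # LS
    0b1010: lambda n, z, c, v: n == v,               # GE
    0b1011: lambda n, z, c, v: n != v,               # LT
    0b1100: lambda n, z, c, v: (not z) and n == v,   # GT
    0b1101: lambda n, z, c, v: z or n != v,          # LE
    0b1110: lambda n, z, c, v: True,                 # AL
}


def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with condition field `cond` executes."""
    if cond == 0b1111:
        raise ValueError("condition 1111 is reserved")
    return bool(CONDITION_TESTS[cond](n, z, c, v))
```

An instruction whose condition evaluates to false is simply skipped, still consuming one cycle as noted above.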

3.5.2 Branch and exchange (BX)

The branch and exchange instruction accepts the indication of a processor register containing the address of the branch destination. During instruction execution this address is copied into the program counter, so that, at the subsequent clock (MCLK) rising edge, the instruction contained at the branch destination is fetched. If the condition is valid, the branch operation causes a pipeline flush and a refilling, so the instructions contained in the fetch and decode pipeline registers are removed (NOPs5 are inserted). If the condition is not valid the processor does not execute the branch and waits for the next clock cycle to execute the subsequent instruction. This instruction also permits the instruction set exchange: by inspecting the value of the least significant bit of Rn (Rn[0]), the processor determines whether the current state must be switched to ARM or THUMB state.

Table 3.3. Condition code summary

Code  Suffix  Flags                         Meaning
0000  EQ      Z set                         equal
0001  NE      Z clear                       not equal
0010  CS      C set                         unsigned higher or same
0011  CC      C clear                       unsigned lower
0100  MI      N set                         negative
0101  PL      N clear                       positive or zero
0110  VS      V set                         overflow
0111  VC      V clear                       no overflow
1000  HI      C set and Z clear             unsigned higher
1001  LS      C clear or Z set              unsigned lower or same
1010  GE      N equals V                    greater or equal
1011  LT      N not equal to V              less than
1100  GT      Z clear AND (N equals V)      greater than
1101  LE      Z set OR (N not equal to V)   less than or equal
1110  AL      (ignored)                     always

The assembly syntax is:

BX{cond} Rn

where the conditional (cond) mnemonic is optional. The coding format is reported in figure 3.6, where the 4-bit fields for condition and register are shown; all the other bits are used only for the instruction encoding. The instruction takes three clock cycles to execute: the first is non-sequential because of the jump to a new address, the remaining two are sequential cycles.
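As a minimal sketch of the state selection performed by BX, the following Python function models the effect on the program counter and on the operating state. The function name is hypothetical; the convention that Rn[0] = 1 selects THUMB state and Rn[0] = 0 selects ARM state is the standard ARM architecture behaviour, which the text above does not spell out.

```python
def branch_exchange(rn):
    """Model of BX Rn: return (new_pc, thumb_state).

    Assumption: Rn[0] = 1 selects THUMB state and Rn[0] = 0 selects ARM
    state, as in the standard ARM architecture definition.
    """
    rn &= 0xFFFFFFFF
    thumb_state = bool(rn & 1)   # bit 0 is only a state selector
    new_pc = rn & 0xFFFFFFFE     # the fetch address never carries bit 0
    return new_pc, thumb_state
```

For example, `branch_exchange(0x8001)` yields the fetch address `0x8000` with THUMB state selected.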

5 NOP stands for No OPeration: it is an instruction that does not perform anything, leaving the processor state unchanged but consuming a clock cycle.



3.5.3 Branch and branch with link (B-BL)

Branch and branch with link instructions perform the same operation, with the difference that the latter stores the address of the following instruction in the link register, so that execution can be resumed from the point where the jump occurred. The branch destination address is relative to the instruction address and is obtained by adding a properly calculated offset to the PC. The signed 2’s complement 24-bit offset, expressed in the instruction assembly, is shifted left by two bits and sign extended to 32 bits; this quantity is added to the program counter to obtain the next instruction fetch address. Through the immediate offset, the instruction can specify a branch of 32 Mbytes ahead or back with respect to the PC. Branches beyond this range must use an absolute destination previously loaded into a register, so the destination must be prepared by some operations performed in advance. In this case the PC should be manually saved in the link register (R14) if a branch with link type operation is required. The branch offset must take into account the prefetch operation, which causes the PC to be two words ahead of the current instruction.
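The offset arithmetic just described can be checked with a short sketch. It is an illustrative Python model, not LISARM code; `branch_target` and `encode_branch_offset` are hypothetical names, and the constant 8 reflects the PC being two words ahead of the branch instruction.

```python
def branch_target(instr_addr, offset24):
    """Destination of B/BL located at instr_addr with a 24-bit offset field."""
    offset24 &= 0xFFFFFF
    if offset24 & 0x800000:            # sign extend the 24-bit field
        offset24 -= 1 << 24
    pc = instr_addr + 8                # prefetch: PC is two words ahead
    return (pc + (offset24 << 2)) & 0xFFFFFFFF


def encode_branch_offset(instr_addr, target):
    """Inverse operation, as an assembler would perform it."""
    delta = target - (instr_addr + 8)
    if delta % 4 != 0:
        raise ValueError("target is not word aligned")
    words = delta >> 2
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("target beyond the +/-32 Mbyte range")
    return words & 0xFFFFFF
```

A branch to itself is therefore encoded with an offset of -2 words (`0xFFFFFE`), since the PC has already advanced by eight bytes.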

By introducing the suffix “L” in the assembly instruction, the branch with link (BL) operation is obtained, so that the processor writes the old PC into the link register (R14). The PC value is written into R14 when the instruction enters the execution stage and so must be adjusted to account for the prefetch operation: at that point it does not contain the address of the instruction following the branch instruction, but a value four bytes ahead. The correction is performed via a subtraction which uses the processor ALU, so the operation cannot be done immediately, because the ALU is engaged in calculating the branch destination. The operation causes a jump to a different fetch address, so it takes three clock cycles to complete. During the first non-sequential memory cycle the new fetch address is determined; two sequential cycles for the pipeline refilling follow. If the link option is enabled, the processor stores the current PC in LR during the first clock cycle and corrects its value in the last cycle.

It is important to underline that the CPSR is not saved by the instruction, so it may be necessary to insert a proper instruction to save it before the branch is taken. To return from a routine called by branch with link, a move instruction can be used to copy the link register into the PC if its value is still valid, or a block data transfer instruction if the link register has been saved onto a stack pointed to by a register.

The assembly syntax is:


B{L}{cond} <expression>

where the conditional mnemonic and the link label (L) are optional. The field reported as <expression> can be an immediate offset expressed via a preceding “#” character or a symbolic expression including assembly code labels; the calculation of the right offset is left to the assembler. The coding format is reported in figure 3.6, where the 4-bit condition field and the 24-bit offset field are shown; all the other bits are used only for the instruction encoding.

3.5.4 Data processing instructions

This class of operations produces a result by performing a specified arithmetic or logical operation on one or two operands. To make the coding and functionality of these instructions easier to follow, the coding scheme is reported in figure 3.7. A 4-bit field within the coding selects the operation to be performed, as can be seen in table 3.4. The same table reports the action executed by the ALU to obtain the result. Besides the well known operations, the bit clear deserves a mention: it is a useful logical operation which permits easy masking of operands. The MOV instruction is also counted among these operations, since it can involve a data modification performed by the barrel shifter and not only a copy operation.

Figure 3.7. Data processing instructions coding

Table 3.4. Data processing operations summary

Mnemonic  OpCode  Action
AND       0000    operand1 AND operand2
EOR       0001    operand1 EOR operand2
SUB       0010    operand1 - operand2
RSB       0011    operand2 - operand1
ADD       0100    operand1 + operand2
ADC       0101    operand1 + operand2 + carry
SBC       0110    operand1 - operand2 + carry - 1
RSC       0111    operand2 - operand1 + carry - 1
TST       1000    as AND, but result is not written
TEQ       1001    as EOR, but result is not written
CMP       1010    as SUB, but result is not written
CMN       1011    as ADD, but result is not written
ORR       1100    operand1 OR operand2
MOV       1101    operand2 (operand1 is ignored)
BIC       1110    operand1 AND NOT operand2 (bit clear)
MVN       1111    NOT operand2 (operand1 is ignored)

The data processing operations can be classified as logical (AND, ORR, EOR, TST, TEQ, BIC, MOV, MVN) or arithmetic (ADD, ADC, SUB, SBC, RSB, RSC, CMP, CMN). The logical operations perform the relevant operation on all corresponding bits of the operand or operands to produce the result; the test and compare operations (TST, TEQ, CMP, CMN) do not write their result to a register.

If the “S” label is not expressed in the instruction, the data processing operations do not affect the CPSR flags; otherwise the flags are modified as follows:

• The V-flag is unaffected by logical operations, but it is set if the ALU detects an overflow during an arithmetic operation.

• The C-flag is set to the carry out of bit 31 of the ALU during an arithmetic operation (in a subtraction a borrow is therefore signalled by a clear carry, consistently with the SBC and RSC actions in table 3.4); it is equal to the last bit shifted out by the barrel shifter if a logical operation is performed.

• The Z-flag is set only if the result is all zeros.

• The N-flag is always equal to bit 31 of the result, representing the sign of the obtained value.



For operations which do not write a result, the CPSR flag update is implicit, even if it is not expressed in the instruction.
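The flag rules for arithmetic operations can be illustrated with a small Python sketch. The helper names are hypothetical and the model is only a reading aid. It assumes the standard ARM convention, consistent with the SBC action in table 3.4, that a subtraction is computed as operand1 + NOT operand2 + 1, so that a borrow leaves the carry flag clear.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b, carry_in=0):
    """Return (result, (N, Z, C, V)) for a 32-bit addition."""
    a &= MASK32
    b &= MASK32
    total = a + b + carry_in
    result = total & MASK32
    n = bool(result >> 31)               # sign bit of the result
    z = result == 0                      # result is all zeros
    c = total > MASK32                   # carry out of bit 31
    # Overflow: both operands have the same sign and the result differs.
    v = bool((a ^ result) & (b ^ result) & 0x80000000)
    return result, (n, z, c, v)


def sub_with_flags(a, b):
    """a - b computed as a + NOT(b) + 1 (assumed ARM convention)."""
    return add_with_flags(a, ~b & MASK32, 1)
```

With this model, `sub_with_flags(5, 5)` gives a zero result with Z and C set, which is why the CS suffix reads as "unsigned higher or same" after a compare.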

The “operand 2” field can be expressed in two different ways: by an immediate operand or via a suitably shifted register value. The immediate operand is expressed by using an 8-bit unsigned value and a 4-bit unsigned integer which specifies a rotation operation on the immediate value. The 8-bit value is zero extended to 32 bits and then rotated right by twice the value in the rotate field. This operation is performed by the barrel shifter and enables many common constants to be generated, e.g. all powers of two.
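The immediate operand construction can be sketched as follows. This is an illustrative Python model with hypothetical function names; `encode_immediate` mimics what an assembler might do when asked to generate a constant, returning `None` when no 8-bit value and even rotation reproduce it.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    value &= 0xFFFFFFFF
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def decode_immediate(rotate4, imm8):
    """Zero extend imm8 and rotate it right by twice the rotate field."""
    return ror32(imm8 & 0xFF, 2 * (rotate4 & 0xF))


def encode_immediate(value):
    """Return (rotate4, imm8) reproducing value, or None if impossible."""
    for rotate4 in range(16):
        # Rotating left by 2*rotate4 undoes the rotate right of the decoder.
        imm8 = ror32(value, 32 - 2 * rotate4)
        if imm8 <= 0xFF:
            return rotate4, imm8
    return None
```

Every power of two is representable, whereas a constant such as 0x101, whose set bits span nine positions, is not.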

The other method to express the second operand takes the value from the specified register and performs a barrel shifter operation which is controlled by the “shift” field in the instruction. This field indicates the type of shift to be performed, i.e. logical shift left (LSL), logical shift right (LSR), arithmetic shift right6 (ASR) or rotate right (ROR). Arithmetic shift left (ASL) and logical shift left (LSL) represent the same operation (the assembler produces the same code for both).

The amount by which the register should be shifted may be contained in an immediate field in the instruction or in the least significant byte of another register (see figure 3.8).

When the shift amount is specified in the instruction, it is contained in a 5-bit field, which accepts any value from 0 to 31; this method is referred to as instruction specified shift amount.

Figure 3.8. ARM shift operations coding

Some special cases are provided, in order to obtain the coding of particular barrel shifter operations: shift and rotate operations by a null amount (which do not modify the operand) are not coded as such, and the encodings freed in this way are exploited for other operations. The LSL#0 operation directly uses the contents of Rm as the second operand, and the shifter carry out is the old value of the CPSR C-flag (from the previous CPU cycle). The form of the shift field which might be expected to correspond to LSR#0 is used to encode LSR#32, which has a zero result with bit 31 of Rm as the carry output. The logical shift right by zero is redundant, as it is the same as logical shift left by zero, so the assembler will convert LSR#0 (and ASR#0 and ROR#0) into LSL#0, and allow LSR#32 to be specified. The form of the shift field which might be expected to give ASR#0 is used to encode ASR#32. Bit 31 of Rm is again used as the carry output, and each bit of the second operand is also equal to bit 31 of Rm. The result is therefore all ones or all zeros, according to the value of bit 31 of Rm. The form of the shift field which might be expected to give ROR#0 is used to encode a special function of the barrel shifter, called rotate right extended (RRX). This is a rotate right by one bit position of the 33-bit quantity formed by appending the CPSR C-flag to the most significant end of the contents of Rm.

6 The difference between logical and arithmetic shift right is that a logical shift inserts only zeros from the MSB, whereas an arithmetic shift inserts copies of the sign bit in order to maintain the operand 2’s complement notation.

The last way to express the shift type and amount within a data processing instruction is the register specified shift amount, which uses the least significant byte of Rs to store the shift amount. If this byte is zero, the unchanged contents of Rm will be used as the second operand, and the old value of the CPSR C-flag will be passed to the shifter carry output. If the byte contains a value between 1 and 31, the shifted result is obtained by a shift operation of the same amount. If the byte value is 32 or more, the result is a logical extension of the shift operations described above, according to the following rules:

• LSL by 32 has result zero and carry out equal to bit 0 of Rm.

• LSL by more than 32 has result zero and carry out zero.

• LSR by 32 has result zero and carry out equal to bit 31 of Rm.

• LSR by more than 32 has result zero and carry out zero.

• ASR by 32 or more has result filled with bit 31 of Rm and carry out equal to bit 31 of Rm.

• ROR by 32 has result equal to Rm and carry out equal to bit 31 of Rm.

• ROR by n, where n is greater than 32, will give the same result and carry out as ROR by n-32; therefore repeatedly subtract 32 from n until the amount is in the range 1 to 32 (a masking operation can simplify this reduction).
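The rules listed above for the register specified shift amount can be collected in one behavioural sketch. It is an illustrative Python model with hypothetical names, covering only the four shift types selectable with a register amount.

```python
MASK32 = 0xFFFFFFFF


def shift_by_register(op, rm, rs, carry_in):
    """Return (result, carry_out) for Rm shifted by the low byte of Rs."""
    rm &= MASK32
    amount = rs & 0xFF                   # only the least significant byte
    if amount == 0:
        return rm, carry_in              # operand and carry unchanged
    if op == "LSL":
        if amount > 32:
            return 0, 0
        # For amount == 32 this yields result 0 and carry = bit 0 of Rm.
        return (rm << amount) & MASK32, (rm >> (32 - amount)) & 1
    if op == "LSR":
        if amount > 32:
            return 0, 0
        # For amount == 32 this yields result 0 and carry = bit 31 of Rm.
        return rm >> amount, (rm >> (amount - 1)) & 1
    if op == "ASR":
        sign = rm >> 31
        if amount >= 32:
            return (MASK32 if sign else 0), sign
        signed = rm - (1 << 32) if sign else rm
        return (signed >> amount) & MASK32, (rm >> (amount - 1)) & 1
    if op == "ROR":
        amount = ((amount - 1) % 32) + 1  # reduce to the range 1..32
        if amount == 32:
            return rm, rm >> 31
        rotated = ((rm >> amount) | (rm << (32 - amount))) & MASK32
        return rotated, (rm >> (amount - 1)) & 1
    raise ValueError("unknown shift type")
```

The reduction of the rotate amount to the range 1 to 32 is the masking operation mentioned in the last rule.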

If the operation performed is in the logical class and the “S” flag is set, all the shift operations (rotate included) save the last bit shifted out in the CPSR C-flag.

The instruction cycle times must take into account some hardware implementation aspects. A normal data processing operation, which involves neither the program counter nor a third register for the second operand shift, takes only one sequential memory cycle. Because of the pipeline structure, if the destination register is R15 (the program counter), the pipeline flush and the subsequent refill must be taken into account, so the instruction needs two additional cycles to complete, one sequential and one non-sequential (a jump occurs). For reasons connected to the internal bus structure, the ALU cannot access three registers during the same CPU cycle, so a further internal cycle is needed if an operation has to read the shift amount from the register file.

The instruction syntax is quite complex and, owing to the nature of the instructions, it is differentiated by groups [19]. MOV and MVN accept a single operand, but it can be expressed in various ways:

<opcode>{cond}{S} Rd,<Op2>

where <opcode> is one of the mnemonics reported in table 3.4, cond is the (optional) conditional mnemonic reported in table 3.3, S is the optional label which forces the CPSR flags update, Rd is the destination register and <Op2> can be:

Rm{,<shift>} or <#immed8_r>

where Rm is an operand register, which can optionally be shifted using the assembly syntax <shiftop> <register> or <shiftop> <#expr>, or RRX. The field <shiftop> can be LSL (ASL), ASR, LSR or ROR, <register> represents the register containing the shift amount and <#expr> is an integer in the range {0, 31} which represents the immediate shift amount and must be expressed with the “#” symbol preceding the number. The field <#immed8_r> is another immediate expression, which the assembler will attempt to generate by using a shifted immediate 8-bit field as reported above. CMP, CMN, TEQ and TST are instructions which do not produce a result, so they do not accept a destination register; the assembly syntax is:

<opcode>{cond} Rn,<Op2>

where Rn is the first operand register. For these operations the CPSR flags update is implicit and can be omitted in the instruction syntax. The other operations (AND, EOR, SUB, RSB, ADD, ADC, SBC, RSC, ORR, BIC) accept the following syntax:

<opcode>{cond}{S} Rd,Rn,<Op2>

where both destination and first operand registers are expressed.

3.5.5 PSR transfer instructions

These operations are used to access the processor status registers CPSR and SPSR directly, to modify their values or to save them to general purpose registers. The instructions are formed from a subset of the data processing operations and are implemented using the data compare instructions without the “S” flag set. Their encoding is quite sophisticated and is shown in figure 3.9.

Figure 3.9. PSR transfer instructions coding

The MRS instruction allows the contents of the CPSR or SPSR_<mode> to be moved to a general register. The MSR instruction allows the contents of a general register to be moved to the CPSR or SPSR_<mode> register. The MSR instruction also allows an immediate value or register contents to be transferred to the condition code flags (N, Z, C and V) of CPSR or SPSR_<mode>, without affecting the control bits7. In this case, the top four bits of the specified register contents or of a 32-bit immediate value are written to the top four bits of the relevant PSR.

Some restrictions are imposed: in user mode the control bits of the CPSR are protected from change, hence only the condition flags can be changed; in all privileged modes, on the contrary, the entire CPSR can be changed. It is important to underline that software must never change the state of the T-bit in the CPSR directly8, but only by using a branch and exchange instruction. The SPSR register accessed depends on the processor operating mode during the instruction execution, and no SPSR is accessible in user mode, since no such register exists.

As discussed in paragraph 3.3.5, and in order to ensure compatibility with future processors, all reserved bits should be preserved when changing the value in a PSR. This means that a read-modify-write strategy should be used to alter the control bits of any PSR register: the content of the PSR register must be transferred to a general register using the MRS instruction, the relevant bits must be changed and then the modified value should be written back to the PSR register by using the MSR instruction.

PSR transfer instructions take only one sequential memory cycle to complete, because they perform the operations directly between internal registers and require no extra time for resource access.

The assembly syntax is as follows:

MRS{cond} Rd,<psr>         to transfer PSR content to Rd;
MSR{cond} <psr>,Rm         to transfer Rm content to PSR;
MSR{cond} <psrf>,Rm        to transfer Rm to PSR flags only;
MSR{cond} <psrf>,<expr>    to modify PSR flags by an immediate value.

The conditional mnemonic is optional, Rd and Rm are expressions evaluating to a register number, <psr> can be CPSR (also CPSR_all) or SPSR (also SPSR_all), and <psrf> must be CPSR_flg or SPSR_flg. From the <expr>, which must be expressed by using a “#” prefix and a 32-bit representable value, the assembler will attempt to generate a shifted immediate 8-bit field (the typical immed8_r form) matching the expression, returning an error message if this is impossible.
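The two kinds of PSR update, flags only and read-modify-write of selected bits, can be sketched with simple bit masks. The Python helpers below are hypothetical illustrations; the position of the I and F interrupt disable flags used in the usage note (bits 7 and 6) is the standard ARM CPSR layout and is an assumption with respect to this text.

```python
MASK32 = 0xFFFFFFFF
FLAG_MASK = 0xF0000000   # N, Z, C, V: the top four bits of a PSR


def msr_flags(psr, source):
    """Write only the top four bits of source into the PSR value."""
    return (psr & ~FLAG_MASK & MASK32) | (source & FLAG_MASK)


def modify_psr_bits(psr, mask, value):
    """Read-modify-write: change only the bits selected by mask,
    preserving every other bit (reserved bits included)."""
    return (psr & ~mask & MASK32) | (value & mask)
```

For instance, `modify_psr_bits(0x600000D3, 0xC0, 0x00)` clears the two interrupt disable bits and leaves the flags and the mode bits untouched, giving `0x60000013`.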

7 i.e. the other bits, such as the mode bits, the T-bit and the interrupt disable flags.
8 If this happens, the processor will enter an unpredictable state.

3.5.6 Multiply and multiply and accumulate (MUL-MLA)

The multiply (MUL) and multiply and accumulate (MLA) operations are performed internally by using Booth’s algorithm on groups of eight bits per cycle. The ARM7 datapath provides a high speed 32x8 multiplier [7], which is depicted in figure 3.10.

Figure 3.10. LISARM 32x8 multiplier

The instructions perform the multiplication between two 32-bit operands and the result is still stored in a 32-bit register. The method works on operands which may be considered as signed (in 2’s complement form) or unsigned integers, because the results of a signed and an unsigned multiply of 32-bit operands differ only in the upper 32 bits; the low 32 bits of the two results are identical. The coding format is reported in figure 3.6. Both operations use Rm as the multiplicand and Rs as the multiplier; the result is stored in Rd. The multiply instruction gives Rm ∗ Rs as result, while the third operand Rn is ignored and should be set to zero for compatibility with possible future upgrades of the instruction set. The Rn register plays a different role in the multiply and accumulate instruction, which gives Rd = Rm ∗ Rs + Rn, using Rn as accumulator. The instruction can save an explicit ADD instruction in some circumstances and it is very useful in many applications.

As for data processing operations, the CPSR flags update is optional for multiply instructions as well; the update is controlled by the “S” bit in the instruction. The N-flag is made equal to bit 31 of the result and the Z-flag is set if and only if the result is zero; the carry flag is set to a meaningless value and the overflow flag is unaffected.

The instruction syntaxes are:

MUL{cond}{S} Rd,Rm,Rs and MLA{cond}{S} Rd,Rm,Rs,Rn

where Rd is the destination register, Rm and Rs are the multiplicand and the multiplier respectively, and Rn is used as accumulator in the multiply and accumulate operation. The flag S, if present, enables the CPSR flags update.

Because of the algorithm used and of the bit width of the multiplier array (eight bits only), the instructions require a variable number of clock cycles to complete, depending on how many of the upper 8-bit groups of the multiplier operand are all zeros or all ones. Calling m the number of 8-bit multiplier array cycles required, MUL takes one sequential and m internal cycles and MLA takes one sequential and (m+1) internal cycles to execute. More precisely, m is:

• One if bits [31:8] of the multiplier operand are all zero or all one.

• Two if bits [31:16] of the multiplier operand are all zero or all one.

• Three if bits [31:24] of the multiplier operand are all zero or all one.

• Four in all other cases.
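The early termination rule above translates directly into a short sketch. The Python function names are hypothetical and the model only counts cycles; it does not reproduce the Booth datapath.

```python
MASK32 = 0xFFFFFFFF


def mul_array_cycles(rs):
    """Number m of 8-bit multiplier array cycles for multiplier operand Rs."""
    rs &= MASK32
    for m, shift in ((1, 8), (2, 16), (3, 24)):
        upper = rs >> shift              # bits [31:shift] of the multiplier
        if upper == 0 or upper == (MASK32 >> shift):
            return m                     # upper bits all zero or all one
    return 4


def mul_internal_cycles(rs, accumulate=False):
    """Internal cycles: m for MUL, m + 1 for MLA (one S cycle is added)."""
    return mul_array_cycles(rs) + (1 if accumulate else 0)
```

Small multipliers, positive or negative, therefore terminate after a single array cycle, which suggests placing the operand with the smaller magnitude in Rs.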

3.5.7 Multiply and multiply and accumulate long (MULL-MLAL)

By using the same algorithm and method discussed in section 3.5.6, the multiply long and multiply and accumulate long instructions perform integer multiplication on two 32-bit operands and produce 64-bit results. Signed and unsigned operands are accepted in this case too, but to obtain the correct multiplication result the right instruction must be used. The UMULL and UMLAL instructions treat all of their operands as unsigned binary numbers and write an unsigned 64-bit result. The SMULL and SMLAL instructions treat all of their operands as 2’s complement signed numbers and write a 2’s complement signed 64-bit result.

The multiply forms (UMULL and SMULL) take two 32-bit numbers and multiply them to produce a 64-bit result in the form (RdHi,RdLo) = Rm ∗ Rs. The lower 32 bits of the 64-bit result are written to RdLo, the upper 32 bits of the result are written to RdHi.
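The difference between the signed and unsigned forms can be made concrete with a brief Python sketch (hypothetical helper names, returning the pair (RdLo, RdHi) in the order used by the assembly syntax).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a 2's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def umull(rm, rs):
    """Unsigned 32x32 -> 64 multiply; returns (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32


def smull(rm, rs):
    """Signed 32x32 -> 64 multiply; returns (RdLo, RdHi)."""
    product = (to_signed32(rm) * to_signed32(rs)) & MASK64
    return product & MASK32, product >> 32
```

Multiplying 0xFFFFFFFF by itself gives RdHi = 0xFFFFFFFE with UMULL and RdHi = 0 with SMULL, while RdLo is 1 in both cases: the two results differ only in the upper word.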



The multiply-accumulate forms (UMLAL and SMLAL) take two 32-bit numbers, multiply them and add a 64-bit number to produce a 64-bit result in the form (RdHi,RdLo) = Rm ∗ Rs + (RdHi,RdLo). The lower 32 bits of the 64-bit number to add are read from RdLo and the upper 32 bits are read from RdHi. The lower 32 bits of the 64-bit result are written to RdLo and the upper 32 bits are written to RdHi.

As for the data processing operations, the CPSR flags update is optional for these multiply instructions as well; the update is controlled by the “S” bit in the instruction. The N-flag is made equal to bit 63 of the result and the Z-flag is set if and only if the result is zero; the carry flag is set to a meaningless value and the overflow flag is unaffected.

The instruction syntaxes are:

UMULL{cond}{S} RdLo,RdHi,Rm,Rs

UMLAL{cond}{S} RdLo,RdHi,Rm,Rs

SMULL{cond}{S} RdLo,RdHi,Rm,Rs

SMLAL{cond}{S} RdLo,RdHi,Rm,Rs

The number of cycles necessary to perform the operations can be determined by the same considerations explained for the MUL and MLA instructions (section 3.5.6). The multiply long instructions (UMULL, SMULL) take one sequential and (m+1) internal cycles and the multiply and accumulate long instructions (UMLAL, SMLAL) take one sequential and (m+2) internal cycles to execute, where m is the number of 8-bit multiplier array cycles required to complete the multiply, which is controlled by the value of the multiplier operand specified by Rs. Possible values of m are, for the signed instructions SMULL, SMLAL:

• One if bits [31:8] of the multiplier operand are all zero or all one.

• Two if bits [31:16] of the multiplier operand are all zero or all one.

• Three if bits [31:24] of the multiplier operand are all zero or all one.

• Four in all other cases.


and for unsigned instructions UMULL, UMLAL:

• One if bits [31:8] of the multiplier operand are all zero.

• Two if bits [31:16] of the multiplier operand are all zero.

• Three if bits [31:24] of the multiplier operand are all zero.

• Four in all other cases.
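The cycle count of the long multiplies can be sketched in the same way as for MUL and MLA. The Python names below are hypothetical; the fall-through value of four array cycles mirrors the MUL case of section 3.5.6 and is an assumption where the lists above do not state it explicitly.

```python
MASK32 = 0xFFFFFFFF


def long_array_cycles(rs, signed):
    """Number m of array cycles for SMULL/SMLAL (signed) or UMULL/UMLAL."""
    rs &= MASK32
    for m, shift in ((1, 8), (2, 16), (3, 24)):
        upper = rs >> shift
        # All-one upper bits terminate early only for the signed forms.
        if upper == 0 or (signed and upper == (MASK32 >> shift)):
            return m
    return 4  # assumed, by analogy with MUL/MLA


def long_internal_cycles(rs, signed, accumulate=False):
    """Internal cycles: m + 1 for the multiply long forms, m + 2 with accumulate."""
    return long_array_cycles(rs, signed) + (2 if accumulate else 1)
```

A small negative multiplier such as 0xFFFFFF80 thus needs one array cycle in SMULL but the full four in UMULL, where it is a large unsigned number.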




3.5.8 Single data transfer operations (LDR-STR)

The single data transfer operations are a group of powerful instructions which allow many addressing modes for memory data access. As ARM7 implements a load/store architecture, these instructions, together with the block data transfer operations, are the only way to access data stored in memory. Both instructions transfer data only between a general purpose register and an external memory location, and the data can be word sized or a single byte. The memory address used in the transfer is calculated by adding an offset to, or subtracting it from, a specified base register. If auto-indexing is required, the result of this calculation may be saved into the base register by performing a writeback operation. The coding scheme is shown in figure 3.11.

Figure 3.11. Single data transfer instructions coding

The offset can be expressed via a 12-bit unsigned binary immediate value or by using a suitably shifted register value (in the same way as for the data processing instructions, but a register specified shift amount is not allowed). By default the offset is added to the base register Rn; to subtract it, a “-” must be introduced in the instruction syntax before the offset register or immediate value. The binary coding uses the “U” flag to establish whether the base register address must be incremented or decremented. The offset modification may be performed either before (pre-indexed) or after (post-indexed) the base is used as transfer address, through a proper use of the square brackets in the assembly syntax; in the coding, the pre or post indexing condition is expressed by the “P” flag. The “W” bit gives optional auto increment and decrement addressing modes by activating the writeback operation, although this bit is redundant in post-indexed data transfers, which always write back the modified base. In post-indexed transfers the bit instead has a particular meaning in privileged mode, because it forces non-privileged mode for the transfer, allowing the operating system to generate a user address in a system where the memory management hardware makes suitable use of this information.

To request a byte sized transfer, the “B” label must be appended to the assembly mnemonic, so that the processor can signal the data size to the memory system. In these operations the endianness configuration plays a fundamental role, because byte load operations expect the data on specific lines of the bus, which differ between the big and little endian configurations. The store operation, on the other hand, repeats the same byte on all the 8-bit groups of lines and, by reading the dedicated memory interface signals, the memory management unit has to determine the right location for the transfer. Further details can be found in [16], where the complex mechanisms used in non-aligned data accesses are described. Some information about the processor behavior is given in the memory interface paragraph (3.7) and the fundamental endianness schemes are reported in section 3.3.2.

If an error occurs during a memory access, the MMU signals this situation to the processor by using the ABORT line and the abort exception handling procedure must be activated. The actions performed by the processor to avoid entering an unrecoverable state are discussed in paragraph 3.4, where the data abort trap is described.

The store operation takes two non-sequential memory cycles to complete: in the first cycle the address calculation is performed by the ALU and in the subsequent clock cycle the data is stored. The load operation takes an internal, a non-sequential and a sequential memory cycle to complete: in the first cycle the address calculation is performed by the ALU, in the subsequent clock cycle the data address is supplied to the MMU and in the third cycle the data is sampled on the data bus and registered. Since the destination register can also be the PC, in that case a sequential and a non-sequential memory cycle must be added, in order to perform the pipeline refilling (a pipeline flush is also required). If the writeback operation is required, the base register is updated in the second cycle in both instructions.



The assembly syntax for these instructions is quite sophisticated, because of the various addressing modes available. The general form is the following:

<LDR|STR>{cond}{B}{T} Rd,<address>

where cond is the conditional field, Rd is the destination or the source register, and B must be specified for a byte access. The label T, if present, forces non-privileged mode for the transfer cycle, but it is not allowed when a pre-indexed addressing mode is specified or implied. The <address> part discriminates between the various types of addressing modes, which are described in depth in [19]. The target location can be PC relative, in which case the <address> field is an expression representing an offset that the assembler uses together with the PC value to generate a pre-indexed mode address. Zero offset addressing can be expressed by using [Rn] in the <address> field, and the other pre-indexing modes use one of the following formats:

[Rn,<#expression>]{!} or [Rn,{+/-}Rm{,<shift>}]{!}

where <#expression> is an immediate offset and “!”, if present, requests the writeback operation. The second form accepts an optional <shift> field, which is applied to the register content and can be expressed by using the notation discussed in the data processing section (3.5.4). The post-indexing modes can be specified by:

[Rn],<#expression> or [Rn],{+/-}Rm{,<shift>}

where the writeback operation is implicit and must not be requested in the assembly instruction. In all the previous notations the square brackets contain the base register, together with the offset specification in the pre-indexed forms only.
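The interplay of the P, U and W controls in the address calculation can be summarised by the following sketch. It is an illustrative Python model with a hypothetical function name; the offset is assumed to be already evaluated (immediate or shifted register) as an unsigned quantity.

```python
MASK32 = 0xFFFFFFFF


def transfer_address(base, offset, pre_indexed, up, writeback):
    """Return (address used for the transfer, base register afterwards)."""
    modified = (base + offset if up else base - offset) & MASK32
    if pre_indexed:
        # The modified base addresses the memory; it is kept only with "!".
        return modified, (modified if writeback else base)
    # Post-indexed: the unmodified base is the address and the modified
    # value is always written back, whatever the writeback request.
    return base, modified
```

For example, a pre-indexed access with offset 4 from base 0x1000 and no writeback uses address 0x1004 and leaves the base at 0x1000, whereas the post-indexed form uses address 0x1000 and leaves the base at 0x1004.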

3.5.9 Halfword and signed data transfer operations

These instructions load or store half-words of data and also load bytes or half-words which represent signed values and so need to be sign extended to 32-bit data. The instructions accept the same addressing modes seen for the ordinary LDR and STR operations (section 3.5.8), with small differences in the assembly syntax. The encoding scheme is reported in figure 3.12, where some further flags, with respect to the ordinary instructions, are evident.

The“S”flag is used to discern between signed and unsigned data and the “H”flagtells what size is the data to transfer. The same letters must be opportunely ex-pressed in the instruction syntax to allow the correct interpretation of signed valuesand data sizes. The result of the transfer operation is the sign extension to a 32-bitvalue for a signed data type or a zero filling from the left end for unsigned values.The assembly syntax is:

<LDR|STR>{cond}<H|SH|SB> Rd,<address>



Figure 3.12. Halfword and signed data transfer instructions coding

where H requests the transfer of a halfword quantity, SB loads a sign-extended byte and SH loads a sign-extended halfword. The last two suffixes are valid only for the LDR operation, and all the other fields follow the standard LDR and STR assembly scheme. The considerations about the abort exception and the instruction cycle times are also the same as those already seen in section 3.5.8.
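The sign extension and zero filling performed on loaded sub-word data can be sketched as follows. This is an illustrative model only; the function name is an assumption and does not appear in the processor documentation.

```python
def extend_loaded_value(raw, size_bits, signed):
    """Extend a byte (8) or halfword (16) read from memory to 32 bits.

    Hypothetical helper modelling LDRB/LDRH (zero fill) and
    LDRSB/LDRSH (sign extension).
    """
    low_mask = (1 << size_bits) - 1
    raw &= low_mask                        # keep only the transferred bits
    if signed and raw & (1 << (size_bits - 1)):
        raw |= 0xFFFFFFFF & ~low_mask      # replicate the sign bit upwards
    return raw
```

For instance, the byte 0x80 becomes 0xFFFFFF80 when loaded with the SB suffix, while it stays 0x00000080 when loaded as an unsigned byte.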

3.5.10 Block data transfer operations (LDM-STM)

The block data transfer operations load (LDM) and store (STM) the complete set of the general purpose registers, or a user-defined subset, from and to memory. The assembly syntax provides several ways to express the register list, which in the binary representation is encoded as a set of sixteen flags telling whether each register must be transferred or not. The instruction supports all the possible stacking modes: starting from the base register value, the memory address can be pre/post incremented/decremented, so that the stack can grow up or down in the memory space. The operation is useful whenever the content of the processor registers must be saved on a stack in order to pass control to a subroutine or to another process, and it saves many memory cycles with respect to single data transfer operations. In the coding format (figure 3.6) many flags are reported; their meaning is the same as discussed in the single data transfer section 3.5.8 for the addressing modes and the writeback operation. For these operations an immediate offset cannot be expressed and byte data transfer is not allowed. The “S” flag has a particular meaning and is intended for use in privileged modes. The instructions are usually performed on the register bank of the current mode but, if register R15 is in the list and the “S” flag is set, the LDM operation transfers SPSR_<mode> to the CPSR at the same time as the PC is transferred⁹ (this action causes a mode change). The STM instruction, in the same condition, transfers the user bank registers instead of the current mode bank registers. This behavior is useful for switching between processes, when the user state must be frozen. If R15 is not in the list, both instructions transfer the user bank registers rather than the current mode registers (the behavior of STM does not change with respect to the previous condition).

In the presence of memory access errors the behavior is the same as discussed for single data transfer operations (section 3.5.8), with some differences. The STM instruction writes back the base register if required, hence recovery from this situation must be performed by software. The LDM instruction, after the abort occurs, does not update the remaining registers and preserves the PC content, ensuring recoverability. The base register content is restored, so that the load operation can be retried under the control of the abort handler.


The assembly syntax is:

<LDM|STM>{cond}<FD|ED|FA|EA|IA|IB|DA|DB> Rn{!},<Rlist>{^}

where Rn is the base register and <Rlist> is the register list, which can be expressed by separating register names with commas or by grouping them with “-”, so that all the included registers are transferred (e.g. {R0,R2-R7,R10}). The “^” symbol, if present, sets the “S” bit, in order to load the CPSR along with the PC or to force a user bank transfer when in privileged mode. The “!” symbol, if present, requests the writeback operation. The addressing modes have different mnemonics and names, which are reported in table 3.5.
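As an illustration of the register list encoding and of the four increment/decrement modes, the following Python sketch computes the 16-bit list field and the word addresses touched by a block transfer. Both function names are hypothetical; the sketch assumes that the lowest-numbered register always goes to the lowest address and ignores 32-bit address wrap-around.

```python
def encode_register_list(registers):
    """Build the 16-bit register-list field: bit i set means Ri is transferred."""
    mask = 0
    for r in registers:
        mask |= 1 << r
    return mask

def block_transfer_addresses(base, registers, pre, up):
    """Return (addresses, new_base) for a block transfer.

    'addresses' holds one word address per listed register, in ascending
    order; 'new_base' is the value written back to Rn when "!" is used.
    """
    n = len(set(registers))
    if up:
        start = base + 4 if pre else base                  # IB / IA
        new_base = base + 4 * n
    else:
        start = base - 4 * n if pre else base - 4 * n + 4  # DB / DA
        new_base = base - 4 * n
    return [start + 4 * i for i in range(n)], new_base

# {R0,R2-R7,R10} corresponds to the list field 0x04FD.
# STMFD (= STMDB) R13!,{R0-R2} with R13 = 0x1000 writes to
# 0xFF4, 0xFF8, 0xFFC and leaves R13 = 0xFF4.
```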

⁹ Registers are transferred in order from R0 to R15, so the PC is always the last of them.

Table 3.5. Block data transfer addressing mode names

Name                   Stack   Other   L bit   P bit   U bit
pre-increment load     LDMED   LDMIB   1       1       1
post-increment load    LDMFD   LDMIA   1       0       1
pre-decrement load     LDMEA   LDMDB   1       1       0
post-decrement load    LDMFA   LDMDA   1       0       0
pre-increment store    STMFA   STMIB   0       1       1
post-increment store   STMEA   STMIA   0       0       1
pre-decrement store    STMFD   STMDB   0       1       0
post-decrement store   STMED   STMDA   0       0       0

The store operation takes two non-sequential and n−1 sequential memory cycles to complete: in the first cycle the address calculation is performed by the ALU and in the subsequent clock cycles the n listed registers are stored. The load operation takes an internal, a non-sequential and n sequential memory cycles to complete: in the first cycle the address calculation is performed by the ALU and in the subsequent clock cycles the data addresses are supplied to the MMU for the loading of the n listed registers. After the first two cycles, the processor starts sampling the data bus and the values read are registered. Since the destination register can also be the PC, in that case a sequential and a non-sequential memory cycle must be added, in order to perform the pipeline flush and refill. If the writeback operation is required, the base register is updated in the second cycle by both instructions.
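The cycle counts stated above can be collected in a small Python sketch, which may be handy when estimating the cost of a register save or restore sequence. The function names and the dictionary layout (N, S, I for non-sequential, sequential and internal cycles) are assumptions made for illustration.

```python
def stm_cycles(n):
    """Cycles for STM of n registers: two non-sequential, n-1 sequential."""
    return {"N": 2, "S": n - 1, "I": 0}

def ldm_cycles(n, loads_pc=False):
    """Cycles for LDM of n registers: one internal, one non-sequential,
    n sequential; loading the PC adds one sequential and one
    non-sequential cycle for the pipeline refill."""
    cycles = {"N": 1, "S": n, "I": 1}
    if loads_pc:
        cycles["N"] += 1
        cycles["S"] += 1
    return cycles
```

Under these figures, restoring four registers including the PC costs eight cycles in total, against six when the PC is not in the list.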

3.5.11 Single data swap (SWP)

The data swap instruction is used to swap a byte or word quantity between a register and external memory. The instruction is implemented as a memory read followed by a memory write, which are “locked” together: the processor cannot be interrupted until both operations have completed, which makes the instruction particularly useful for implementing software semaphores. A semaphore is a protected variable which represents the classic method for restricting access to shared resources (e.g. storage) in a multiprogramming environment. To prevent the memory content from being modified during the swap operation, the MMU is warned to treat the two accesses as inseparable, so that memory access is not granted to other peripherals. The execution of a swap operation is signalled to the memory management unit through the LOCK processor output, which remains high until the operation completes. A byte quantity can be swapped by specifying the “B” suffix in the instruction. The SWP instruction behaves as an LDR followed by an STR, whose actions are described in the relevant section (3.5.8).

If the address used for the operation is unacceptable to a memory management system, the memory manager can flag the problem by driving the ABORT signal high. This can happen during the read cycle, the write cycle, or both; in either case the data abort trap will be taken (paragraph 3.4). The system software is expected to resolve the cause of the problem, so that the instruction can be restarted and the program execution continued.


SWP{cond}{B} Rd,Rm,[Rn]



where the conditional mnemonic and the byte transfer suffix (B) are optional, Rd is the destination register, Rm the source register (which can be the same as the destination register) and Rn is the base register, i.e. the register containing the address for the memory access.

The swap instruction takes four clock cycles to complete: the first cycle is used to access the base register (internal cycle); the memory access is performed during the following two cycles, the first of which reads the memory location (non-sequential cycle) and the second writes the same location (sequential cycle). The last cycle, a non-sequential one, is needed to store the read data into the processor register.
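The use of SWP for a software semaphore can be sketched in Python as follows. The memory is modelled as a plain dictionary and all names are hypothetical; the atomicity that the real instruction obtains through the LOCK signal is simply assumed here, since nothing can interleave between the two statements of the model.

```python
LOCKED, FREE = 1, 0

def swp(memory, address, rm_value):
    """Model of SWP Rd,Rm,[Rn]: read the old word, write Rm, return Rd."""
    old = memory[address]          # locked read
    memory[address] = rm_value     # locked write to the same location
    return old

def try_acquire(memory, sem_address):
    """Binary semaphore: the caller owns it only if it was FREE before the swap."""
    return swp(memory, sem_address, LOCKED) == FREE

def release(memory, sem_address):
    """Give the semaphore back with an ordinary store."""
    memory[sem_address] = FREE
```

A task that fails to acquire the semaphore has written LOCKED over LOCKED, so the shared variable is left unchanged and the attempt can simply be repeated later.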

3.5.12 Software interrupt

The instruction causes the software interrupt trap to be taken and is used to enter supervisor mode. The instruction saves the program counter into the banked link register R14_svc and then forces the PC to the value fixed by the exception vectors (section 3.2). In order to allow the present state of the processor to be restored, the CPSR content is saved into the SPSR_svc register. The return instruction, as seen in section 3.4, restores both the status register and the program counter in an atomic operation. The coding format is reported in figure 3.6, where the 24-bit comment field represents a group of bits ignored by the processor. The operation causes a jump to a different fetch address, so it takes three clock cycles to complete: a non-sequential memory cycle and two sequential cycles for the pipeline refilling. After these cycles, control is passed to the exception handler.
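The state changes performed on a software interrupt can be summarized by the following Python sketch. The numeric constants (vector address 0x08, supervisor mode bits 10011, positions of the I and T bits in the CPSR) come from the ARM architecture definition rather than from this section, and the dictionary-based processor state is a hypothetical simplification.

```python
SWI_VECTOR = 0x00000008        # software interrupt exception vector
MODE_SVC = 0b10011             # CPSR mode bits for supervisor mode
I_BIT, T_BIT = 1 << 7, 1 << 5  # IRQ disable and Thumb state bits

def is_swi(instr):
    """An ARM instruction is a SWI when bits 27:24 are all ones."""
    return (instr >> 24) & 0xF == 0xF

def swi_comment(instr):
    """The low 24 bits are ignored by the processor (comment field)."""
    return instr & 0x00FFFFFF

def enter_swi(state, return_address):
    """Apply the exception entry sequence to a dict-based state."""
    state["R14_svc"] = return_address        # save the return address
    state["SPSR_svc"] = state["CPSR"]        # save the status register
    cpsr = (state["CPSR"] & ~0x1F) | MODE_SVC
    state["CPSR"] = (cpsr | I_BIT) & ~T_BIT  # IRQs disabled, ARM state
    state["PC"] = SWI_VECTOR                 # jump to the exception vector
    return state
```

The handler typically reads the comment field from memory, through the saved link register, to select the requested service.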

3.5.13 Coprocessor instructions

The processor can be connected with up to sixteen coprocessors via the dedicated coprocessor interface, as explained in section 3.8.

Three classes of coprocessor instructions are provided by the ARM instruction set:

• Coprocessor Data Operations (CDP)

• Coprocessor Data Transfers (LDC, STC)

• Coprocessor Register Transfers (MRC, MCR)

The coprocessor data operations (CDP) are used by the processor to request the execution of an operation by a given coprocessor. All coprocessor operations must provide the “CP#” field, which selects the target coprocessor by a unique number. No result is communicated back to the processor, which does not wait for the operation to complete. The coprocessor may contain a queue of such instructions awaiting execution, and their execution can overlap other activity, allowing the coprocessor and the ARM7TDMI to perform independent tasks in parallel. In this class of instructions only some bits are meant for the processor, i.e. the condition field and a 4-bit field which identifies the instruction itself. The remaining bits are used by the coprocessors and some field names are used by convention; all of these fields, except the coprocessor identifier, may be redefined.

The coprocessor data transfer operations are used to load (LDC) or store (STC) a subset of a coprocessor's registers directly from or to memory. The ARM7TDMI is responsible for supplying the memory address, while the coprocessor supplies or accepts the data and controls the number of words transferred through a fixed handshaking protocol. The only fields reserved to the coprocessor are the “CP#” identifier and a 4-bit index which points to one of its internal registers. The processor has to decode all the other bits to obtain the memory addressing information. The addressing modes available are a subset of those used in single data transfer instructions (section 3.5.8), although pre-indexed, post-indexed and PC-relative addressing modes are all provided.

The coprocessor register transfer operations (MRC, MCR) are used to communicate information directly between the processor and a coprocessor, transferring data from register to register. An important use of these instructions is to transmit control information from the coprocessor to the ARM7TDMI CPSR flags, in order to control the subsequent flow of execution after, e.g., an arithmetic coprocessor operation. As discussed for the data operation class, in this group of instructions only a few bits are meaningful for the processor, i.e. the condition field and a 4-bit field which identifies the instruction itself. The remaining bits are used by the coprocessors, and the field conventions may be ignored where appropriate, except for the coprocessor identifier field.
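The three instruction classes can be told apart from a few bits of the 32-bit word, as the following Python sketch shows. The bit positions used here (bits 27:24, bit 20, bit 4 and the CP# field in bits 11:8) are those of the standard ARM encoding and are not listed explicitly in this section; the function name is hypothetical.

```python
def classify_coprocessor_instruction(instr):
    """Return (class_name, cp_number), or None for other instructions."""
    cp_num = (instr >> 8) & 0xF                 # CP# field, bits 11:8
    load = bool(instr & (1 << 20))              # L bit
    if (instr >> 25) & 0x7 == 0b110:            # data transfer class
        return ("LDC" if load else "STC"), cp_num
    if (instr >> 24) & 0xF == 0b1110:
        if instr & (1 << 4):                    # register transfer class
            return ("MRC" if load else "MCR"), cp_num
        return "CDP", cp_num                    # data operation class
    return None

# 0xEE110F10 is "MRC p15,0,R0,c1,c0,0": register transfer, coprocessor 15.
```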

The cycle times of the coprocessor instructions depend on the number of busy-wait cycles required for the execution of internal operations and must be analyzed case by case, considering the functionality of the coprocessor.

The coprocessor instructions are not implemented in this thesis work, because the goal of the model is to obtain an extensible processor description, which can itself be extended to perform other operations without the need for external coprocessors.

3.5.14 Undefined instruction

The undefined instruction is conditional; this makes it possible to establish during program execution whether the undefined instruction trap is to be taken or not, so that the corresponding handling routine can be activated. The execution of the undefined instruction involves forwarding the unrecognized instruction to the connected coprocessors; if one of them accepts the instruction, the CPA and CPB signals are used for the handshaking and the coprocessor operations are activated. The instruction cycle time must take into account the forwarding of the request to the coprocessors and their response; if no coprocessor is present, control must be transferred to the exception handler, so the operation takes an internal cycle, a non-sequential cycle (jump to the location pointed to by the exception vector) and two sequential cycles (pipeline refilling) to complete. Because of its particular meaning and use, the operation has no assembler mnemonic, so it cannot be invoked by a simple line of code. The coding format is reported in figure 3.6, where many “don’t care” bits are evident.

3.6 Thumb instruction set

The Thumb instruction set provides fewer operations than the ARM instruction set, and every instruction is only 16 bits wide. This property increases code density, while all the operations are still performed on the standard 32-bit registers of the architecture. In order to reduce the number of bits to encode, only eight of the sixteen ARM state general purpose registers are accessible to most operations, so that a register index takes only three bits. Through special instructions, the high registers (from R8 to R15) can also be accessed, but for particular purposes only.

As discussed in paragraph 3.2, every Thumb instruction is dynamically translated into an ARM instruction by the decoder and then executed. The translation guidelines are provided in [16]. Clearly, a 16-bit encoded operation cannot include the same features as a 32-bit instruction, so a reduced number of arithmetic-logic operations is allowed. Many operations use the same register as source and destination, and immediate values are provided only for some instructions.

Addition and subtraction are encoded in a single instruction format, separate from the other ALU operations; fewer ALU operations are provided by the instruction set, and all the barrel shifter operations are explicitly included in this group. The branch instructions are available in conditional and unconditional versions; a long branch with immediate offset and, of course, the branch and exchange instruction are also provided. Memory access operations can manage every data size, but with some limitations on the addressing modes. Stack pointer (SP or R13) relative load/store and push/pop register operations ease the use of stack structures, saving bits in the encoding. Coprocessor operations are not provided in THUMB state, so the use of an external unit requires a change of instruction set and processor state. Interrupt and exception handling is also provided in Thumb state, and it forces the switch to ARM state; the software interrupt instruction is provided as well, so that supervisor mode can be accessed.
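The compactness obtained with 3-bit register indexes can be seen by decoding the Thumb add/subtract format. The field layout used in this Python sketch (bits 15:11 equal to 00011, an immediate flag in bit 10, the operation in bit 9) is the standard Thumb encoding and is not given in this section; the function name and the returned dictionary are hypothetical.

```python
def decode_thumb_add_sub(halfword):
    """Decode the Thumb add/subtract format; return None for other formats."""
    if (halfword >> 11) & 0x1F != 0b00011:
        return None
    return {
        "op": "SUB" if halfword & (1 << 9) else "ADD",
        "rd": halfword & 0x7,                 # destination, low register
        "rs": (halfword >> 3) & 0x7,          # source, low register
        "operand": (halfword >> 6) & 0x7,     # Rn index or 3-bit immediate
        "operand_is_immediate": bool(halfword & (1 << 10)),
    }

# 0x1888 decodes as ADD R0,R1,R2; 0x1E48 decodes as SUB R0,R1,#1.
```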



3.7 The memory interface

The processor memory interface is highly configurable, allowing both SRAM and DRAM systems, as well as ROM, to be connected. Besides the 32-bit address bus, a bidirectional 32-bit data bus (D) is provided, together with a pair of unidirectional data buses (DIN and DOUT) of the same width. A complete set of control signals is provided, in order to fully exploit DRAM page mode access. The processor accesses the memory system via four categories of memory transfer cycles:

• Non-sequential cycle, performed when the processor requests a transfer to or from an address which is unrelated to the address used in the preceding cycle.

• Sequential cycle, when the processor requests a transfer to or from an address which is either the same as the address in the preceding cycle, or is one word or halfword after the preceding address.

• Internal cycle, in which a memory transfer is not required because the processor is performing an internal function and no useful prefetching can be done at the same time.

• Coprocessor register transfer cycle, which allows the processor to use the data bus for coprocessor communications, so that no action by the memory system is required.

Two signals are provided to tell the memory system and the coprocessors connected to the bus which type of memory cycle the processor is executing. The nMREQ output signal indicates that the processor requires a memory access during the following cycle (memory request). The SEQ output signal goes high when the address of the next memory cycle is related to that of the last memory access, i.e. the new address is either the same as the previous one, or four greater in ARM state, or two greater in THUMB state. When no memory request is issued by the processor, the SEQ signal is also used to discriminate between an internal cycle and a coprocessor transfer cycle. An example of the four types of memory access timings is reported in figure 3.13.
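The decoding of the cycle type from the two control signals can be sketched in Python as follows. The truth table follows the description above, with nMREQ active low; both function names are hypothetical.

```python
def memory_cycle_type(nmreq, seq):
    """Decode the cycle type from the nMREQ (active low) and SEQ levels."""
    return {
        (0, 0): "non-sequential",
        (0, 1): "sequential",
        (1, 0): "internal",
        (1, 1): "coprocessor register transfer",
    }[(nmreq, seq)]

def is_sequential_address(previous, current, thumb):
    """A sequential address repeats the previous one or follows it by
    one word (ARM state) or one halfword (THUMB state)."""
    step = 2 if thumb else 4
    return current in (previous, previous + step)
```

A memory controller could use the first function to decide whether a DRAM page-mode access is possible, and the second one as a consistency check in a simulation model.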

To allow the correct use of dynamic memory systems, the processor must provide the address of the location to be accessed as early as possible, in order to leave time for the longer address decoding and for the generation of the DRAM control signals. To this end the ARM7TDMI provides pipelined and de-pipelined address capabilities selected by an input signal (APE); through this input every single memory access can be configured, so that mixed DRAM/SRAM memory resources can also be accessed.

Figure 3.13. Memory cycle timings diagram

The processor can transfer data of different sizes; thanks to byte addressing, words (four bytes), halfwords (two bytes) and single bytes can be accessed. The data size of the transaction taking place is signalled by the MAS[1:0] output, and the position of sub-word sized data can be inferred from the address value. When a load operation on a half-word or byte is performed, the memory system can present the whole word on the data bus, since the processor selects the required part of the 32-bit data. The memory management unit has to interpret correctly the address communicated by the processor and the MAS signal, ignoring the least significant address bits in word and half-word accesses and presenting correctly aligned data. For power consumption reasons and to simplify the decoding logic, however, the MMU may present only the required part of the data without causing interfacing problems. During store operations of sub-word sized data, the ARM processor broadcasts the byte or the half-word on the whole data bus, so that the same byte is repeated in every byte lane, or the 16-bit data is replicated twice. In this case the MMU has to decode the least significant bits of the address to deduce the right placement of the data to be stored.
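The replication of sub-word data on stores and the selection of the required part on loads can be modelled with the Python sketch below. It assumes a little-endian configuration and the MAS[1:0] encoding 00 for byte, 01 for halfword and 10 for word; the function names are hypothetical.

```python
MAS_BYTE, MAS_HALFWORD, MAS_WORD = 0b00, 0b01, 0b10

def store_bus_value(value, mas):
    """Value driven on the 32-bit data bus during a store."""
    if mas == MAS_BYTE:
        return (value & 0xFF) * 0x01010101      # byte repeated on 4 lanes
    if mas == MAS_HALFWORD:
        return (value & 0xFFFF) * 0x00010001    # halfword replicated twice
    return value & 0xFFFFFFFF

def select_loaded_data(word, address, mas):
    """Part of the presented word used by a load (little-endian assumed)."""
    if mas == MAS_BYTE:
        return (word >> (8 * (address & 3))) & 0xFF
    if mas == MAS_HALFWORD:
        return (word >> (8 * (address & 2))) & 0xFFFF
    return word & 0xFFFFFFFF
```

With the word 0x11223344 on the bus, a byte load from an address ending in binary 01 selects 0x33, while a halfword load from an address ending in binary 10 selects 0x1122.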

The instruction fetch must respect the processor state: in ARM state the whole word is read from the data bus, while in THUMB state the right 16-bit instruction is selected. For the memory system there is no difference between the two operating states, because the MAS signal is also driven during code segment accesses, and so is the fetch address.

To allow the memory system to notify the processor that a data access is impossible, the processor ABORT input may be used; as already discussed in paragraph 3.4, this event occurs on memory page faults and the data abort handler must manage the situation so that the MMU can retrieve the required memory page. Other signals, such as nRW, are driven by the processor to access the memory in read or write mode, and the nTRANS output indicates whether the processor is in user or in a privileged mode; the latter information may be used to protect system pages from user access, or to support completely separate memory mappings for system and user modes.

During the execution of a swap instruction (SWP), which allows the contents of a memory location to be swapped with the contents of a processor register, the memory controller must not grant access to another device, in order to prevent the affected memory location from being changed before the operation is completed. The reason is that the instruction is implemented as an uninterruptible pair of memory accesses: the first access reads the contents of the addressed memory location, the second writes the register content to the same memory cell. The ARM7TDMI drives the LOCK signal high for the duration of the swap operation, to signal to the MMU that a swap instruction is being executed.

To ease the connection of the processor to sub-word sized memory systems, input data and instructions may be latched on a byte-by-byte basis. This is achieved through the BL[3:0] input signals, where every bit controls the latching of one byte of the word present on the data bus. With these signals a word access to a halfword-wide memory takes place in two memory cycles: in the first cycle, the first half-word is obtained from the memory and latched into the processor while BL[1:0] are both high; in the second cycle, the other half-word is latched into the processor while BL[3:2] are both high. Since two memory cycles are required, nWAIT is used to stretch the internal processor clock. When accessing slow peripherals, the processor can be made to wait for an integer number of MCLK cycles by driving nWAIT low. Internally, this signal is ANDed with MCLK, which extends the clock cycle for as long as nWAIT is held low.
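The byte-by-byte latching can be modelled by the following Python sketch, in which bit i of BL enables the latching of byte lane i. The function name is hypothetical, and the example assumes that the halfword-wide memory drives the lower lanes in the first cycle and the upper lanes in the second one.

```python
def latch_bytes(latched, bus, bl):
    """Update the latched word with the bus bytes enabled by BL[3:0]."""
    for lane in range(4):
        if bl & (1 << lane):
            mask = 0xFF << (8 * lane)
            latched = (latched & ~mask) | (bus & mask)
    return latched & 0xFFFFFFFF

# Word read from a halfword-wide memory in two cycles:
word = latch_bytes(0, 0x00003344, 0b0011)      # first cycle, BL[1:0] high
word = latch_bytes(word, 0x11220000, 0b1100)   # second cycle, BL[3:2] high
```

After the second cycle the variable word holds 0x11223344, i.e. the complete 32-bit value assembled from the two half-words.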

Since the processor is provided with a pair of unidirectional data buses besides the bidirectional one, it can be interfaced with all kinds of external memory systems. In order to comply with some ASIC specifications for embedded systems, however, the bidirectional data bus may not be usable, and communication is then allowed only on the unidirectional buses. The timings of the two types of buses are identical, but they cannot be enabled at the same time, and the BUSEN input signal selects which one is to be used. To use the bidirectional data bus, moreover, it is necessary to establish when the processor uses it for a write cycle and when for a read operation. Besides the nRW signal, the nENOUT output is provided, which is driven low to indicate that the data bus is driven by the processor. Whenever the ARM is not sending data on the bidirectional bus, the signal is driven high. Another signal, nENIN, is provided to allow the bus to be put in three-state mode: the external unit drives this input high to signal that no data is supplied to the processor, so that the bus can be put in the high impedance state.



3.8 The coprocessor interface

The processor can be connected to a number of external coprocessors (up to sixteen) in order to extend the functionality of the native instruction set. Every coprocessor is selected through the dedicated 4-bit field (CP#) within the coprocessor instruction; when the addressed coprocessor is not present, instructions intended for it will trap. In this case, to avoid abnormal program termination, suitable software may be installed to emulate the functions of the missing coprocessor. The interface with the coprocessors relies on three handshaking signals (nCPI, CPA, CPB) and on the same data bus connected to the memory system. The processor takes nCPI low whenever it starts to execute a coprocessor (or undefined) instruction and, being connected to the data bus, each coprocessor receives a copy of the instruction. Every coprocessor inspects the “CP#” field to establish whether it matches its own number; if so, it should drive the CPA signal line low. If no coprocessor has a number which matches the CP# field, CPA and CPB remain high and the processor takes the undefined instruction trap. Otherwise the processor observes the CPA line going low, and waits until the same coprocessor also drives the CPB signal low. Only the coprocessor which is driving CPA low is allowed to drive CPB low, and it should do so when it is ready to execute the instruction. The processor busy-waits while CPB remains high, i.e. it does not execute other operations until the coprocessor is ready to perform the requested operation, unless an enabled interrupt occurs. In that case it breaks off from the coprocessor handshake to process the interrupt, resuming the coprocessor instruction later and repeating the operations described above. When CPB goes low, the instruction continues to completion. This involves data transfers taking place between the coprocessor and either the ARM7TDMI or memory, until the coprocessor ceases to be busy and the CPB signal indicates that the instruction is completed.
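The processor side of the handshake described above can be summarized by a small Python sketch. The signal levels are given as 0 (low) and 1 (high); the function names and the returned strings are hypothetical, and interrupts arriving during the busy-wait are deliberately left out of the model.

```python
def coprocessor_response(cpa, cpb):
    """Interpret the CPA/CPB levels sampled after nCPI has gone low."""
    if cpa == 1 and cpb == 1:
        return "absent"      # nobody claimed the instruction
    if cpa == 0 and cpb == 1:
        return "busy"        # claimed, but the coprocessor is not ready
    if cpa == 0 and cpb == 0:
        return "ready"       # claimed and ready to execute
    return "invalid"         # CPB low without CPA low is not allowed

def run_handshake(samples):
    """Walk through successive (CPA, CPB) samples, one per clock cycle.

    Returns the outcome and the number of busy-wait cycles spent.
    """
    for waited, (cpa, cpb) in enumerate(samples):
        state = coprocessor_response(cpa, cpb)
        if state == "absent":
            return "undefined instruction trap", waited
        if state == "ready":
            return "execute", waited
        if state == "invalid":
            raise ValueError("CPB driven low while CPA is high")
    return "still waiting", len(samples)
```

For example, a coprocessor that needs two cycles before becoming ready produces the samples (0,1), (0,1), (0,0), and the processor starts the execution after two busy-wait cycles.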

Since the coprocessors are all connected to the ARM data bus and the instructions are fetched by the processor via the same bus, the coprocessors can also load their internal pipelines without requiring a subsequent communication like the one described above. To allow the coprocessors to distinguish between a data load and an instruction fetch, the nOPC signal is provided. To activate the communication between the processor and a coprocessor, a register transfer cycle must be performed, so the nMREQ signal is driven high to exclude the memory system from the transaction.

To load or store coprocessor internal registers directly from or to memory, via the coprocessor data transfer instructions, a particular method is used. The coprocessor controls the number of registers to be loaded or stored and performs the memory access via the data bus, directly with the memory system. The memory addressing and the driving of the memory control signals are performed by the processor. The coprocessor is responsible for the number of words to be transferred and, once the transaction begins, the ARM increments the starting address to execute each subsequent memory access. By driving the CPA and CPB lines high, the coprocessor signals the termination condition to the processor.

For particular uses, such as the activation of certain units, the coprocessor can restrict the execution of some instructions to privileged modes only. To do so, the coprocessor also has to track the nTRANS processor output.

Since undefined instructions are treated by the ARM as coprocessor instructions, all coprocessors must appear absent (i.e. CPA and CPB must be high) when a real undefined instruction is presented, so that the processor takes the undefined instruction trap. To differentiate an undefined instruction from a coprocessor one, the coprocessor need only look at bit 27 of the instruction (it is “0” for undefined, “1” otherwise). To avoid incorrect behavior in THUMB state (in which coprocessor instructions are not supported but undefined instructions are), all coprocessors must monitor the TBIT output and, when it indicates THUMB state, drive the CPA and CPB inputs so as to appear absent and ignore the data bus. In this way, coprocessors will not erroneously execute Thumb instructions, and all undefined instructions will be handled correctly.
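The decision taken by each coprocessor can be condensed in the following Python sketch. The function and its parameters are hypothetical; the check on bit 27 is meaningful only once nCPI has gone low, i.e. when the processor has already classified the instruction as coprocessor or undefined.

```python
def coprocessor_should_respond(instr, own_cp_num, ncpi, tbit):
    """Decide whether a coprocessor may claim the instruction (CPA low)."""
    if ncpi != 0:
        return False              # no coprocessor/undefined instruction issued
    if tbit:
        return False              # THUMB state: always appear absent
    if not instr & (1 << 27):
        return False              # real undefined instruction
    return (instr >> 8) & 0xF == own_cp_num   # CP# must match
```

A coprocessor for which the function returns False leaves CPA and CPB high, so that the processor takes the undefined instruction trap when nobody else responds.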

3.9 The debugging system

The ARM processor is provided with a debug interface based on the Boundary Scan standard (IEEE Std. 1149.1-1990), which represents a hardware extension with advanced debugging features; it is intended to ease the development of application software, operating systems and hardware embedding the core. A typical debug system is made up of three parts: a host (usually a PC running debugging software), the ARM7TDMI core and, in order to manage the communication between the two, a protocol converter. The debug extensions allow the core to be stopped either on a given instruction fetch (breakpoint) or data access (watchpoint), or asynchronously by a debug-request signal. On entering debug mode the core internal state and the system external state may be examined in depth; then the core and system state may be restored and program execution resumed. The core internal state can be examined via the JTAG serial interface (TAP controller¹⁰). The JTAG system includes a set of multifunctional registers (fig. 3.15) connected to all the inputs and outputs of the processor; the registers are also serially connected to each other, so that they form a chain whose ends are connected to the JTAG controller. The scan cells can force the input signals and sample the output signals of the processor, in order to inspect its internal state, and they can also be made transparent to allow the normal execution flow. To do so, the necessary instructions must be loaded into the TAP controller. By using the scan chain, instructions can be serially inserted into the core pipeline without using the external data bus (fig. 3.14). For example, when in debug state, a store-multiple (STM) could be inserted into the instruction pipeline and this would dump the contents of the ARM7TDMI registers. This data can be serially shifted out without affecting the other parts of the system. The debug system is not fully JTAG compliant; however, it supports all the mandatory instructions as well as other typical standard operations.

¹⁰ Test Access Port; its functionalities are defined by the Boundary Scan standard.
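The serial behavior of a scan chain can be illustrated with a minimal Python model of a shift register. The function name, the list-based chain and the bit ordering (index 0 nearest to the serial input) are assumptions made for the example and do not reproduce the actual ARM7TDMI scan cell order.

```python
def shift_scan_chain(chain, tdi_bits):
    """Shift 'tdi_bits' into the chain, one bit per clock.

    Returns (new_chain, tdo_bits): the new cell contents and the bits
    that appeared on the serial output while shifting.
    """
    chain = list(chain)
    tdo_bits = []
    for bit in tdi_bits:
        tdo_bits.append(chain[-1])     # the last cell drives the serial output
        chain = [bit] + chain[:-1]     # every cell takes its predecessor's value
    return chain, tdo_bits
```

Shifting as many bits as there are cells reads out the whole captured state while loading a new one, which is how values can be both sampled from and forced onto the core signals.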

Figure 3.14. ARM7TDMI Boundary Scan scan chain

Figure 3.15. ARM7TDMI Boundary Scan input scan cell

The ARM7TDMI processor is provided with the EmbeddedICE (or ICEBreaker) core extension as an on-chip debug resource. It consists of two real-time watchpoint units, with the associated control and status registers, as well as a set of registers implementing a communication channel with the debugger, referred to as the Debug Communications Channel (DCC). Communication on the channel uses the standard method adopted for coprocessors; for this reason coprocessor identifier 14 is reserved and is not available for a normal external coprocessor. The watchpoint units can be programmed to halt the execution of instructions by the ARM core. Execution is halted when a match occurs between the values programmed into the EmbeddedICE macrocell and the values currently appearing on the address and/or data buses. Either watchpoint unit can be configured to be a data watchpoint (monitoring data accesses) or an instruction breakpoint. The ICEBreaker is programmed serially by using the same TAP controller discussed above. Via the EmbeddedICE interface all the internal resources of the processor can be accessed and modified, so that the software can be fully debugged and the correct behavior of many internal parts of the processor can be tested.
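The match condition evaluated by a watchpoint unit can be sketched in Python as a masked comparison. The dictionary layout and the function name are hypothetical, and the sketch assumes that a mask bit set to 1 excludes the corresponding bit from the comparison; the control signal comparison performed by the real macrocell is not modelled.

```python
def watchpoint_hit(address, data, wp):
    """Return True when both the address and data comparisons match.

    'wp' holds the programmed values and masks; a 1 in a mask bit is
    assumed to mean "ignore this bit".
    """
    addr_ok = ((address ^ wp["addr_value"]) & ~wp["addr_mask"] & 0xFFFFFFFF) == 0
    data_ok = ((data ^ wp["data_value"]) & ~wp["data_mask"] & 0xFFFFFFFF) == 0
    return addr_ok and data_ok

# Watch every access in the 256-byte block starting at 0x8000,
# whatever data is transferred.
block_watch = {"addr_value": 0x8000, "addr_mask": 0xFF,
               "data_value": 0, "data_mask": 0xFFFFFFFF}
```

With this programming an access to 0x8010 triggers the watchpoint, while an access to 0x8100 does not.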



Figure 3.16. ARM7TDMI EmbeddedICE block diagram


Chapter 4

LISATek toolsuite

The diffusion of electronic devices in various aspects of everyday life has deeply changed many constraints in industrial production. The electronics and telecommunications markets impose very short product life cycles, so that time-to-market and time-to-volume have become fundamental parameters to guarantee the economic return on the introduction of a new device. On the other hand, semiconductor technology has opened the way to the integration of great capabilities on a single chip and this, together with increasing consumer needs and the demand for new powerful applications, has led to an enormous growth in the complexity of digital designs. The distributed hardware approach has been abandoned in favor of SoC¹ designs, which allow heterogeneous components and multicore intercommunicating systems to be manufactured on a single die. This powerful technique, however, requires quite long development cycles, so that designer productivity has become a vital factor for successful products. For this reason, the idea of implementing in software powerful algorithms for system functions and signal processing, thus reducing the complexity of hardware development, has led to the shift from purely hardwired digital systems to the inclusion of programmable cores in SoC designs [11]. This strategy reduces the burden on the hardware designer, who has to optimize fewer aspects of the target architecture than in the purely hardwired SoC approach, and it represents a new design methodology that leads to the design of Application Specific Instruction-set Processors (ASIPs). In this environment the LISATek toolsuite [8] represents an innovative approach, because it introduces automation in a development sector in which most of the steps are executed manually, allowing the implementation of both the processor and the toolchain for software design.

¹ System-on-Chip or System-on-a-Chip


4.1 The ASIP design flow

The design flow of an ASIP has four main phases:

• Architecture exploration.

• Architecture implementation.

• Application software design or toolchain creation.

• System integration and verification.

During the first phase many tools are required: starting from the application that is to run on the processor, an HLL² compiler, an assembler, a linker and a cycle-accurate model of the ASIP are needed. This phase is an iterative process in which the model is profiled and benchmarked, so that its structure can be optimized to fit the requirements of the target application. Clearly, modifying any part of the model requires a revision of all the software tools in use, and in the absence of automatic tools this is a significant complication.

Once a valid cycle-accurate model is obtained, the architecture implementation can be started: the functional description, often written in an HLL like C, has to be converted into a synthesizable description in an HDL³ such as VHDL or Verilog. This is another manual phase, and it exposes the final result to consistency problems between the HLL model and the hardware model.

In the application software design phase, software designers need a set of software development tools, substantially the same instruments used in the exploration phase, but enhanced so that this step can be completed quickly. The demands of application software designers and hardware processor designers differ: the latter need a cycle-accurate simulator for hardware and software analysis, which is inevitably slow, while the former need simulation speed more than accuracy. These tools require two different approaches and optimizations, and the reimplemented versions must also be produced manually.

The system integration and verification phase requires cosimulation interfaces, which integrate both the hardware description of the processor and the software simulator into a system simulation environment, so that both descriptions are stimulated by the same patterns or executable files. In this case too, an architectural modification may require a revision of the written interface.

² High Level Language
³ Hardware Description Language

The effort of designing a new architecture can be reduced by using a retargetable approach based on a description of both the platform and the instruction set. The Language for Instruction-Set Architectures (LISA) was developed for the automatic generation of consistent software development tools and synthesizable HDL code [9]. A LISA processor description covers the instruction set and the behavioral and functional model, including the underlying hardware timings, and so provides all the essential information for the generation of a complete set of development tools including C compiler, assembler, linker, and simulator. Because it contains the definition of all microarchitectural details, a LISA description also allows synthesizable HDL code of the modelled architecture to be generated, in either VHDL or Verilog. Another key point of this powerful language and toolkit is that changes to the architecture are easily transferred to the LISA model and are applied automatically to the toolchain and the hardware implementation. The LISATek toolsuite also regenerates the toolchain automatically when an upgraded processor is produced, so that there is no need to rewrite the tools manually.

The LISA statements represent an unambiguous abstraction of the real hardware [15], so a LISA model description bridges the gap between hardware and software design. It provides the software developer with all the required information and enables the hardware designer to synthesize the architecture from the same specification the software tools are based on. The alternative approach to processor modeling and simulation uses HDLs, but this way is mainly oriented to hardware optimization and has many disadvantages for architecture exploration. The simulation of processor models written in HDL, in particular cycle-accurate models, covers many hardware implementation details which are not required to evaluate processor performance in cycle-based simulations and software verification; the real problem is that such in-depth descriptions have dramatic effects on simulation speed.

Many other machine description languages providing instruction-set modeling capabilities are available, and most of them exploit retargetable code generation and simulation approaches. Some of them allow the generation of the complete toolchain for software development and of synthesizable HDL code, but require complex descriptions with a mixed behavioral and structural approach; in any case, none allows as simple and efficient a description of pipeline operations like flushes and stalls as LISA does [12].

Another noteworthy and widespread approach is provided by Tensilica, which holds a good market share, especially in mobile phone applications. The Xtensa system [10] allows a RISC processor, which has a number of base instructions, to be customized and extended via a retargetable tool suite; because it is based on an architecture template, this design method has the disadvantage of generating hardware that is not highly optimized for highly application-specific processors or very simple platforms. Starting from the aforementioned considerations, we claim that the LISATek toolsuite represents a fully retargetable approach to architecture exploration, software tools development, architecture implementation and system verification and integration for a wide range of processor architectures, from very essential processors to pipelined processors, superscalar architectures, single instruction multiple data (SIMD) and also VLIW processors.

Figure 4.1. The ASIP design flow

4.2 Architecture exploration

The exploration of the processor architecture starts from the analysis of the algorithms which must be executed on the programmable platform [13]. The development of these algorithms is beyond the scope of the LISA platform and is usually done with application-specific tools which focus on system-level design. Often the result of this process is a pure functional specification represented by an executable prototype written in a high level language like C or C++ and accompanied by requirements such as cost and performance parameters of the desired system.

The following step is to derive figures for the most computationally intensive blocks of the whole system, a task that can be easily performed with a standard profiling tool⁴. This tool makes it possible to extract some fundamental statistics during the simulation of the functional prototype. This procedure allows the designer to focus on the performance-critical parts of the application code and therefore to define correctly the parts of the data path of the programmable architecture which need more care, working at the assembly instruction level.

⁴ A profiler is a performance analysis tool that measures the behavior of a program as it runs, particularly the frequency and duration of function calls; the output is a stream of recorded events (a trace) or a statistical summary of the events observed (a profile).
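As a minimal sketch of this profiling step, the following fragment uses Python's standard cProfile and pstats modules on a small functional prototype. The prototype itself (fir_filter, scale, prototype) is purely hypothetical and is not taken from the application studied in this work; it only shows how the most expensive function can be singled out before any architectural decision is taken.

```python
import cProfile
import pstats


def fir_filter(samples, taps):
    """Hypothetical compute-intensive kernel of the functional prototype."""
    out = []
    for i in range(len(samples) - len(taps) + 1):
        acc = 0
        for j, t in enumerate(taps):
            acc += samples[i + j] * t
        out.append(acc)
    return out


def scale(values, factor):
    """Hypothetical cheap post-processing step."""
    return [v * factor for v in values]


def prototype():
    samples = list(range(5000))
    filtered = fir_filter(samples, list(range(1, 17)))
    return scale(filtered, 2)


def hottest_function(entry_point):
    """Profile entry_point and return the name of the function with the
    largest own time (time spent excluding callees)."""
    profiler = cProfile.Profile()
    profiler.enable()
    entry_point()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    (_, _, name), _ = max(stats.stats.items(), key=lambda kv: kv[1][2])
    return name
```

Calling hottest_function(prototype) is expected to point at the filter kernel, which would then be the natural candidate for special-purpose instructions.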

An easy way to begin the model development is to pick a simple LISA processor model (like one of the example projects supplied with the tool or the tutorial project), which implements a basic instruction set, and then modify it by enhancing resources and creating new special-purpose instructions, in order to improve the performance of the considered application. With this method the most complex and critical parts of the target application code are translated into assembly by making use of the specified special-purpose instructions. By using the assembler, linker and processor simulator derived from the LISA model, the designer can iteratively profile and modify the programmable architecture running the selected application, until it fulfills the performance requirements.

When the analysis and optimization of the critical parts of the application is completed, the instruction set needs to be completed in order to allow the execution of all the other parts. These parts usually have little effect on overall performance, so it is very often feasible to employ the C compiler derived from the LISA model and accept suboptimal assembly code quality in return for a significant cut in design time.

Other optimizations can be performed on the microarchitecture by improving microcode efficiency, not only with respect to the software-related aspects, but also with regard to hardware behavior. For this purpose, the LISA language provides capabilities to model the cycle-accurate behavior of pipelined architectures. The LISA model is supplemented by the instruction pipeline and the execution of every instruction is assigned to the respective pipeline stage, so the designer is able to verify that the cycle-true processor model still satisfies the performance requirements. At the last stage of the design flow, the HDL generator produces synthesizable HDL code for the basic structure and the control path of the architecture, and also implements the dedicated execution units of the data path. Further information on hardware cost and performance parameters (e.g. design size, power consumption, clock frequency) can be derived by running the HDL processor model through the standard synthesis flow. At this high level of detail, the designer can tweak the computational efficiency of the architecture by applying different implementations of the data path execution units directly in the LISA model files, instead of settling for suboptimal solutions by acting on the HDL implementation produced by the automatic tool.



4.3 The architecture description: the LISA language

The LISA language is an Architecture Description Language (ADL) which inherits most of the characteristics of the C language and aims at the formalization and description of programmable architectures and their interfaces. The principal purpose of this description language is to close the gap between hardware description languages (HDLs) and the languages oriented to instruction-set development. A LISA description is essentially made up of resource and operation descriptions. The resources represent the storage objects the processor can count on, i.e. general-purpose and dedicated registers, pipeline registers, memories and cache memories, which capture the system state. The other description elements are the operations, which describe the behavior of the architecture, from the decoding of the instructions to their step-by-step execution, as well as the structure of all the processor parts.

The LISA approach to modeling the various parts of the architecture divides the problem into the following conceptual parts:

• Memory model

• Resource model

• Instruction-set model

• Behavioral model

• Timing model

• Microarchitecture model

These can be considered as model components; to realize them, a series of information items and properties must be obtained from the target architecture specification and also from other components of the same model, as depicted in the diagram of figure 4.2 and in figure 4.3.

Each of the model components is described via dedicated LISA statements, as briefly reported in the following subsections.

4.3.1 Memory model

The memory model is substantially a list of all the registers and memories the system is provided with. The description includes their respective bit widths and ranges or, indirectly, these parameters are defined by using the built-in data types of the C language. A useful characteristic of the LISA language is the aliasing of resources, so that a storage object, or part of it, can be referred to by a common architectural name (e.g. the program counter, usually called “PC”). The memory configuration must be provided in this section, to allow correct linking of the object code. During simulation, the entirety of the storage elements represents the state of the processor, and the state of every memory element can be displayed in the debugger. The HDL code generator derives the basic architecture structure from the definitions of this section, so that the memory model can be built.

Figure 4.2. LISA processor model parts

Figure 4.3. LISA model parts and file sections

The resources that can be declared in a memory model description are:

• Simple resources, such as registers and register files, signals, flags and ideal memory arrays.

• Pipeline structures for instruction and data paths.

• Pipeline registers storing data for the shifting from one pipeline stage to the next.

• Non-ideal memories, such as caches, non-ideal RAMs and buses (as part of the memory subsystem).

• Memory maps for processors which use more than one memory, to obtain the correct addressing of the single resources.
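The kind of information collected by the memory model can be sketched in Python as follows. This is not LISA syntax: the class names, the region names and the sizes are illustrative assumptions, while the alias of the program counter to register 15 follows the ARM convention. The sketch covers a register file with a declared bit width, an alias and a memory map with two regions.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterFile:
    """A bank of fixed-width registers; values wrap to the declared bit width."""
    count: int
    width: int
    values: list = field(default_factory=list)

    def __post_init__(self):
        self.values = [0] * self.count

    def write(self, index, value):
        self.values[index] = value & ((1 << self.width) - 1)

    def read(self, index):
        return self.values[index]


@dataclass
class MemoryRegion:
    name: str
    base: int
    size: int


class MemoryModel:
    def __init__(self):
        self.regs = RegisterFile(count=16, width=32)
        self.aliases = {"PC": 15}          # architectural name -> register index
        self.memory_map = [                # hypothetical regions and sizes
            MemoryRegion("prog_mem", 0x0000_0000, 0x1000),
            MemoryRegion("data_mem", 0x0001_0000, 0x1000),
        ]

    def read_alias(self, name):
        return self.regs.read(self.aliases[name])

    def region_of(self, address):
        """Return the name of the region containing address."""
        for region in self.memory_map:
            if region.base <= address < region.base + region.size:
                return region.name
        raise ValueError(f"unmapped address {address:#x}")
```

A lookup such as region_of is what lets a linker or a simulator decide which physical memory serves a given address.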

For the memory system description many parameters can be defined within the LISA files, ranging from the size, subblock and endianness organization of simple memories to the accessibility of the resources (read, write and execute permissions), and from static or dynamic RAM timing parameters to cache access policies. Powerful operations are provided for managing the memories according to the abstraction level chosen for the model description, so that the memory resources can be accessed via a purely functional method or by using cycle-count accurate or cycle-based techniques.

Cycle-count accurate memory simulation allows very easy modeling of the memory hierarchy and can provide the user with statistics and profiling data about resource utilization.

In cycle-based memory access, the read and write operations are implemented via requests to the respective memory (or bus) module, so that the operation is executed only when the module is not busy, and within a certain number of clock cycles after the request is accepted. In this case too, statistics and profiling information can be collected for performance analysis.
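The request-based access scheme can be illustrated with a small Python sketch. It does not reproduce the LISATek memory API; the class and method names are assumptions chosen for the example, and the latency is assumed to be at least one cycle. A request is accepted only when the module is idle, the data becomes available a fixed number of cycles later, and simple counters are kept for later analysis.

```python
class CycleBasedMemory:
    """Memory module that serves one read request at a time with a fixed latency."""

    def __init__(self, size, latency):
        self.cells = [0] * size
        self.latency = latency       # assumed to be >= 1 clock cycle
        self.pending = None          # (cycles_left, address) of the accepted read
        self.accepted = 0            # statistics for performance analysis
        self.rejected = 0

    def request_read(self, address):
        """Return True if the request is accepted, False if the module is busy."""
        if self.pending is not None:
            self.rejected += 1
            return False
        self.pending = (self.latency, address)
        self.accepted += 1
        return True

    def tick(self):
        """Advance one clock cycle; return the data when the read completes."""
        if self.pending is None:
            return None
        cycles_left, address = self.pending
        cycles_left -= 1
        if cycles_left == 0:
            self.pending = None
            return self.cells[address]
        self.pending = (cycles_left, address)
        return None
```

In a processor model the rejected request would typically translate into a pipeline stall until the module is free again.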

4.3.2 Resource model

The resource model describes the available hardware resources and is obtained by evaluating how the operations access the elements of the memory model. Resources reflect properties of hardware structures that can be accessed exclusively by one operation at a time. The instruction scheduling of the simulation compiler depends on this information, and the HDL code generator uses it for resource conflict resolution. Besides the definition of all objects, the resource section in a LISA processor description provides information about the availability of hardware resources. In older versions of the toolsuite the behavior section of a LISA operation was preceded by a header indicating which resources the operation needed to use and whether each resource was read, written, or both. Recent versions of the LISATek platform do not need this indication, because the various tools parse the behavior statements of the operations to trace every access to processor resources.
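To give an idea of how access information can be used for conflict detection, the sketch below takes, for each operation active in the same cycle, the set of resources it writes and reports the resources with more than one writer. The operation and resource names are hypothetical, and the write sets are listed by hand here, whereas the tools derive them from the behavior statements.

```python
def find_write_conflicts(operations):
    """Return the resources written by more than one operation in the same cycle.

    `operations` maps an operation name to the set of resources it writes.
    """
    writers = {}
    for op_name, written in operations.items():
        for resource in written:
            writers.setdefault(resource, []).append(op_name)
    return {res: ops for res, ops in writers.items() if len(ops) > 1}
```

For example, an execute-stage operation and a writeback-stage operation that both update the register file in the same cycle would be reported as competing for it.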

4.3.3 Instruction-set model

The instruction-set model identifies valid combinations of operation codes (opcodes), register or immediate operands and other parameters which define the operations the architecture must be able to execute. It is expressed by the assembly syntax and by the instruction-word coding, and these specifications define the set of legal operands and addressing modes for each instruction. Via this model compilers and assemblers can identify instructions, and the same information is used during the reverse process of decoding and disassembling.

The instruction-set model is specified by a pair of sections defined within the LISA operations:

• The CODING section describes the sequence of binary values which defines the instruction word.

• The SYNTAX section describes the notation of mnemonics and the assembly syntax of instructions, operands and execution modes; it may also contain a field for conditional execution.

These two sections are closely linked to each other by a number of fields and identifiers, which refer to the same object in both domains and also within the definition of the microcode (behavioral section). The DECLARE section contains local declarations of identifiers and other references for the immediate execution or activation of other LISA operations. The LISA language supports a very useful mechanism which enables the hierarchical structuring of the operations via a series of cross references between operations; a LISA code example is reported in figure 4.4. Through these references it is possible to create a LISA operation using other already defined operations, and this is an optimal approach for two reasons: first, the mechanism simplifies the behavioral description of an operation by breaking down the problem and distributing the necessary microcode over a number of operations; second, some pieces of microcode can be shared between various operations, and so can the hardware circuits which implement the same behavior. The hierarchical organization of the operations is realized using the two sections described above, so that parts of the instruction syntax and of the coding scheme call other operations, creating a tree structure of references (in some cases, of cross-references). This technique is fundamental for the decoding and the scheduling of the operations and is referred to as the coding root operation; an example of this scheme is reported in fig. 4.5. Starting from the binary sequence fetched from memory and stored in the instruction register, the various fields are parsed and the operations matching their coding are subsequently activated or called. This section is directly used by the assembler, the disassembler and the simulator, and is fundamental for the Processor Generator, which has to generate the instruction decoder.

    OPERATION arith_logic_grp IN pipe.DC
    {
        DECLARE
        {
            GROUP opcode = {AND_dc || EOR_dc || SUB_dc || RSB_dc || ADD_dc ||
                            ADC_dc || SBC_dc || RSC_dc || ORR_dc || BIC_dc};
            GROUP cond = {EQ_dc || NE_dc || CS_dc || CC_dc || MI_dc || PL_dc
                          || VS_dc || VC_dc || HI_dc || LS_dc || GE_dc || LT_dc || GT_dc
                          || LE_dc || AL_dc || unc_dc};
            GROUP S = {PSR_no_update_req || PSR_update_req};
            GROUP Rn, Rd = {reg_index};
            GROUP operand2 = {shifted_reg_operand || immediate_operand};
        }
        CODING{cond 0b00 operand2=[12..12] opcode S Rn Rd operand2=[0..11]}
        SYNTAX{opcode~ cond~ S~ " " Rd~"," Rn~ "," operand2 }
        ...
    }

Figure 4.4. Syntax and coding sections example.
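The field parsing implied by the CODING section of figure 4.4 can be mimicked with a short Python sketch. It assumes that the fields are listed from the most significant bit downwards, as in the standard ARM data-processing layout (4-bit condition, the fixed pattern 0b00, the immediate flag stored as bit 12 of operand2, 4-bit opcode, S bit, two 4-bit register indexes and the 12 low bits of operand2). The function name is hypothetical, and the sketch ignores other ARM instruction classes that share the same 0b00 pattern.

```python
# Opcode values of the group members, following the ARM data-processing encoding.
OPCODES = {0x0: "AND", 0x1: "EOR", 0x2: "SUB", 0x3: "RSB", 0x4: "ADD",
           0x5: "ADC", 0x6: "SBC", 0x7: "RSC", 0xC: "ORR", 0xE: "BIC"}


def decode_arith_logic(word):
    """Split a 32-bit word into the fields named in the CODING section.

    Returns None when the word does not match the fixed 0b00 pattern or the
    opcode is not a member of the group.
    """
    if ((word >> 26) & 0b11) != 0b00:
        return None
    opcode = (word >> 21) & 0xF
    if opcode not in OPCODES:
        return None
    return {
        "cond": (word >> 28) & 0xF,
        "immediate": (word >> 25) & 0x1,   # operand2=[12..12]
        "opcode": OPCODES[opcode],
        "S": (word >> 20) & 0x1,
        "Rn": (word >> 16) & 0xF,
        "Rd": (word >> 12) & 0xF,
        "operand2": word & 0xFFF,          # operand2=[0..11]
    }
```

With these assumptions the word 0xE0810002 decodes to an unconditional ADD with Rd = 0, Rn = 1 and a register operand, while a branch word such as 0xEA000000 is rejected because its fixed bits do not match.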

4.3.4 Behavioral model

The behavioral model defines the set of operations the hardware has to perform to execute the selected instruction. The abstraction level of this model can range from the hardware implementation level to the higher level of C statements. The BEHAVIOR and EXPRESSION sections within LISA operations are parts of the behavioral model, and their C code is executed directly during simulation. The EXPRESSION section is useful to return operand values, register indexes or other resource references, and execution modes used in the context of operations. Because arbitrary C code is accepted, function calls can be made to external libraries that are linked to the executable software simulator. Within the behavioral description of a LISA operation all the processor resources are visible and local variables can also be used, but values cannot be returned as in ordinary C functions.

Figure 4.5. A coding root example

4.3.5 Timing model

Focusing on the implementation of the architecture, the timing model specifies the activation sequence of hardware operations and units, for the execution of the codeword loaded in the instruction register or of the instructions stored in the various pipeline stages. The instruction latency information lets the compiler find an appropriate schedule for the operations and provides timing relations for their execution during simulation and hardware implementation. Several parts of a LISA model contribute to the timing model: the declaration of pipelines and their stages in the resource section, the assignment of operations to pipeline stages and, more explicitly, the ACTIVATION section in the operation description, used to activate other operations in the context of the current instruction. The activated operations are launched as soon as the instruction enters the pipeline stage to which they are assigned. Operations that are not assigned to a stage are executed in the pipeline stage of their activation. Predefined functions such as stall, shift, flush, insert, and execute, which are automatically provided by the LISA environment for each pipeline declared in the resource section, also affect the activation of the operations, introducing the required delays and respecting the status of all the pipeline stages. All these pipeline control functions can be applied to single stages as well as to whole pipelines. Using this very flexible mechanism, arbitrary pipelines, hazards and mechanisms like forwarding can be simply modelled in LISA.
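The effect of the pipeline control functions can be pictured with the following Python sketch of a four-stage pipeline. Only shift, stall and flush are modelled, and their semantics here are a simplifying assumption (a stall holds the given stage and all earlier ones for one cycle and inserts a bubble behind them), not a description of the actual LISA implementation. The stage names FE, DC, EX and WB are illustrative.

```python
class Pipeline:
    """Four-stage pipeline whose stages hold instruction names or None."""

    STAGES = ["FE", "DC", "EX", "WB"]

    def __init__(self):
        self.slots = {stage: None for stage in self.STAGES}
        self.stalled = set()

    def stall(self, stage):
        """Hold `stage` and every earlier stage during the next shift."""
        index = self.STAGES.index(stage)
        self.stalled.update(self.STAGES[: index + 1])

    def flush(self, stage):
        """Discard the instruction currently held in `stage`."""
        self.slots[stage] = None

    def shift(self, fetched=None):
        """Advance one cycle, moving each non-stalled instruction forward."""
        for i in range(len(self.STAGES) - 1, 0, -1):
            src, dst = self.STAGES[i - 1], self.STAGES[i]
            if dst in self.stalled:
                continue                     # destination keeps its content
            if src in self.stalled:
                self.slots[dst] = None       # bubble behind the stalled stage
            else:
                self.slots[dst] = self.slots[src]
        if "FE" not in self.stalled:
            self.slots["FE"] = fetched
        self.stalled.clear()
```

A stall requested on the decode stage therefore delays both the decoded and the fetched instruction by one cycle, which is the basic mechanism needed to model data hazards.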

4.3.6 Microarchitecture model

The microarchitecture model allows groups of LISA operations to be defined so that the HDL description of the architecture implements their functionality in a single unit. With this description the desired implementation of the microarchitecture can be obtained, and the generated HDL will contain different structural components such as decoders, ALUs, barrel shifters or other units defined in separate files. The grouping is done by listing the LISA operations which must be included in an HDL component in a UNIT section; the name of the section becomes the name of the corresponding HDL file. This method is also useful to identify the hardware organization in the post-generation phase, for data path optimization or verification purposes.

4.4 The LISATek model development tools

Besides the LISA language description, the LISATek toolsuite provides a series of useful GUI⁵-based tools aimed at simplifying processor modelling and debugging from the first development phases. These tools are:

• Processor Designer

• Instruction-set Designer

• Syntax Debugger

and are briefly described in the following sections.

4.4.1 The Processor Designer

The Processor Designer is the principal instrument for processor model development. It is essentially a GUI which allows all the parts of the project to be created and maintained, from LISA files to C-language headers and libraries. Through its intuitive interface, various directly linked LISATek tools and several menus and commands, it is possible to build and configure the dependent toolsuite parts, such as the HDL generator, the simulator and the other software tool generation processes. The most important part of the Processor Designer is the LISA language compiler and debugger, which provides a global view of all the files belonging to the project and of the configuration files; a text editor with many functions oriented to file writing and maintenance is also provided. The generation flow can be controlled by dedicated buttons, from the implementation of single tools to the complete set of necessary instruments; a useful window reports all the messages about the operations performed by the application, as well as compilation errors with detailed references. Access to the product documentation completes the functions of this fundamental component of the toolsuite.

⁵ Graphical User Interface

Figure 4.6. Processor Designer screenshot

4.4.2 The Instruction-set Designer

The Instruction-set Designer is an alternative method for describing the processor instruction set, which uses the LISA language files, and also an optimal way to obtain a graphical scheme of the modelled instruction set. With this tool instruction-set inconsistencies can be shown and corrected by analysing all the fields composing a single instruction in a very flexible graphical window. ISA instructions can be created, removed, modified and examined in depth using this tool, so it can be seen as a complementary method to the textual description written in LISA. The tool can be started from the Processor Designer environment and operates on the related project.

Figure 4.7. Instruction-set Designer screenshot

4.4.3 The Syntax Debugger

The Syntax Debugger can be started from the Processor Designer environment and allows single assembly instructions or pieces of code to be entered and the tokens written to be recognized step by step. The functions performed by the tool are very similar to the assembling process, but the Syntax Debugger provides a window which explains the assembly instruction encoding via LISA operation tracing, and another window which shows a report with all the references to the recognized parts of the instruction and its binary coding. The purpose of this part of the LISATek toolsuite is the debugging of the instruction set described within the model whose project is open in the Processor Designer.



Figure 4.8. Syntax Debugger screenshot

4.5 The architecture implementation

The principal guideline for the development of an ASIP is to obtain a highly optimized architecture for a specific application, so the automatically generated HDL code has to fulfill tight constraints in order to represent a valid alternative to entirely handwritten HDL code and to avoid many refinement steps. Some critical aspects of the target hardware, such as power consumption, chip area and computation speed, represent the most difficult challenges for this class of architectures, and some of their parts need to be optimized manually, particularly the data paths. Nevertheless, the LISATek Processor Generator generates well-optimized parts of the processor such as register files, pipeline structures and registers, the pipeline controller, the instruction decoder and all the control signals for the activation of functional units.

By using the LISA operation grouping capabilities, it is possible to obtain an organized set of HDL files which represent the different parts of the architecture; these parts can be accessed and modified to improve and optimize the hardware description, in order to obtain the desired performance. The hardware description of a functional unit, in fact, is generated from the behavioral statements expressed in the related LISA operations, and the resulting HDL code is a behavioral-style translation of the C statements. The burden of creating optimized data paths is postponed to the HDL synthesis step, and so are all the tests on their performance parameters. This is a limitation of the approach based on the LISATek toolsuite, as discussed in the previous paragraphs.

The LISATek Processor Generator produces the HDL description through the following main phases:

• Analysis of the resource section of the LISA description, to obtain information about the main structure of the architecture, such as pipeline stages and storage resources, and to instantiate the required components (registers, pipeline registers, memories, caches, intercommunication buses, input and output ports and all the controllers with the related control signals for these parts).

• Analysis of the grouping of the operations into functional units within the LISA description, so that the global structure of the HDL description can be instantiated. This phase identifies well-defined blocks within the architecture for further manual optimization, but units are also automatically grouped into pipeline execution stages such as the fetch, decode, execute and writeback stages. Since in LISA it is possible to assign hardware operations to pipeline stages, this information is sufficient to locate the functional units within the pipeline stages to which they are assigned, and this placement is done in any case.

• Generation of the instruction decoder, derived from the information in the LISA model reflecting the coding of the various instructions. Depending on the structure of the LISA architecture description, decoder processes are generated in several pipeline stages. The specified signal paths within the target architecture can be divided into data signals and control signals. The control signals are a straightforward derivation of the operation activation tree, which is part of the LISA timing model. The data signals are explicitly modelled by the designer by writing values into pipeline registers and implicitly fixed by the declaration of the used resources in the behavior sections of LISA operations.

For simulation and verification purposes, a SystemC description of the architecture can also be obtained via a specific generation tool. SystemC is often thought of as a hardware description language like VHDL and Verilog, but it is more aptly described as a system description language, since it exhibits its real power in transaction-level and behavioral modeling. In fact SystemC is a set of library routines and macros implemented in C++, which makes it possible to simulate concurrent processes, each described in ordinary C++ syntax.



4.6 The application software design

The possibility of automatically generating a C compiler, assemblers, linkers and ISA simulators from LISA processor models allows the designer to explore all the aspects of the design very quickly, testing functionality and discovering bugs or operations needing further optimization. In this section, the specific features and requirements of these tools are discussed.

4.6.1 Assembler and linker

The LISA assembler, lasm, processes the assembly source code files and transforms them into linkable object code⁶ for the target architecture. The basis for the implementation of this tool is the instruction-set information defined within the LISA description of the processor. Besides the processor-specific instruction set, the generated assembler provides a set of pseudoinstructions, usually called directives, useful for controlling the assembling process and for initializing the data the program will work on. Some directives allow the assembled code to be grouped into sections which can be positioned separately in memory by the linker. Symbolic identifiers for numeric values and addresses are standard assembler features and are supported as well; macro-assembler functionality is also implemented, so macro definition and invocation are supported by the tool.

After the assembling phase of the source code, the object file contains a number of symbols, i.e. references to local and global routines stored elsewhere in the memory space. To obtain a single object program it is necessary to resolve these references, retrieving the code to be linked to the main part of the application. The linking process, performed by the LISA linker llnk, is controlled by a linker command file that holds a detailed model of the target memory environment and an assignment table of the module sections to their respective target memories. This file has the extension “cmd” and a reference to it must be put into the makefile⁷. The LISA linker allows external memories to be used, i.e. memories separate from the architecture model, so that the code can be correctly linked directly there.
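The symbol resolution step can be summarized by the following Python sketch of a two-pass linker. It is a deliberately simplified model and does not reflect the object file format or the command file syntax used by llnk: memory is word-addressed, each module belongs to one section, and an unresolved reference is represented by the symbol name stored in place of the word. All names in the example are hypothetical.

```python
def link(modules, section_bases):
    """Place each module in its section and resolve symbol references.

    `modules` is a list of dicts with the keys:
      section - name of the section the module belongs to
      code    - list of words; a string entry is a reference to a symbol
      symbols - dict mapping a symbol name to its word offset in the module
    `section_bases` maps section names to absolute base addresses, playing
    the role of the memory assignment table.
    """
    next_free = dict(section_bases)
    placed, table = [], {}
    for module in modules:                        # pass 1: assign addresses
        base = next_free[module["section"]]
        next_free[module["section"]] = base + len(module["code"])
        placed.append((base, module))
        for name, offset in module["symbols"].items():
            table[name] = base + offset
    image = {}
    for base, module in placed:                   # pass 2: patch references
        for offset, word in enumerate(module["code"]):
            if isinstance(word, str):
                if word not in table:
                    raise KeyError(f"undefined symbol {word}")
                word = table[word]
            image[base + offset] = word
    return image
```

The first pass builds the global symbol table, so that in the second pass a reference can be resolved even when the symbol is defined in a module placed later in memory.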

4.6.2 Disassembler

The disassembler (ldasm) performs the opposite work to that of the assembler and the linker, accepting the linked object file, i.e. the executable application, as input and transforming it into an assembly file. The implementation of this tool also uses the instruction-set specifications contained in the LISA model, and the resulting disassembled file makes it possible to check that the assembling and disassembling operations are correct. The disassembled file reports the symbols and memory addresses defined in the source assembly file in a different form, because this representation is produced from the point of view of the memory system of the processor, so all the relative references written in the source files are replaced by absolute addresses.

⁶ Code directly executed by a computer’s CPU; machine code.
⁷ The makefile contains all the information for assembling and linking the source assembly file.
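The replacement of relative references by absolute addresses can be shown on the ARM branch instruction. The sketch below is not the output format of ldasm; the function name and the printed form are assumptions. It relies only on the ARM encoding of B, in which a signed 24-bit word offset is added to the address of the instruction plus 8. The condition field is ignored for brevity.

```python
def disassemble_branch(word, address):
    """Render an ARM B instruction with its absolute target address.

    The 24-bit signed word offset is relative to the instruction address
    plus 8, because the ARM program counter runs two instructions ahead.
    """
    if ((word >> 24) & 0xF) != 0b1010:
        raise ValueError("not a B instruction")
    offset = word & 0x00FF_FFFF
    if offset & 0x0080_0000:           # sign-extend the 24-bit field
        offset -= 0x0100_0000
    target = (address + 8 + (offset << 2)) & 0xFFFF_FFFF
    return f"B 0x{target:08X}"
```

A source line written as a branch to a label thus reappears in the disassembled file as a branch to a numeric address, which depends on where the linker placed the code.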

4.6.3 Simulator: the “Processor Debugger”

The simulator is the tool that shows how the described model behaves on a given input program. It consists of a number of C++ files describing the behavior of the processor model, and the simulation is controlled through the GUI of the LISATek Processor Debugger. The debugger accepts an object file as input and makes it possible to monitor various aspects of the model, such as registers, memories, internal signals, pipeline behavior and events. Dedicated windows show the disassembled code of the object file being executed, the original assembly file and the microcode of the individual LISA operations in execution, so the effect of every statement on the processor state can be checked. Like any standard debugger, the tool provides commands to control microcode execution, to single-step the program, to insert breakpoints and so on. The pipeline view is particularly rich in timing and stage-status information and supports profiling of the pipeline structure. Because the designer can develop the model at different abstraction levels, the LISA simulator can be generated with several techniques, which trade flexibility against simulation speed.

Depending on the purpose of the simulation (debugging, profiling, verification), the architectural accuracy of the model (instruction-accurate or cycle-accurate) and the target software (a DSP kernel or an operating system), the simulation technique that offers the best trade-off between performance and flexibility can be selected.

Interpretive simulation offers the greatest flexibility for testing and is the only technique that can be applied to every kind of architecture a LISA description can express. Its disadvantage is computation speed: every aspect of the model behavior must be interpreted from the files that describe the architecture, and the corresponding C code for the simulation must be generated step by step, without using compilation techniques.

At the other extreme, fully compiled simulation increases throughput, and therefore simulation speed, by executing the LISA operations with a multi-step method. The first step is instruction decoding, in which machine instructions, operands and addressing modes are recognized for each instruction word of the input object file; every instruction is decoded only once, even if it is executed repeatedly inside a loop. In this way decoding can be omitted at run time, which reduces simulation time. The next step is operation sequencing, which determines all the operations required to execute every instruction found in the application program. These operations are organized in a table whose index is used to call them at run time; when an operation is called, the behavioral part of the LISA operation is executed. The last step is operation instantiation and simulation loop unfolding, in which the operation scheduling, i.e. the timing of operation execution, is determined. In this step the behavior of the LISA operations is executed by calling the functions identified in the previous step, and the loops found in the code are unfolded to drive the simulation into the next state. This simulation method, however, can be applied only to instruction-set accurate models, or to cycle-accurate models without an instruction pipeline, and under the assumption of a constant program memory.
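The decode-once principle described above can be illustrated with a minimal C sketch. It is not the code produced by the LISATek tools: the toy instruction format (high nibble opcode, low nibble immediate) and all the names (`predecode`, `run`, `decoded_t`) are assumptions introduced only for this example. Every instruction word is decoded once into a table entry before the simulation starts, and the simulation loop only performs indexed calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: high nibble = opcode, low nibble = immediate. */
typedef struct { int32_t acc; } cpu_t;
typedef void (*behavior_fn)(cpu_t *cpu, uint8_t imm);

static void op_nop(cpu_t *cpu, uint8_t imm) { (void)cpu; (void)imm; }
static void op_add(cpu_t *cpu, uint8_t imm) { cpu->acc += imm; }
static void op_sub(cpu_t *cpu, uint8_t imm) { cpu->acc -= imm; }

/* One table entry per program word: behavior function and its operand. */
typedef struct { behavior_fn fn; uint8_t imm; } decoded_t;

/* "Compile time": every instruction word is decoded exactly once. */
static void predecode(const uint8_t *prog, size_t n, decoded_t *table)
{
    static const behavior_fn ops[] = { op_nop, op_add, op_sub };
    for (size_t i = 0; i < n; ++i) {
        uint8_t opcode = (uint8_t)(prog[i] >> 4);
        table[i].fn  = (opcode < 3) ? ops[opcode] : op_nop;
        table[i].imm = (uint8_t)(prog[i] & 0x0F);
    }
}

/* "Run time": no decoding, only indexed calls into the table,
 * even when the same instructions are executed repeatedly. */
static void run(cpu_t *cpu, const decoded_t *table, size_t n, unsigned repeat)
{
    for (unsigned r = 0; r < repeat; ++r)
        for (size_t i = 0; i < n; ++i)
            table[i].fn(cpu, table[i].imm);
}
```

Since the table is built from the program image before execution, the sketch also makes the constant program memory assumption visible: a word modified at run time would not be decoded again.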

Besides fully compiled simulation, other techniques perform some of the required steps at different moments, including at run time, reducing compilation time at the expense of overall simulation time. The dynamic scheduling method reduces simulation time by performing instruction decoding and operation sequencing at compile time, while operation scheduling is done during the simulation. Because decoding takes place before the simulation starts, this technique cannot be used in the presence of external memories or self-modifying program code; it is therefore less flexible, but it allows a simpler approach to the simulation of pipelined architectures. Pipeline control is described by a series of dedicated operations in the LISA files, and activating these operations at run time changes the simulation state whenever data or control hazards occur. Since this behavior is very difficult to predict, all operations are inserted into and removed from the pipeline dynamically, at the cost of a continuous pipeline maintenance that is very expensive at run time.

To avoid this time-consuming step at run time, a static scheduling approach can be applied. This is a very sophisticated method: starting from the analysis of the current processor state, and in particular of the pipeline state, the behavioral code must be generated for all the LISA operations encountered in the program and for all possible pipeline events, such as flushes, stalls, normal shifts, possible hazards, and operation activation and removal in every pipeline stage. Generating this large amount of simulator code takes considerable time before the simulation starts, but the simulation itself then runs faster than with the dynamic scheduling method.

The last simulation technique is the Just-In-Time Cache Compiled ISA Simulator (JIT-CCS), which builds on some of the aspects seen above. The basic idea of the JIT-CCS is to store the information extracted while decoding an instruction, so that it can be reused if the instruction is executed again. Although the JIT-CCS combines the benefits of compiled and interpretive simulation, it does not achieve the performance of statically scheduled simulation; it does, however, have some advantages. The JIT-CCS uses dynamic scheduling exclusively for instruction decoding and compilation and, since decoding is performed on the instruction register contents at simulator run time, external memories are supported and changes to the program memory are honoured.
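The caching idea can be sketched in C as follows. This is only an illustration under stated assumptions, not the JIT-CCS implementation: the direct-mapped cache, the decoder (top byte as opcode) and the names `jit_cache_t`, `lookup` and `decode_count` are hypothetical. The decoded information is stored together with the instruction word it was derived from, so a word that has changed in memory is decoded again.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 64u

typedef struct { uint8_t opcode; uint32_t operand; } decoded_t;

typedef struct {
    bool      valid;
    uint32_t  addr;   /* address the entry was built for */
    uint32_t  word;   /* instruction word seen at decode time */
    decoded_t info;   /* cached result of the decoding step */
} cache_line_t;

typedef struct {
    cache_line_t lines[CACHE_LINES];
    unsigned     decode_count;   /* how many times decode() was needed */
} jit_cache_t;

/* Hypothetical decoder: top byte = opcode, low 24 bits = operand. */
static decoded_t decode(uint32_t word)
{
    decoded_t d = { (uint8_t)(word >> 24), word & 0x00FFFFFFu };
    return d;
}

/* Return the decoded form of the word fetched at addr, decoding only
 * on a miss or when the word differs from the cached one. */
static const decoded_t *lookup(jit_cache_t *c, uint32_t addr, uint32_t word)
{
    cache_line_t *l = &c->lines[addr % CACHE_LINES];
    if (!l->valid || l->addr != addr || l->word != word) {
        l->valid = true;
        l->addr  = addr;
        l->word  = word;
        l->info  = decode(word);
        c->decode_count++;
    }
    return &l->info;
}
```

Comparing the cached word with the fetched one at every lookup is what allows self-modifying code and external memories to be handled, at the price of a check that a statically scheduled simulator does not pay.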

Figure 4.9. Processor Debugger screenshot

4.6.4 The C-Compiler

A C compiler can also be generated from the LISA description of the processor. This is done with the CoSy Compiler Development System, which follows a modular, engine-based concept: parsing and semantic analysis of the input files are performed in the front end, followed by optimizations and transformations of the compiler's intermediate representation, and by code generation in the compiler back end.


The compiler's back-end engines are generated by a tool that reads the so-called code generator description (CGD) files and produces the code selector, the scheduler and the register allocator. The CGD files are created from the LISA processor description by the LISA processor compiler, the tool that reads LISA processor descriptions and generates all the software tools and the hardware model; parts of the CGD description can be generated automatically from the LISA processor description. The C compiler itself is generated with the LISATek Compiler Designer, in which the designer describes the processor resources the compiler can use to generate optimal object code. Before the compiler is generated, some information must be given to the Compiler Designer, such as how the processor registers are to be used, the data and stack layout, some scheduler directives, and other specifications and language conventions. The resulting compiler, lcc, accepts C source files and generates object files that can be executed directly on the processor model by the other simulation tools.

4.7 The system integration and verification

The system integration and verification phase includes the major task of evaluating the trade-off between various figures of merit (e.g. speed, size and power consumption), in order to determine which parts of the overall system functionality should be implemented in software and which must be implemented in hardware (hardware-software partitioning). In addition, the generated hardware description must be verified, and for this purpose cosimulation interfaces are again required: they integrate both the hardware description of the processor and the software simulator into a system simulation environment, so that the two descriptions are stimulated by the same patterns or executable files. Changes to the modelled architecture may affect the interfacing system, so all these parts must be regenerated to guarantee correct communication and functionality of the cosimulation system.

The LISATek tool suite includes the System Integrator Platform, which provides system integration and verification capabilities and also makes it possible to integrate the model into the context of a whole system (SOC) comprising a mixture of different embedded processors, memories and interconnect components.

Verification of the complete system, including software, has in fact become the critical bottleneck in the SOC design process. This is due to the diversity of the sources of embedded components: hardware-software cosimulation exploits hardware and software design techniques that involve various languages (VHDL, Verilog, C/C++), formalisms and tools.

To support system integration and verification, the LISATek System Integrator Platform provides a well-defined application programming interface (API) to connect the instruction-set simulator generated from the LISA specification with other simulators. The API makes it possible to control the simulator by stepping, running and setting breakpoints in the application code, and provides access to the processor resources as the LISATek Debugger does.

Chapter 5

The LISARM model

This chapter describes how the ARM processor instruction set has been modelled using the LISATek tools and the LISA 2.0 language. The processor behavior is described using the LISA language only, and the resulting code is organized in several files. The Processor Designer [20] and the Processor Debugger [22] were the only tools used to write and test the code, while the Syntax Debugger and the Instruction-set Designer played a minor role in some consistency checks. The automatically generated simulator and assembler were fundamental for exploring the model behavior and the LISA capabilities: with them the model was built step by step, alternating code writing and testing.

Many powerful features offered by the language were used, in order to enable the reuse of LISA operations and to obtain a noteworthy level of hardware optimization. Because some capabilities are supported by the LISA language and the simulator but not by the HDL Generator, problems arose in creating the unidirectional and bidirectional data buses that ARM provides; the cause is that some memory management operations are not yet supported for hardware implementation.

The described processor respects the instruction cycle times given in paragraph 3.5 or in [16], its memory interface features (section 3.7) are also compatible, and its datapath structure is quite similar to the original one. The whole model is described in the following paragraphs.

5.1 The model structure

The ARM instruction set and the behavior of all the processor parts are described in the LISA 2.0 language, and the model is organized by dividing the code into the following files:

• main.lisa

• arm.h

• conditionfield.lisa

• alu_operations.lisa

• barrel_shifter_operations.lisa

• multiplier.lisa

• branch_instructions.lisa

• data_proc_instructions.lisa

• mem_access_instructions.lisa

• multiply_instructions.lisa

• misc_ops.lisa

• other_instructions.lisa

The main.lisa file contains the definition of the fundamental processor resources and features, together with the principal LISA operations, which describe the model behavior at every new clock cycle, in the presence of a reset signal, or when an interrupt request arises. Some of these parts are discussed in section 5.1.1, while the main operation of the model is described in paragraph 5.1.2. The decoding mechanism and the method used to describe the instruction set are discussed in section 5.1.3, while the behavior of the various instructions is described group by group in the subsequent paragraphs.

The arm.h file is a standard C header file containing many definitions of enumerated types and labels, such as the names of the operating states, the processor states, the barrel shifter and ALU operations, and the ARM assembly condition mnemonics. It also defines a number of constant masks, used for sign extension, arithmetic and logical operations, and sub-word data types for memory access. Further define directives simplify access to the PSR flags and the assignment of external signals through LISA predefined methods. All the exception vectors are also defined in the header file (ref. section 3.4).

5.1.1 Processor resources, interface, internal units

The processor resources are defined in the LISA model through specific mechanisms and dedicated data types; all their definitions are in the main file (main.lisa), where features such as the memory model, internal signals, interfaces, registers and the pipeline structure are defined. Starting with the memory model, a single memory is defined, because ARM adopts a Von Neumann architecture and therefore does not support more than one memory area or separate memories for code and data. The memory is implemented as a collection of 32-bit unsigned integers, and a 32-bit address bus is defined. The memory model is a problem for the ARM description, because fetching 16-bit or 32-bit instructions while maintaining single-byte addressing is not possible within a LISA processor model, where a new 32-bit instruction is fetched at every program counter increment. This problem is described in section 6.2, which gives some guidelines for memory wrapping.

The Instruction Set Architecture adopts a little-endian organization, which cannot be changed dynamically during execution because this feature is not supported by the LISATek tools. This is another issue that must be resolved by an external interface if a big-endian organization is needed.

To define the processor operating state and mode, two global variables are used: processor_state and processor_mode. The main operation reads their content at every execution cycle and updates the mode bits and the T-bit in the CPSR. Since the Thumb instruction set is not implemented yet, the T-bit is ignored by all the LISA operations, so execution always takes place in ARM state. The processor_mode variable, on the other hand, is used every time a banked register is read or written, in order to select the bank to be accessed. The variable values are assigned in the behavior section of various operations using the definitions in the arm.h header file, which respect the ARM operating mode identifiers [16]. The current program status register (CPSR) and the banked saved PSRs are implemented with a 32-bit CXBit data type, a particular LISA data type that allows single bits to be extracted and modified through specific predefined methods. The register file and the other banked general-purpose registers are defined as integer variables, but in the execution units their values are assigned to CXBit variables to make single bits accessible. The LISA language performs type conversions through very flexible functions, while some casts are implicit and can be omitted. All PSRs and GPRs are defined as clocked registers, to ensure that their values are updated with the timing and behavior of the real architecture.
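How a mode variable can select the register bank is sketched below in C. The mode values are the architectural CPSR mode-bit encodings; the structure layout and the names `regfile_t` and `reg_select` are assumptions made for this example and do not reproduce the LISA resource definitions of the model.

```c
#include <stdint.h>

/* ARM operating mode identifiers (CPSR mode bits M[4:0]). */
enum {
    MODE_USR = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
    MODE_ABT = 0x17, MODE_UND = 0x1B, MODE_SYS = 0x1F
};

/* Hypothetical register file layout with the banked copies. */
typedef struct {
    uint32_t r[16];    /* user/system bank, r[15] is the program counter */
    uint32_t fiq[7];   /* r8_fiq .. r14_fiq */
    uint32_t irq[2];   /* r13_irq, r14_irq */
    uint32_t svc[2];   /* r13_svc, r14_svc */
    uint32_t abt[2];   /* r13_abt, r14_abt */
    uint32_t und[2];   /* r13_und, r14_und */
} regfile_t;

/* Return the physical register addressed by index idx in the given mode. */
static uint32_t *reg_select(regfile_t *rf, unsigned processor_mode, unsigned idx)
{
    if (processor_mode == MODE_FIQ && idx >= 8 && idx <= 14)
        return &rf->fiq[idx - 8];
    if (idx == 13 || idx == 14) {
        switch (processor_mode) {
        case MODE_IRQ: return &rf->irq[idx - 13];
        case MODE_SVC: return &rf->svc[idx - 13];
        case MODE_ABT: return &rf->abt[idx - 13];
        case MODE_UND: return &rf->und[idx - 13];
        default: break;   /* user and system modes share the same registers */
        }
    }
    return &rf->r[idx & 15u];
}
```

Routing every register access through a single selection function keeps the banking logic in one place, which is one possible way to mirror the role the text assigns to processor_mode.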

The pipeline structure is described very simply in LISA by listing the identifiers of the individual stages, i.e. PF (prefetch), FE (fetch), DC (decode), EX (execute) and ED (execute dummy stage), as can be seen in figure 5.1. The prefetch stage instantiated here is not explicitly defined in the ARM architecture [16], but a prefetch operation is executed and affects all the operations that involve the program counter contents. The last stage (execute dummy stage) is necessary to exploit the polling operation mechanism, which allows a LISA operation to reactivate itself in the following machine cycle by stalling the subsequent pipeline stage. For this reason it is a dummy stage and, since it does not perform any operation, the processor generator does not implement it in HDL.

[Figure 5.1: stages PF (prefetch), FE (fetch), DC (decode), EX (execute) and ED (execute dummy), separated by the pipeline registers PF/FE, FE/DC, DC/EX and EX/ED]

Figure 5.1. The LISARM pipeline structure

Every pair of adjacent pipeline stages is separated by a register, which contains a number of signals used to transfer information from one stage to the next. The pipeline registers are created automatically for all stages and are referred to as PF/FE, FE/DC, DC/EX and EX/ED. All these registers are generated in the HDL files but, if their outputs are not used by other components, they are not implemented in the synthesized hardware. The pipeline structure is reported in figure 5.1.
The registers used are:

• the instruction register and the program counter pc.

• reg1_i, reg2_i and reg3_i, which store the indexes of the operand registers.

• regd_i, which indicates the destination register.

• op2, which stores immediate values for the second operand.

• registered_opd, which indicates whether the second operand is a register or an immediate value.

• PSR_update_f, PSR_select, PSR_flag_assign and PSR_transfer_f, for operations that involve one of the PSRs.

• bs_op, registered_bs_amount, bs_amount, bs_special_op_f and bs_amount32_f, for the barrel shifter setup.

• alu_op, for the ALU setup.

• write_result_f, which enables writing the ALU result to the destination register.

• mem_access_f, writeback_f, pre_npost_indexed_f and byte_access_f, for the memory access instructions.

• mul_acc_f, for the multiply and accumulate instructions.

Most of them transfer register indexes for register file management and setup values for the datapath (barrel shifter, 32x8 multiplier, ALU) from the decode stage to the execute stage. The instruction register is meaningful only in the first two stages, because the decode stage splits the individual instruction fields into the signals listed above.
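As an illustration of how the decode stage can split an instruction word into such signals, the following C sketch extracts the fields of an ARM data processing instruction. The bit positions are those of the ARM instruction format; the mapping of each field onto the names used in the list above (for instance Rn onto reg1_i, or the raw 12-bit operand field onto op2) is an assumption of this example, and `dc_ex_reg_t` and `decode_data_proc` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the DC/EX pipeline register. */
typedef struct {
    uint8_t  reg1_i;          /* first operand register index (Rn) */
    uint8_t  regd_i;          /* destination register index (Rd) */
    uint16_t op2;             /* raw 12-bit second operand field */
    bool     registered_opd;  /* true: second operand comes from a register */
    bool     PSR_update_f;    /* S bit: update the condition flags */
    uint8_t  alu_op;          /* 4-bit data processing opcode */
} dc_ex_reg_t;

/* Split an ARM data processing instruction word into pipeline signals. */
static dc_ex_reg_t decode_data_proc(uint32_t instr)
{
    dc_ex_reg_t r;
    r.registered_opd = ((instr >> 25) & 1u) == 0;      /* I bit clear */
    r.alu_op         = (uint8_t)((instr >> 21) & 0xFu);
    r.PSR_update_f   = ((instr >> 20) & 1u) != 0;
    r.reg1_i         = (uint8_t)((instr >> 16) & 0xFu);
    r.regd_i         = (uint8_t)((instr >> 12) & 0xFu);
    r.op2            = (uint16_t)(instr & 0xFFFu);
    return r;
}
```

For example, the word 0xE0810002 (ADD r0, r1, r2) yields opcode 4 with a register second operand, while 0xE3B00001 (MOVS r0, #1) yields opcode 13 with an immediate operand and the flag update enabled.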

Many other variables, registers and signals are defined in the main.lisa file. They are all global resources, and are therefore accessible by all the LISA operations defined in the model. Worth noting among them are the wires carrying control signals and data for the ALU, the multiplier and the barrel shifter, some registers for the memory interface, the result and other auxiliary variables for the datapath, the register_list register for the block data transfer operations, and the cycle_c and branching_c counters.

The main clock and reset inputs are generated automatically, as are the data and address buses; the memory interface signals, such as the MAS, nRW, LOCK, nMREQ and SEQ outputs and the ABORT input, are defined in the resource section of the file. Besides these signals, the FIQ and IRQ inputs for interrupt management are defined, together with the coprocessor handshaking signals CPI, CPA and CPB, although the coprocessor instructions are not implemented in the model. An output signal was added to the model: the 2-bit BS line, which tells the memory management unit which byte within a word is selected for the transfer; some details about this signal are given in section 5.8 and in paragraph 6.2.

In order to maintain a hierarchy in the generated HDL files, all the LISA operations used in the model are grouped into units with the dedicated language constructs in the main.lisa file. Besides the fetch and prefetch units, which contain only the operation of the same name, the decoder unit groups all the decoding LISA operations. Four separate units are dedicated to the ALU, barrel shifter, multiplier and condition checking operations; all the operations that contain the behavior of the ARM instructions are grouped in branch_exec, data_processing_exec, memory_access_exec, multiply_exec and other_ops_exec. The HDL Generator respects this grouping scheme and generates a separate file and unit (a component in VHDL or a module in Verilog) for every LISA unit defined.

5.1.2 The main LISA operation

The main operation, described in the main.lisa file, is not assigned to a pipeline stage, since it drives all the pipeline control signals and controls the behavior of the individual stages. For this purpose the main operation is executed automatically at every clock cycle, but before its first execution an initial processor state must be forced.

Processor initialization is performed by the reset LISA operation, which is executed automatically every time the reset input is driven low and before every simulation run in the Processor Debugger. It initializes all the internal registers and flags to zero, sets the processor state to ARM and the processor mode to "user", and assigns the CPSR accordingly. After a reset request the real ARM processor resumes execution in supervisor mode but, for debugging purposes, user mode is assigned here. Since the program counter is reset, the first instruction is fetched at address 0x00: to avoid fetching the exception vectors that follow in memory, every program must begin with a branch instruction.

When the main operation is executed, the condition global variable is evaluated and one of the operations described in section 5.1.3 is called directly to check whether the condition expressed in the instruction is satisfied by the current PSR flags. The result of the evaluation is assigned to the condition_valid global flag and is used by the other LISA operations, when an instruction enters the execute stage, to establish whether the instruction must be executed.

The main operation also contains two LISA-specific statements: the first enables the concurrent execution of all the operations activated in the pipeline stages, the second performs the pipeline shift, which transfers all the pipeline register values from each stage to the next.

The exception handling capabilities of the ARM processor are also implemented by the main operation, which monitors some specific input signals. Unless they are masked by the F-bit and the I-bit of the CPSR (ref. section 3.3.5), the nFIQ and nIRQ inputs are checked, and an interrupt request on these lines causes the exception vector to be assigned immediately to the program counter, i.e. a jump to the handler, as defined in section 3.4. If a multi-cycle operation is being executed, however, the interrupt request cannot be served immediately, and exception handling is postponed to the next machine cycle, when the relevant input signals are checked again. To establish whether an operation requiring more than one machine cycle is in progress, the LISA predefined <pipeline stage>.stalled() method is used.

When a memory access is performed, the ABORT input line can be driven high by the memory management unit (MMU). This occurs when the memory system is unable to retrieve data or instructions because of access or addressing problems (ref. sections 3.7 and 3.4). The main operation has to monitor the ABORT signal cycle by cycle, in order to raise the proper data abort or instruction abort exception when one of them arises. To do so, the exception vector is assigned to the program counter without checking whether a multi-cycle operation is being executed or the pipeline is stalled. The abort exception handler has to manage the situation by itself, without particular hardware mechanisms, as described in section 3.4.
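The exception checks performed cycle by cycle can be summarized with a small C sketch. The vector addresses and the CPSR bit positions are the architectural ARM ones, and the sketch assumes the architectural convention in which a set I or F bit masks the corresponding request. The structure `inputs_t`, the function `pending_vector` and the order in which the sources are examined are assumptions of this example, not a transcription of the LISA main operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARM exception vector addresses. */
enum { VEC_RESET = 0x00, VEC_UNDEF = 0x04, VEC_SWI = 0x08, VEC_PABT = 0x0C,
       VEC_DABT = 0x10, VEC_IRQ = 0x18, VEC_FIQ = 0x1C };

#define CPSR_I (1u << 7)   /* set: IRQ disabled */
#define CPSR_F (1u << 6)   /* set: FIQ disabled */

/* Hypothetical snapshot of the signals inspected in one machine cycle. */
typedef struct {
    uint32_t cpsr;
    bool nFIQ, nIRQ;    /* active-low interrupt request inputs */
    bool abort;         /* ABORT input driven high by the MMU */
    bool data_access;   /* true: the aborted cycle was a data transfer */
    bool stalled;       /* a multi-cycle operation is in progress */
} inputs_t;

/* Return true and write the vector if the program counter must be redirected. */
static bool pending_vector(const inputs_t *in, uint32_t *vector)
{
    if (in->abort) {                    /* handled even if the pipeline is stalled */
        *vector = in->data_access ? VEC_DABT : VEC_PABT;
        return true;
    }
    if (in->stalled)                    /* interrupts wait for the next cycle */
        return false;
    if (!(in->cpsr & CPSR_F) && !in->nFIQ) { *vector = VEC_FIQ; return true; }
    if (!(in->cpsr & CPSR_I) && !in->nIRQ) { *vector = VEC_IRQ; return true; }
    return false;
}
```

The early return on the stall condition corresponds to the postponement of interrupt handling described above, while the abort branch is deliberately placed before it.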


In the activation section the prefetch, fetch and decode operations are activated, but only if the pipeline is not stalled. The prefetch LISA operation is responsible for fetching instructions from memory, so it increments the program counter or, if a branch is being executed, assigns the branch destination address to the program counter. For this purpose the BPC global variable is used, in which the branch operations store the address calculated by the ALU. Some flags and counters are used to recognize the successive branching cycles, and these signals carry information between the prefetch operation and the LISA operations involved in branch execution. Since byte addressing cannot be realized in the LISA model, an Internal Fetch Address (IFA) is calculated and used by the specific LISA function that performs the instruction fetch. Some other statements in the operation are present only for debugging purposes and are excluded from HDL generation with a C preprocessor pragma directive.

The prefetch operation directly drives two of the memory interface signals, the nMREQ and SEQ outputs, which tell the MMU what type of memory cycle is being performed (ref. section 3.7). In order to select the appropriate sub-word data, the BS signal is assigned cycle by cycle the value of the least significant bits of the program counter for an instruction fetch, and of mem_data_reg for a data fetch; the reasons for this choice are explained in section 6.2. To do so, the branching and mem_access_f flags, along with some other pipeline registers, are inspected at every cycle. Since the prefetch operation is activated only if the pipeline is not stalled, these signals keep their values during stall cycles, which is consistent with the behavior of the ARM processor.
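The relation between a byte address, a word-organized memory and a 2-bit byte selection can be sketched in C as follows. This is only an assumption about how such a split may look: the thesis does not give the formula used for the Internal Fetch Address, and the names `word_index`, `byte_select` and `read_byte`, as well as the memory size, are hypothetical. A little-endian organization is assumed, as stated for the model.

```c
#include <stdint.h>

#define MEM_WORDS 1024u   /* hypothetical size of the word-organized memory */

/* Word index used to address a memory of 32-bit words. */
static uint32_t word_index(uint32_t byte_addr)
{
    return (byte_addr >> 2) % MEM_WORDS;
}

/* 2-bit byte selection: the two least significant address bits. */
static unsigned byte_select(uint32_t byte_addr)
{
    return byte_addr & 3u;
}

/* Little-endian byte read from the word memory. */
static uint8_t read_byte(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t word = mem[word_index(byte_addr)];
    return (uint8_t)(word >> (8u * byte_select(byte_addr)));
}
```

With the word 0x44332211 stored at word index 1, byte address 4 selects 0x11 and byte address 7 selects 0x44.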

The fetch operation does not execute any statement and simply activates thesubsequent stage operation, i.e. the decode operation.

Since the main and prefetch operations assign only the program counter and the instruction register, all the other pipeline flags and registers are automatically initialized to zero, so there is no need to reset them manually in the various LISA operations. To enable a particular function of the execute stage, therefore, only the specific flag or signal has to be assigned. For debugging purposes many pipeline flags are used in the model, to give better control over the processor behavior during simulation. This redundancy could be reduced by optimizing the number of signals used, but doing so implies a sort of additional decoding phase in the execute stage, whose effects can heavily affect the critical paths of the stage. The complexity can be reduced by taking into account parameters that only the HDL synthesis phase can provide, so that these modeling aspects can be analysed in depth.


5.1.3 The coding tree and the decoding mechanism

The decode operation decodes the instruction loaded into the instruction register; to do so, the entry point (coding root) of the step-by-step decoding process is defined in its behavior section. The various instruction groups are defined in the declare section of the operation, and the LISA decoding mechanism evaluates the binary format of the instruction against the coding sections of the operations to establish which of them must be activated in the next step. The resulting coding tree is formalized in figure 5.2; the generation of the assembler and the disassembler follows a similar approach to associate the syntax sections and the coding sections of the LISA operations defined in the model.

[Figure 5.2: coding tree. From the coding root: multiply_grp (MUL, MLA, MULL, MLAL); data_proc_grp, divided into arith_logic_grp (AND, EOR, SUB, ADD, ADC, SBC, ORR, BIC), mov_grp (MOV, MVN) and cmp_grp (CMP, CMN, TST, TEQ); mem_access_grp, divided into std_data_grp (LDR, STR), su_data_grp (LDRH, STRH, LDRSH, LDRSB) and block_data_grp (LDM, STM); PSR_access_grp (MRS, MSR); branch_grp (B, BL, BX); other_grp (SWP, SWI)]

Figure 5.2. LISARM coding tree diagram

The behavior of every group of instructions is described by two sets of LISA operations: the decoding operations (<instruction>_dc) and the main operations of the execute stage (<instruction>_ex). The former inspect the instruction coding and assign a set of pipeline registers and flags, defined to store all the configuration information for the datapath and the other parts of the execute stage. A list of all the LISA operations used in the model description is given in appendix A. In the subsequent clock cycle the execute stage operations read this information and assign the values stored in the pipeline registers to the wires of the multiplier, the barrel shifter and the ALU. Before the execute stage operations are performed, however, the conditional execution modifiers are evaluated. Conditional execution is controlled by two sets of LISA operations: the <cond>_dc operations are activated by the instruction decoding operations and set the condition pipeline register; the <cond>_ex operations, belonging to the execute pipeline stage, check whether the condition holds by inspecting the CPSR flags. If it does, the condition_valid global flag is set and the instruction is executed. Here cond stands for one of the condition mnemonics listed in table 3.3, which also reports the boolean expression used to evaluate each condition. The execute stage operations are called directly by the main LISA operation, where the pipeline stall events are monitored before the scheduled operations are executed. All the conditional LISA operations are described in the conditionfield.lisa file, which also contains an alias operation for decoding those assembly instructions that have no condition suffix.
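The check performed on the CPSR flags can be written compactly in C. The boolean expressions are the standard ARM condition codes; the function name `condition_valid` echoes the flag used in the model, but the single switch statement is an assumption of this sketch, since the model describes each condition with a separate LISA operation.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_N (1u << 31)
#define FLAG_Z (1u << 30)
#define FLAG_C (1u << 29)
#define FLAG_V (1u << 28)

/* Evaluate the 4-bit condition field against the CPSR flags. */
static bool condition_valid(unsigned cond, uint32_t cpsr)
{
    bool n = (cpsr & FLAG_N) != 0;
    bool z = (cpsr & FLAG_Z) != 0;
    bool c = (cpsr & FLAG_C) != 0;
    bool v = (cpsr & FLAG_V) != 0;

    switch (cond & 0xFu) {
    case 0x0: return z;              /* EQ */
    case 0x1: return !z;             /* NE */
    case 0x2: return c;              /* CS */
    case 0x3: return !c;             /* CC */
    case 0x4: return n;              /* MI */
    case 0x5: return !n;             /* PL */
    case 0x6: return v;              /* VS */
    case 0x7: return !v;             /* VC */
    case 0x8: return c && !z;        /* HI */
    case 0x9: return !c || z;        /* LS */
    case 0xA: return n == v;         /* GE */
    case 0xB: return n != v;         /* LT */
    case 0xC: return !z && n == v;   /* GT */
    case 0xD: return z || n != v;    /* LE */
    case 0xE: return true;           /* AL */
    default:  return false;          /* 0xF: treated here as never executed */
    }
}
```

An instruction without a condition suffix corresponds to the AL case, which is presumably the role of the alias operation mentioned above.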

5.2 The processor datapath

The ARM processor datapath consists of a 32x8 multiplier, a 32-bit barrel shifter and the arithmetic logic unit (ALU) [18], interconnected with the register file and some other registers as shown in figure 5.3. The behavior of these blocks is described in the following subsections, and their port maps reflect most of the characteristics of the original core.

Since their behavior is described in the LISA language, it is important to underline that the units are activated in sequence: the multiplier activates the barrel shifter, and the shifter activates the ALU. The LISA activation mechanism is exploited for this, and the write_result operation, described in section 5.3, also belongs to this activation chain, since it is activated by the ALU operation. By setting a particular Processor Designer option, every subsequently scheduled operation, even within the same pipeline stage, is performed only after the statements of the behavior section have been executed.

5.2.1 The barrel shifter unit

The barrel shifter operations are described in a separate file (barrel_shifter.lisa), and the modeling style yields an independent HDL unit, which receives an operand and some control signals as inputs and provides the second operand for the ALU as output.

[Figure 5.3: datapath blocks (register file, 32 x 8 multiplier, barrel shifter) with the A_bus and B_bus connections and the C_flag, bs_carry_out and carry_out signals]

Figure 5.3. LISARM datapath scheme

The description consists of two main operations, barrel_shifter_op_dc and barrel_shifter_op_ex: the former belongs to the decode stage, the latter to the execute stage. When the barrel_shifter_op_dc operation is executed, it simply activates the required sub-operation among ASL, LSL, LSR, ASR and ROR. The meaning of these mnemonics was discussed in section 3.5.4; apart from defining the assembly syntax, the behavior section of these LISA operations sets the pipelined control signals for the barrel shifter operation to be performed in the following machine cycle. Two alias operations are used. Since logical and arithmetic shift left are the same operation with different assembly mnemonics, ASL is an alias of LSL. The other alias operation is RRX, whose coding uses the ROR #32 combination and sets a dedicated control signal.

The barrel shifter functions, which must be implemented in hardware, are described in the barrel_shifter_op_ex operation. Here the required operation is selected by reading the pipelined control signals: the 2-bit bs_op line is inspected to establish which type of shift must be performed. The shift amount is supplied by the 5-bit bs_amount pipeline register, which can express all values between 0 and 31. bs_special_op and bs_amount32 are two flags used to identify particular encoded shifter operations. The former tells that a special operation, among those discussed at page 70, is necessary; the latter specifies that a 32-bit shift is required. By combining these two signals, the output operand and bs_carry_out can be set as defined by the ARM data sheet [16]. Figure 5.4 reports the complete barrel shifter port map, where bs_opd is the 32-bit input, alu_opd2 is the output directly destined to the ALU second operand input and bs_carry_out contains the last bit shifted out during the operation.

Figure 5.4. LISARM barrel shifter (inputs: bs_opd, bs_op (2 bits), bs_amount (5 bits), bs_special_op, bs_amount32, C_flag; outputs: alu_opd2, bs_carry_out)

The shift operation description (left and right, arithmetic or logical) uses C language shift operators, and a particular trick is used to perform the logical shift right operation¹. Instead of executing a shift by the specified amount, many single-bit shifts are executed and, at every new step, the most significant bit is reset. The rotate right operation is described with a switch statement which selects one of the thirty-two possible values of the rotate amount. In every case branch a for loop performs a shift right by a single bit per iteration, saving the shifted-out bit and reinserting it at the most significant position. This description approach seems redundant, but it is the only one possible, because the LISATek HDL Generator does not accept variable indexes in “for” loops. To extract and modify single bits within an operand, the input value is loaded into a 33-bit CXBit variable and some of the predefined LISA language methods are exploited. By using the LISA casting functions the result is then converted into a suitable 32-bit data type for the barrel shifter output and a single bit for the carry out. The activation section hands control over to alu_operation for the selected computation.

¹ The C operator “>>” performs an arithmetic shift right, and an operator for logical right shift is not defined.
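The single-bit stepping described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the numeric values of the bs_op encoding are hypothetical, the special encodings driven by bs_special_op and bs_amount32 are not modeled, and a shift amount of zero is treated as a plain pass-through that returns C_flag as carry.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding of the 2-bit bs_op line; the real values are not given in the text. */
enum { BS_LSL = 0, BS_LSR = 1, BS_ASR = 2, BS_ROR = 3 };

typedef struct {
    uint32_t alu_opd2;      /* shifted operand sent to the ALU */
    unsigned bs_carry_out;  /* last bit shifted out */
} bs_result_t;

bs_result_t barrel_shift(uint32_t bs_opd, unsigned bs_op,
                         unsigned bs_amount, unsigned c_flag)
{
    assert(bs_op <= BS_ROR && bs_amount < 32u);
    bs_result_t r = { bs_opd, c_flag & 1u };
    uint32_t sign = bs_opd & 0x80000000u;   /* sign bit replicated by ASR */

    /* One single-bit step per iteration, as in the model description. */
    for (unsigned i = 0; i < bs_amount; i++) {
        uint32_t v = r.alu_opd2;
        switch (bs_op) {
        case BS_LSL:
            r.bs_carry_out = v >> 31;
            r.alu_opd2 = v << 1;
            break;
        case BS_LSR:
            r.bs_carry_out = v & 1u;
            r.alu_opd2 = (v >> 1) & 0x7FFFFFFFu;            /* reset the MSB */
            break;
        case BS_ASR:
            r.bs_carry_out = v & 1u;
            r.alu_opd2 = ((v >> 1) & 0x7FFFFFFFu) | sign;   /* keep the sign */
            break;
        default: /* BS_ROR */
            r.bs_carry_out = v & 1u;
            r.alu_opd2 = ((v >> 1) & 0x7FFFFFFFu) | ((v & 1u) << 31);
            break;
        }
    }
    return r;
}
```

With an unsigned C type the explicit MSB mask is redundant; it is kept here only to mirror the trick used in the model. The restriction on variable loop bounds applies to the HDL generator and not to plain C, so a single loop replaces the thirty-two switch branches.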

5.2.2 The arithmetic logic unit

The ALU functionalities are described using the same method seen in the previous paragraph (5.2.1), so that the HDL generator produces a block which respects the scheme reported in figure 5.5.

Figure 5.5. LISARM arithmetic logic unit (inputs: alu_opd1, alu_opd2, alu_op (4 bits), C_flag; outputs: result, carry_out)

The port map is the same as that of the ARM architecture and all the input signals, except the second operand alu_opd2 coming from the barrel shifter, belong to the DC/EX pipeline register output. The carry-in signal is selected between the barrel shifter carry out and the CPSR C flag, depending on the operation to perform. All the ALU operations are defined within the alu_operations file and the fundamental LISA operation is alu_operation, which reads the selected operation from the 4-bit alu_op pipelined signal and selects the logical or arithmetic operation to call. There is a set of sixteen LISA suboperations which can be called and which correspond to the operations provided by the ARM data processing instructions. These suboperations can be used both by data processing instructions and by other ones, e.g. by the branch and memory access instructions to perform address calculations. The LISA suboperations are identified by alu_<opcode>, where opcode is the mnemonic defined in table 3.4. The action performed by each operation is also reported in the same table and the result is assigned to a global variable, so that it can be accessed by every other operation within the model. For the basic ALU operation description a C-like syntax is used, so it can be considered behavioral style modeling. In order to update the CPSR N flag, bit 31 of the result is assigned to a single-bit variable; this information is also useful to establish whether an overflow has occurred during an addition or subtraction².

The CPSR flags update is executed only if the pipelined PSR_update_f flag is set, hence alu_operation activates some further operations depending on the ALU calculation performed. The ADD/ADC, SUB/SBC and RSB/RSC operations are executed in different ways and their overflow condition is checked with different techniques; for this reason three LISA operations are provided: add_PSR_update, sub_PSR_update and rsb_PSR_update. A fourth operation, logop_PSR_update, is activated directly by alu_operation for logical instructions and by the three operations listed above for arithmetic instructions.

Not all the ALU operations need the result to be stored in a general purpose register; an address calculation, for example, must be stored in the memory access register. The main operation establishes whether and where the result must be written by checking the write_result_f and mem_access_f pipelined flags respectively, activating either write_result or mem_access. These operations are described in the section on other operations (5.3).
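The sign-based overflow check mentioned for the flag update can be illustrated with a short C sketch. The function and type names below are hypothetical and are not taken from the model; the carry convention for subtraction (C set when no borrow occurs) follows the ARM definition.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned n, z, c, v; } alu_flags_t;

/* Addition with carry-in; overflow is derived from operand and result signs. */
uint32_t alu_add(uint32_t a, uint32_t b, unsigned carry_in, alu_flags_t *f)
{
    assert(f != 0 && carry_in <= 1u);
    uint64_t wide = (uint64_t)a + b + carry_in;
    uint32_t r = (uint32_t)wide;
    unsigned sa = a >> 31, sb = b >> 31, sr = r >> 31;
    f->n = sr;                          /* bit 31 of the result */
    f->z = (r == 0u);
    f->c = (unsigned)(wide >> 32);      /* carry out of bit 31 */
    f->v = (sa == sb) && (sr != sa);    /* same-sign operands, different-sign result */
    return r;
}

/* Subtraction a - b; overflow needs operands of different sign. */
uint32_t alu_sub(uint32_t a, uint32_t b, alu_flags_t *f)
{
    assert(f != 0);
    uint32_t r = a - b;
    unsigned sa = a >> 31, sb = b >> 31, sr = r >> 31;
    f->n = sr;
    f->z = (r == 0u);
    f->c = (a >= b);                    /* ARM convention: C = NOT borrow */
    f->v = (sa != sb) && (sr != sa);
    return r;
}
```

The different sign conditions for addition and subtraction are one reason why separate update operations per instruction family are convenient.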

5.2.3 The 32x8 bit multiplier

The ARM processor architecture provides a 32x8 bit multiplier, which exploits the 8-bit Booth’s algorithm to perform both signed and unsigned multiplication. The internal ARM hardware uses an extension of the radix-4 Booth’s algorithm to implement a high speed multiplier [7], using four carry save adder layers and some other combinational logic to perform 2-bit shifts. This complex structure is quite difficult to describe with the LISA language mechanisms, so this part of the model is described in behavioral style, leaving the burden of the structural implementation to the HDL synthesizer.

The signed operands must be in 2’s complement format and with this unit it is possible to obtain a result which respects either the signed or the unsigned convention. To obtain the product of two 32-bit operands it is necessary to execute a multistep operation, which at every step shifts left the value supplied by the multiplier by eight bits or a multiple of eight and then sums the partial results. The multiplier_op LISA operation describes the multiplier behavior with a C language statement which multiplies the 8-bit multiplier, extended to 32 bits, by the 32-bit multiplicand, without describing the operation in depth. The result is put on the output wire mul_result_w and this signal is assigned to the barrel shifter input within the multiply operations described in section 5.7.

Figure 5.6. LISARM 32x8 multiplier (signals shown: multiplicand_w, multiplier_w (8 bits), mul_sel, mul_ctrl, mul_result)

² Checking the overflow condition through the carry of the most significant bits is expensive in LISA, so the signs of the operands and of the result are taken into account.
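The multistep scheme of section 5.2.3 can be sketched in C as below. This is a behavioral illustration only, assuming four fixed steps and an unsigned 8-bit slice; the function mul32_multistep and its accumulate parameter are hypothetical, while multiplier_op borrows its name from the LISA operation.

```c
#include <stdint.h>

/* Behavioral 32x8 step: the 8-bit multiplier slice is extended to 32 bits. */
static uint32_t multiplier_op(uint32_t multiplicand, uint8_t multiplier8)
{
    return multiplicand * (uint32_t)multiplier8;
}

/* Low 32 bits of multiplicand * multiplier + accumulate, in four 32x8 steps. */
uint32_t mul32_multistep(uint32_t multiplicand, uint32_t multiplier,
                         uint32_t accumulate)
{
    uint32_t rd = accumulate;                            /* destination initialization */
    for (unsigned shift = 0; shift < 32u; shift += 8u) {
        uint8_t slice = (uint8_t)(multiplier >> shift);  /* next 8 multiplier bits */
        uint32_t partial = multiplier_op(multiplicand, slice);
        rd += partial << shift;                          /* barrel shifter, then ALU sum */
    }
    return rd;
}
```

Because every partial product is truncated to 32 bits before being summed, the result equals the low word of the full product, which is the same for signed and unsigned operands.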


5.3 Other LISA operations

The LISA operations of the model use some common suboperations to convert numeric values expressed in the assembly syntax and to access the register content directly. These operations use specific LISA language capabilities to return a suitably converted numeric value to the calling LISA operation. Moreover, no overhead is generated in the model, because the same logic is shared among all the units that require the conversion. The reg_index operation converts the unsigned numeric index used in the assembly syntax to express a source or target register and returns a 4-bit unsigned integer to the calling operation. The return value is used every time the register file must be accessed, in the same manner as an array element is accessed in C. On the other hand, when the content of a register must be read, another similar operation can be used. In this case the conversion is slightly different and includes the register identifier (its name) and not only its index. This operation also sets a global flag named pc_f, which signals to the write_result operation (described at the end of this section) that an access to the program counter (R15) has been requested.

A particular decoding operation is dedicated to the mechanism implemented by the ARM processor for the generation of some immediate values. The ARM assembler, in fact, accepts 32-bit immediate values as operands only if they can be expressed by an 8-bit immediate value and a 4-bit right rotation semi-amount. Starting from these two data, the barrel shifter recreates the original value: the immediate value is fed into the barrel shifter, which performs a right rotation by a number of bits equal to twice the amount value. Clearly only some particular integer values can be transformed into the so called immed8_r form, all powers of two, for example. The ARM assembler is able to check whether an immediate value can be transformed as described. If the assembly instruction is an arithmetic, move or compare operation, an alternative form can be exploited: since each of these operations has an opposite operation, the complemented operand is used (the bitwise complement for MOV/MVN, the negated value for ADD/SUB and CMP/CMN) and the operation mnemonic which appears in the assembly code is changed, i.e. ADD becomes SUB, MOV becomes MVN, CMN substitutes CMP and so on. The LISATek toolsuite does not allow the assembler to be developed in this manner but, to ensure compatibility with ARM compilers and assemblers, some tools were added to the generated toolchain (ref. section 6.4). In order to allow the barrel shifter to extract the original value, the immed8_r LISA operation decodes both the 8-bit immediate value and the 4-bit amount, properly sets the op2 and bs_amount pipeline registers and assigns the BS_ROR operation to the bs_op pipelined signal. The immediate values are recognized by calling two dedicated operations, discussed later in this section (5.3). Because the operand is put into the pipeline register, the registered_opd flag must also be set.
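The immed8_r mechanism can be made concrete with a small C sketch. The helper names (ror32, immed8_r_decode, immed8_r_encode) are hypothetical and only illustrate the rule stated above: an 8-bit value rotated right by twice a 4-bit amount.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotate right; n is reduced modulo 32. */
static uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v >> n) | (v << (32u - n)) : v;
}

/* Rebuild the 32-bit constant from the 8-bit value and the 4-bit semi-amount. */
uint32_t immed8_r_decode(uint8_t imm8, unsigned rot4)
{
    assert(rot4 < 16u);
    return ror32((uint32_t)imm8, 2u * rot4);
}

/* Return 1 and fill imm8/rot4 if value fits the immed8_r form, 0 otherwise. */
int immed8_r_encode(uint32_t value, uint8_t *imm8, unsigned *rot4)
{
    assert(imm8 != 0 && rot4 != 0);
    for (unsigned r = 0; r < 16u; r++) {
        /* Undo a right rotation by 2r with a left rotation by 2r. */
        uint32_t candidate = ror32(value, 32u - 2u * r);
        if (candidate <= 0xFFu) {
            *imm8 = (uint8_t)candidate;
            *rot4 = r;
            return 1;
        }
    }
    return 0;
}
```

For instance, 0xFF000000 is obtained from the 8-bit value 0xFF with semi-amount 4, whereas 0x00000101 cannot be expressed because its set bits do not fit in any 8-bit window placed at an even rotation.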



Other operations, referred to as immediate_value_<n>, are used to convert unsigned immediate values expressed in the assembly syntax, which represent operands. Their behavior is quite similar to the reg_index operation, but some of them find and remove the typical symbol “#” used in most assembly languages. These operations convert unsigned integer values of different bit sizes: the 4-bit rotate amount and the 8-bit immediate value for the immed8_r format constants, the 5-bit amount for the barrel shifter operations, the 12-bit immediate offset for the memory access instructions and a particular 16-bit integer value for the block data transfer register list coding. Since the immediate offset addressing mode accepts positive or negative integer values, the sign expressed in the syntax is not treated directly by the immediate_value_12 operation, which converts only the numeric value.

The write_result operation is used every time the result of an ALU operation has to be stored in a register and is activated directly by alu_operation (ref. section 5.2.2). By checking the branching and mem_access_f flags, this operation establishes whether the ALU result must be stored into mem_address_reg for a memory access, into the BPC register to perform a branch or into the register file. In the last case, the destination register is selected by the index stored in the reg2_i pipeline register and, if the PC (R15) is selected, a flush operation is scheduled³. Since some operations require a content transfer between the CPSR and a banked SPSR, the PSR_flag_assign pipelined flag is checked and the processor mode register is used to select the proper register bank. The same statements are used to perform the transfers required by the MSR and MRS instructions (section 5.6).
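The destination selection performed by write_result can be summarized with the following C sketch. The type names and the function write_result_select are hypothetical, and the priority between the two flags (memory access checked before branching) is an assumption, since the text does not state it.

```c
#include <assert.h>

typedef enum { DEST_MEM_ADDRESS_REG, DEST_BPC, DEST_REGISTER_FILE } wr_dest_t;

typedef struct {
    wr_dest_t dest;       /* where the ALU result is written */
    unsigned  reg_index;  /* meaningful only for DEST_REGISTER_FILE */
    int       flush;      /* 1 when a pipeline flush must be scheduled */
} wr_decision_t;

wr_decision_t write_result_select(int branching, int mem_access_f, unsigned reg2_i)
{
    assert(reg2_i < 16u);
    wr_decision_t d = { DEST_REGISTER_FILE, reg2_i, 0 };
    if (mem_access_f)
        d.dest = DEST_MEM_ADDRESS_REG;   /* address for a memory access */
    else if (branching)
        d.dest = DEST_BPC;               /* branch destination for the prefetch */
    else
        d.flush = (reg2_i == 15u);       /* writing R15 requires a pipeline flush */
    return d;
}
```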

5.4 The branch instructions

The ARM instruction set provides two types of branch instruction: a standard branch operation, without or with link (B or BL respectively), and a particular branch and exchange operation (BX), which provides the capability to switch between ARM and THUMB state. The BX instruction takes the register containing the branch destination address, which must be an absolute value. On the other hand, the B and BL instructions use a branch offset and perform a PC relative jump to another address, requiring an ALU operation to determine the branch destination.

When the decoding operation enters the branch operation group (branch_grp), it has to inspect the coding of both possible branch operations to select the right subsequent LISA operation to activate, i.e. B_dc or BX_dc. The BX_dc operation resolves the source register index and stores it in the output pipeline register, in order to allow the execution stage to retrieve the branch destination address. The BX_ex operation, belonging to the EX pipeline stage, is activated, so that the other actions needed to perform the branch are scheduled for the subsequent machine cycle. Entering the execution stage, the operation schedules the predefined pipeline flush operation⁴, to avoid the execution of invalid and unnecessary instructions already loaded in the pipeline. Furthermore it copies the branch destination address, read from the operand register, into the branch program counter (BPC), which is then used by the prefetch operation. By using the predefined SetRegion⁵ method, the CPSR T flag is set to the value of the operand least significant bit (Rn[0]) and this can cause a state switch, from ARM to THUMB or vice versa.

³ Two clock cycles are required for the pipeline refilling.

The standard branch instruction decoding is performed by the B_dc LISA operation, which is far more complicated than that of the BX instruction. Here the branch destination address must be determined from the PC content, and the offset can be expressed by an immediate offset, an assembly label or another symbolic name. To do so, an ALU operation must be scheduled and the related pipelined signals need to be set. The decoding operation selects the right operation to activate, between imm_branch_offset and symb_branch_offset. The former operation reads and converts a 24-bit immediate offset, expressed in the assembly code (section 3.5.3), shifts it left by two bits and sign extends⁶ the result to 32 bits. This value is then saved into the bs_opd pipeline register, whose output is directly connected to the barrel shifter input port. The symb_branch_offset operation performs the same actions, but also accepts an assembly symbol, hence it has to convert this information in a different manner. The LISA language provides some mechanisms to simplify these operations, which involve the instruction memory address itself and therefore the symbol table used by both the assembler and the linker. The B_dc operation, finally, inspects the link bit used in the coding to establish whether the address of the next instruction must be saved in the link register. The link bit value depends on the presence of the “L” label in the assembly syntax and, if the linkflag is set during the decoding step, it enables further operations. The next stage operation is then activated.

The B_ex operation, like BX_ex, schedules the pipeline flush and determines the branch destination address by adding the offset stored in the bs_opd pipeline register to the program counter address; hence R15 is used as the first operand and assigned to the alu_opd1 wire. The offset is assigned to the bs_opd_w wire and the other barrel shifter signals are set so as to avoid any shift operation⁷. The write_result_f flag is set and PSR_update_f is reset to avoid ALU flags modifications. The linkflag is checked to establish whether the PC has to be saved into the link register (R14). The barrel_shifter_op operation is then activated and, if the branch is with link, the BL_ex_poll operation is also scheduled. The former implies the sequential activation of alu_operation and write_result, so the ALU result is saved into the branch program counter, used by the prefetch operation to perform the jump. The link register value correction is performed by the BL_ex_poll operation over three consecutive machine cycles; for this reason, after the first activation through the BL_ex operation, it needs to reactivate itself by exploiting the polling mechanism. To do so, the EX/ED pipeline register is stalled during the first and second execution of the operation and, to respect the BL instruction cycle times, the correction is executed in the last cycle in which it is activated. The operation uses the branching_counter variable to establish which step of the branch instruction is running and in the third cycle sets the input signals for the barrel shifter and ALU units. Here the subtraction is selected by setting alu_op to the ALU_SUB value and selecting R14 as the first operand; the constant value “4” is used as second operand and must not be modified when passing through the shifter (LSL#0 is performed). The destination register is set to R14 and, also in this case, the CPSR flags update is not required. The barrel_shifter_op operation is activated, followed by the alu_operation and write_result operations, which perform the ALU operation and store the result in the link register.

Both <instruction>_dc LISA operations set the branching global flag, so that the prefetch operation can execute the statements dedicated to performing the jump and some other instructions introduced for model debugging. The branching flag is reset after some machine cycles by an ad hoc operation, activated by the prefetch operation.

⁴ The flush method allows single pipeline registers or whole stages to be cleared by expressing their names in a C++-like syntax.

⁵ The SetRegion method is a predefined LISA function which is able to modify groups of bits within a CXBit variable or register.

⁶ The sign extension is executed by masking the value with an ad hoc mask.

⁷ LSL#0 is performed, so that the barrel shifter output is identical to its input.
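The offset handling of the B and BL instructions can be sketched in C as follows. The function names are hypothetical; the mask-based sign extension is one possible reading of the ad hoc mask mentioned in the text, and r15 stands for the value read from the program counter at execution time.

```c
#include <assert.h>
#include <stdint.h>

/* 24-bit immediate -> 32-bit byte offset: shift left by two, then sign extend. */
uint32_t branch_offset(uint32_t imm24)
{
    assert(imm24 <= 0x00FFFFFFu);
    uint32_t offset = imm24 << 2;      /* 26 significant bits */
    if (offset & 0x02000000u)          /* bit 25 is the sign bit */
        offset |= 0xFC000000u;         /* extend it with a mask */
    return offset;
}

/* Branch destination: ALU addition of R15 and the offset (LSL#0 in the shifter). */
uint32_t branch_target(uint32_t r15, uint32_t imm24)
{
    return r15 + branch_offset(imm24);
}

/* Link register correction applied by the BL instruction: R14 - 4. */
uint32_t link_correction(uint32_t r14)
{
    return r14 - 4u;
}
```

Unsigned arithmetic is used throughout so that negative offsets wrap around modulo 2^32, as a 32-bit adder does.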

5.5 Data processing instructions

The ARM instruction set provides sixteen data processing instructions, which can be grouped into:

• Arithmetic operations: ADD, ADC, SUB, SBC, RSB, RSC.

• Logical operations: ORR, AND, EOR, BIC.

• Compare operations: CMP, CMN, TST, TEQ.

• Register move operations: MOV, MVN.

The first group provides the addition, subtraction and reverse subtraction operations in two versions, with or without the input carry or borrow affecting the calculation result. The compare operations are based on arithmetic (CMP, CMN) or logical (TST, TEQ) operations, but the result affects only the CPSR ALU flags and is not written. The details of the actions performed by the single operations are discussed in paragraph 3.5.4 and reported in table 3.4. For assembly syntax and coding aspects, the logical operations are grouped with the arithmetic instructions in arith_logic_grp. When the decoding operation enters the data processing operations group (data_proc_grp), it inspects the coding of the provided operations and activates the proper subgroup operation, which can be cmp_grp, mov_grp or arith_logic_grp. These groups differ considerably in the number and type of accepted arguments and in assembly syntax, but expect the same coding for the second operand, as discussed in section 3.5.4. The second operand decoding operation can be immediate_operand or shifted_reg_operand and it is activated by each of the group decoding operations reported above. The required operation is identified by inspecting the 4-bit opcode field; using this information the subgroup operations can activate the right LISA operation.

The cmp_grp LISA operation sets the PSR_update_f flag in the output pipeline register without checking any coding field, because for these instructions the CPSR update is implicit and the “S” suffix can be omitted in the assembly syntax. The behavior section stores the first operand index in the reg1_i pipeline register and then the activation section selects the proper <opcode>_dc operation, where opcode is one of the CMP, CMN, TST or TEQ mnemonics.

All the operations whose names are in the <opcode>_dc form execute the same actions: they set the alu_op pipeline register to select the ALU operation which will be performed when the instruction enters the execution stage. The last section of these operations activates the data_proc_setup operation, i.e. a set of statements which set up many of the signals that control the execution step, such as the ALU behavior and the register file writeback operation.

The mov_grp LISA operation sets the PSR_update_f flag only if the S-bit is set in the coding; this option is selected by adding the “S” suffix in the assembly syntax and causes the CPSR flags update at the end of the instruction execution. The destination register index is stored and then the activation section selects the subsequent decoding operation between MOV_dc and MVN_dc. The former sets the ALU so that the second operand is transferred to the destination register as it comes from the barrel shifter output; the latter requires the bitwise complement of that output. The ALU setup is done through the usual alu_op pipelined signal.

The arith_logic_grp LISA operation also sets the PSR_update_f flag only if the S-bit is set in the coding, so that the CPSR flags will be updated. The first operand and destination register indexes are stored in pipeline registers and then the activation section selects the subsequent <opcode>_dc decoding operation, where opcode is a mnemonic among ADD, ADC, SUB, SBC, RSB and RSC. Their meanings are reported in table 3.4 and the ALU operation is selected through the alu_op pipelined signal.
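The grouping of the sixteen opcodes can be illustrated with a short C sketch. The opcode values are those of the standard ARM data processing encoding (for example TST = 0x8, CMP = 0xA, MOV = 0xD, MVN = 0xF); the enum and function names are hypothetical and only mirror the three decoding groups of the model.

```c
#include <assert.h>

typedef enum { GRP_ARITH_LOGIC, GRP_CMP, GRP_MOV } dp_group_t;

/* Select the decoding subgroup from the 4-bit opcode field. */
dp_group_t data_proc_group(unsigned opcode)
{
    assert(opcode < 16u);
    switch (opcode) {
    case 0x8: case 0x9: case 0xA: case 0xB:   /* TST, TEQ, CMP, CMN */
        return GRP_CMP;
    case 0xD: case 0xF:                       /* MOV, MVN */
        return GRP_MOV;
    default:                                  /* AND, EOR, SUB, RSB, ADD, ADC,
                                                 SBC, RSC, ORR, BIC */
        return GRP_ARITH_LOGIC;
    }
}

/* Flag update request: implicit for compare instructions, S-bit otherwise. */
int psr_update_requested(dp_group_t group, unsigned s_bit)
{
    return group == GRP_CMP || (s_bit & 1u) != 0u;
}
```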



The second operand decoding operation is activated by every data processing instruction, in order to set the control signals for the barrel shifter unit, through which the operand must pass before entering the ALU. To do so the selected operation can be immediate_operand or shifted_reg_operand. The first LISA operation directly activates the immed8_r operation without executing other statements, but its coding section is necessary to add the I-bit value (fig. 3.7), which distinguishes an immediate offset from a registered offset. The immed8_r operation itself was already described in section 5.3.

If the second operand is a registered value, the I-bit value in the coding is zero and the shifted_reg_operand operation is activated. The operation stores the second operand register index in the reg2_i pipeline register and sets the registered_opd flag. By inspecting the instruction coding, a further operation among shifted_reg, non_shifted_reg, RRX or reg_amount_shifted_reg is activated, in order to set the barrel shifter control signals in the pipeline registers.

The shifted_reg LISA operation assigns the amount for the barrel shifter operation to the dedicated pipeline register; to do so a 5-bit immediate conversion function is used. All the other settings are executed by the barrel_shifter_op_dc operation (section 5.2.1), which is then activated. If the registered operand value must not be shifted in any manner, the operand has to pass through the barrel shifter and reach the ALU second operand input without alterations. To do so the barrel shifter control signals must be set to perform a LSL#0 operation. The non_shifted_reg operation is provided to assign the BS_LSL value to bs_op and to reset the bs_amount pipeline register and the other special barrel shifter operation flags.

The RRX LISA operation is activated if the homonymous barrel shifter operation is required and it acts like the non_shifted_reg operation discussed above, with the only difference that it assigns the BS_ROR value to the bs_op pipelined signal. This is because the RRX operation exploits the coding of the ROR#0 instruction, as described at page 70.

Since the shift amount can be stored in a register, the reg_amount_shifted_reg operation sets the registered_bs_amount flag and stores the source register index in the bs_amount_reg pipeline register. This situation implies a particular execution unit behavior, because the datapath must be used to read the source register in a first cycle and to perform the barrel shifter operation in the subsequent one. To do so data_proc_setup is activated and is thus scheduled for the subsequent clock cycle (belonging to the EX stage). The register file access is done by the shift_amount_reg_access LISA operation under the control of the scheduled operation; its statements execute a set of assignments to the bs_special_op_f and bs_amount32_f pipelined flags and to the bs_amount register, setting up the barrel shifter unit for the next cycle operation. The values assigned depend on the shift amount register content and respect the ARM behavior described in the data sheet [16] and at page 70.
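The dependence on the shift amount register content can be illustrated for the two logical shifts with the following C sketch. It follows the ARM rules for register-specified amounts (only the least significant byte of the register is used, with distinct results for 0, 1 to 31, 32 and larger values); the function names are hypothetical and the mapping of these cases onto the bs_special_op_f and bs_amount32_f flags is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t value; unsigned carry; } shift_out_t;

/* LSL with the amount taken from a register (Rs). */
shift_out_t lsl_by_register(uint32_t rm, uint32_t rs, unsigned c_flag)
{
    assert(c_flag <= 1u);
    unsigned amount = rs & 0xFFu;           /* only Rs[7:0] is used */
    shift_out_t out = { rm, c_flag };       /* amount 0: operand and C unchanged */
    if (amount >= 1u && amount < 32u) {
        out.value = rm << amount;
        out.carry = (rm >> (32u - amount)) & 1u;
    } else if (amount == 32u) {
        out.value = 0u;
        out.carry = rm & 1u;                /* bit 0 is the last bit shifted out */
    } else if (amount > 32u) {
        out.value = 0u;
        out.carry = 0u;
    }
    return out;
}

/* LSR with the amount taken from a register (Rs). */
shift_out_t lsr_by_register(uint32_t rm, uint32_t rs, unsigned c_flag)
{
    assert(c_flag <= 1u);
    unsigned amount = rs & 0xFFu;
    shift_out_t out = { rm, c_flag };
    if (amount >= 1u && amount < 32u) {
        out.value = rm >> amount;
        out.carry = (rm >> (amount - 1u)) & 1u;
    } else if (amount == 32u) {
        out.value = 0u;
        out.carry = rm >> 31;               /* bit 31 is the last bit shifted out */
    } else if (amount > 32u) {
        out.value = 0u;
        out.carry = 0u;
    }
    return out;
}
```

Amounts of 32 or more cannot be expressed in the 5-bit bs_amount register, which is why dedicated flags are needed when the amount comes from a register.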



As discussed above, when a data processing instruction must be executed, the data_proc_setup operation is activated in the execution pipeline stage, so that it is executed when the instruction enters the stage. Since the instruction execution may take more than one clock cycle, this operation exploits the LISA polling capabilities and, depending on the registered_bs_amount value, it can stall the EX/ED stage to allow its own reactivation in the subsequent machine cycle. After checking the condition valid flag, in fact, the operation evaluates whether the barrel shifter amount must be read from a register and, if so, the whole pipeline is stalled for one clock cycle. The polling activation variable is reset to avoid a new activation in the subsequent clock cycle and, as discussed above, shift_amount_reg_access is called. At the reactivation (second clock cycle) the setup operation activates the barrel shifter unit and assigns the registered control signals and operands to the internal wires connected to the ALU, so that the data processing instruction can be performed. If the instruction belongs to the compare group the write_result_f flag is reset, otherwise it is set to ensure that the writeback operation is performed (section 5.2.2). The barrel_shifter_ex operation is finally activated, so that the datapath components can operate.

When the processor executes a data processing instruction and the destination register is the program counter (R15), two further clock cycles are needed for the execution. This is because a pipeline flush is necessary and the pipeline refilling needs two machine cycles before the newly addressed instruction can be executed. To do so, the LISA operation schedules a flush operation on the PF/FE and FE/DC pipeline registers for the subsequent clock cycle.

5.6 PSR transfer instructions

The PSR transfer instructions are used to access the processor status registers CPSR and SPSR directly, to modify their values or to save them to general purpose registers. The MRS instruction copies the CPSR or the SPSR_<mode> content into the destination register expressed in the assembly syntax. On the other hand, the MSR instruction moves a register content into the CPSR or into the banked SPSR, respecting the processor operating mode. The latter instruction also allows the modification of the CPSR or SPSR ALU flags (ref. section 3.3.5), by using a registered quantity or an immediate value which can be expressed in the immed8_r format. All the PSR transfer instructions execute in a single machine cycle, hence no polling operations are used for the behavior modeling.

The decoding starts with the PSR_access_grp LISA operation, which directly activates another operation among MRS_dc, MSR_dc and MSR_flg_dc. The MRS_dc operation converts the Rd destination register index and selects the PSR source register by checking the P-bit in the coding. The PSR selection allows different assembly formats, hence many suboperations are used to define the various syntax sections, even if the resulting coding is the same. The PSR transfer involves the datapath for the register file access, so the operation sets all the necessary signals to avoid data modifications; the write_result_f flag is also set. The PSR_select pipelined flag is used to signal to the execution unit whether to transfer the current status register (CPSR) or the saved status register (SPSR), and the MRS_ex operation is then activated. Entering the execution stage, the MRS_ex operation takes control, checks that the condition is satisfied and evaluates the PSR_select pipeline flag to establish which PSR must be assigned to the barrel shifter input bs_opd. If the SPSR_<mode> has to be transferred, the processor mode is evaluated to access the right register bank. The datapath operations are enabled by activating barrel_shifter_op and consequently the write_result operation is activated at the end of the ALU operation.

If the opposite operation must be performed, MSR_dc is activated for the decoding step; the operation converts the Rm source register index and selects the PSR destination register by checking the P-bit. For the PSR selection the previous considerations are valid, and the datapath and write_result_f flag setup is also the same. The PSR_transfer_f pipelined flag is used to signal to the write_result operation that a PSR is selected for the transfer, and the PSR_select flag is used to signal to the execution unit whether to transfer the current status register (CPSR) or the saved status register (SPSR). The MSR_ex operation is activated before exiting. Entering the execution stage, the MSR_ex operation takes control and evaluates the PSR_select pipeline flag to establish which PSR must be assigned to the barrel shifter input. If the SPSR_<mode> has to be transferred, the processor mode is evaluated to perform the register bank selection. The datapath operations are enabled by activating barrel_shifter_op and consequently the write_result operation is activated at the end of the ALU operation.

To modify only the ALU flags within the current or saved PSR, the MSR_flg_dc operation is activated. The operation works like MSR_dc for the recognition of the selected PSR but activates either MRS_reg_op or immediate_operand to decode the source operand. The former operation converts the register index of the Rm source operand and assigns it to the reg2_i pipeline register; the registered_opd flag is also set for the datapath configuration. The immediate_operand operation directly activates the immed8_r operation, without executing other statements (its coding section is necessary to add a bit which distinguishes an immediate offset from a registered offset). The immed8_r operation is described in section 5.3. A dedicated pipelined flag (PSR_flag_assign) is set to tell the write_result operation that only the most significant bits of the PSR are affected by the data transfer, so a masking operation must be executed.

When the MSR instruction enters the execution pipeline stage, the MSR_flg_ex operation, activated by the MSR_flg_dc operation, takes control and sets up the datapath according to the relevant pipelined signals; the usual barrel_shifter_op operation is activated before exiting. The ALU result is used by the write_result operation, which masks it appropriately and executes a bitwise “or” operation with the unaffected bit values read from the selected PSR; the obtained 32-bit value is then written back to the same PSR.
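The masking step of the flag-only transfer can be sketched in C as below. The mask value assumes that the affected most significant bits are the four condition flags N, Z, C and V in bits 31 to 28; both the macro and the function name are hypothetical.

```c
#include <stdint.h>

/* Assumed flag field: N, Z, C, V in bits 31..28 of the PSR. */
#define PSR_FLAGS_MASK 0xF0000000u

/* Merge the new flag bits with the unaffected bits read from the selected PSR. */
uint32_t psr_write_flags(uint32_t psr, uint32_t alu_result)
{
    uint32_t new_flags  = alu_result & PSR_FLAGS_MASK;   /* keep only the flag bits */
    uint32_t unaffected = psr & ~PSR_FLAGS_MASK;         /* control bits stay as they are */
    return new_flags | unaffected;
}
```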

5.7 Multiplication instructions

The ARM processor provides two multiplication instructions for 32-bit operands, which can be either signed or unsigned. The signed operands must be expressed in 2’s complement notation. One of the instructions (MUL) performs the multiplication returning a 32-bit result only, so the sign of the operands does not matter and can be omitted in the instruction syntax. The MULL instruction, instead, returns a 64-bit result by storing it into two registers (a high and a low register) expressed in the instruction syntax. To perform the operation the ARM processor uses the 8-bit Booth’s algorithm and accumulates the partial sums over a variable number of cycles. In this model, as discussed in section 5.2.3, the 32x8 multiplier block is not described in depth but only in behavioral style. Thanks to the multiplying method, the instructions can save some machine cycles: the most significant bits are inspected in 8-bit groups and, if they are all ones or all zeros, some multiplication steps are not performed (refer to section 3.5.6). The instructions also provide the accumulate option, so that a previously registered result can be added to the multiplication result, saving an ADD instruction. All the instructions accept four registers in their coding: Rs and Rm are always the multiplier and multiplicand respectively, Rd represents the destination register in the standard instruction version and the high register in the long multiply instruction, Rn is the accumulator in MUL/MLA and the low register in the long version. The long multiply instruction also uses the destination register as accumulator, so a previous value must be stored there. The S-bit is inspected to establish whether the CPSR flags must be updated and this information is stored in the PSR_update_f pipelined flag. The multiply_grp decoding operation inspects the condition field and converts the indexes of the involved registers, assigning the pipeline signals for the execution stage. As discussed above, the register usage changes between the multiply instructions and their long versions but, in the coding scheme, their positions are the same. Since a fourth register is used, its index is assigned to the reg3_i pipeline register. The other coding bits are evaluated to select the subsequent LISA operation to be activated.

If the MUL or the MLA form of the instruction has to be executed, the MUL_dc operation is selected. The operation must give Rd = Rm ∗ Rs if the A-bit is zero and Rd = Rm ∗ Rs + Rn if it is set, which can save an explicit ADD instruction in some circumstances; to signal this condition to the EX stage the mul_acc_f pipeline flag is set accordingly. The datapath signals are set up to initialize the destination register during the first execution cycle, setting all its bits to zero if the accumulate flag is reset or assigning the Rn register value otherwise. To do so the ALU_MOV operation is assigned to the ALU alu_op input. The BS_LSL value is assigned to the bs_op pipeline register and remains the same for all the steps; the shift amount register bs_amount, instead, starts from zero in the second step and is incremented by eight units at every subsequent reactivation of the operation (see footnote 8). The write_result_f pipeline flag is also set, in order to save the partial multiplication result in the destination register. The MUL_ex operation is then activated, so it takes control when the instruction enters the execution stage. The operation exploits the LISA polling mechanism to ensure its reactivation during subsequent cycles, until all the multiplication steps have been performed. The first execution cycle only initializes the destination register and schedules a pipeline stall for the subsequent cycle. At the first reactivation, the least significant bits of the multiplier are selected and assigned to the mul_opd8 8-bit input of the 32x8 multiplier. The multiplicand value is assigned to mul_opd32 and does not change any more. At every cycle the 32x8 multiplier result is assigned to bs_opd, shifted by the required number of bits and sent to the ALU, which adds it to the partial result held in the destination register. To activate the multiplier, the multiplier_op operation described in section 5.2.3 is activated. The barrel_shifter_op operation is activated by the same operation.
For the reasons explained above, at every cycle the most significant bits of the multiplier are inspected to establish whether the subsequent step has to be performed. At the first reactivation of the operation the bits [31:8] are masked and evaluated: if they are all zeros or all ones no further stall cycle is scheduled, otherwise one is. To establish which step is in execution the cycle_c counter is used and incremented at every cycle; this information is also used to perform the left shift of the partial 32x8 multiplication result, which must be added at every cycle to the partial result stored in the destination register. The various datapath configuration signals are simply transferred from the pipeline to the units and the barrel_shifter_op operation is activated. At the second reactivation the multiplier bits to be inspected are [31:16] and at the third [31:24]. If any of these groups is all ones or all zeros the last cycle is executed immediately, otherwise the 32x8 multiplication operation is activated and executed, selecting the right group of bits to send to the 8-bit multiplier input.
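The early-termination rule described above can be illustrated with a short behavioral sketch in Python (it is not LISA code). The function names are hypothetical, and the final correction applied when the skipped bits are all ones is an assumption introduced here so that the truncated 32-bit result stays correct; the text does not detail how the model handles that case.

```python
MASK32 = 0xFFFFFFFF

def mul_steps(rs):
    """Count the 8-bit multiplier steps needed for the multiplier rs."""
    rs &= MASK32
    for step in range(1, 4):
        upper = rs >> (8 * step)                  # bits [31:8*step]
        all_ones = (1 << (32 - 8 * step)) - 1
        if upper == 0 or upper == all_ones:
            return step                           # remaining groups are skipped
    return 4

def mul_model(rm, rs, rn=0, accumulate=False):
    """Return (Rd, steps) for MUL/MLA, truncated to 32 bits."""
    rm &= MASK32
    rs &= MASK32
    rd = (rn & MASK32) if accumulate else 0       # first cycle: initialise Rd
    steps = mul_steps(rs)
    for step in range(steps):
        chunk = (rs >> (8 * step)) & 0xFF         # 8-bit multiplier group
        partial = rm * chunk                      # 32x8 multiplier result
        rd = (rd + (partial << (8 * step))) & MASK32  # LSL by 0, 8, 16, 24
    if steps < 4 and (rs >> (8 * steps)) != 0:
        # Assumption: skipped bits all ones means a sign-extended multiplier,
        # so one shifted copy of the multiplicand is subtracted as correction.
        rd = (rd - (rm << (8 * steps))) & MASK32
    return rd, steps
```

With this sketch a small multiplier such as 5 needs a single step, while a multiplier with significant bits in its top byte needs all four.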

8 The multiplication mechanism is described in section 5.2.3.

The behavior of the MULL and MLAL instructions is modeled by the MULL_dc and MULL_ex operations, in the decoding and execution phases respectively. The first operation assigns the same pipeline signals as the MUL_dc operation and, in order to establish whether the sign of the operands must be considered, the signed_f flag is set or reset according to the U-bit defined in the coding. The MULL_ex operation is then activated, so it takes control when the instruction enters the execution stage. The operation must give {RdHi,RdLo} = Rm ∗ Rs if the A-bit is zero and {RdHi,RdLo} = Rm ∗ Rs + {RdHi,RdLo} if it is set; to signal this condition to the EX stage the mul_acc_f pipeline flag is set accordingly. The datapath signals are set up to initialize the two destination registers during the first execution cycle, setting all their bits to zero if the accumulate flag is reset. To do so the ALU_MOV operation is assigned to the ALU alu_op input. The BS_LSL value is assigned to the bs_op pipeline register and remains the same for all the steps; the shift amount register bs_amount, instead, starts from zero in the third step and is incremented by eight units at every subsequent reactivation of the operation. All the other operations performed are similar to those seen for the ordinary multiplication, but the result is stored in two registers instead of one. To do so, some local variables and some masks are used within the LISA operations, in order to ensure the correct implementation of the multiplication algorithm.

All the multiplication instructions, if the PSR_update_f pipeline flag is set, have to update some of the CPSR flags according to the result obtained. Only the N (Negative) and Z (Zero) flags are meaningfully set (N is made equal to bit 31 or bit 63 of the result, and Z is set if and only if the result is zero). The C (Carry) and V (oVerflow) flags are left unaffected, since the ARM processor assigns them meaningless values.
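As an illustration of the expected results only, the long multiplication and the N and Z flag update can be sketched in Python as follows. The function and parameter names are hypothetical; the sketch computes the final 64-bit value directly and does not reproduce the cycle-by-cycle 8-bit steps of the model.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a 2's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mull_model(rm, rs, signed, rdhi=0, rdlo=0, accumulate=False):
    """Return (RdHi, RdLo, N, Z) for a long multiply (with optional accumulate)."""
    if signed:
        a, b = to_signed32(rm), to_signed32(rs)
    else:
        a, b = rm & MASK32, rs & MASK32
    # With the accumulate option the previous {RdHi,RdLo} value is added.
    acc = (((rdhi & MASK32) << 32) | (rdlo & MASK32)) if accumulate else 0
    result = (a * b + acc) & MASK64
    n_flag = (result >> 63) & 1        # N equals bit 63 of the result
    z_flag = int(result == 0)          # Z set only if the whole result is zero
    return result >> 32, result & MASK32, n_flag, z_flag
```

The same bit pattern 0xFFFFFFFF gives different high words depending on the signed parameter, which is the reason why the sign information must reach the execution stage.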

5.8 Single data transfer instructions

The processor provides two fundamental instructions for memory access; the transfer involves a general purpose register (program counter included) and a memory location. A “load” and a “store” instruction are provided and both word and byte sized data transfers can be performed. A variant of these operations allows the sign extension of byte and half-word sized data, in order to support signed and unsigned sub-word data types. Many addressing modes are provided, including a PC-relative version; the ordinary operations also accept a shifted register offset, while the signed/unsigned versions of the instructions do not allow a barrel shifter modification of the register offset. Immediate offsets are also accepted; the difference between ordinary and signed/unsigned instructions is in the bit width, 12 bits for the former and only 8 bits for the latter. The complete set of signals for the memory management unit is defined in the model, in order to respect the memory model boundaries. Big endian memory format management is not yet implemented, for reasons connected to the development environment. The LISA operations which implement the signed and unsigned data types are described at the end of the paragraph, recalling what is discussed for the ordinary ones.

The main decoding operation is mem_access_grp, which directly activates one of three possible suboperations: std_data_grp includes the LDR and STR instructions, su_data_grp includes the signed and unsigned data versions, and block_data_grp provides the support for the multiple data transfer (stacking) operations described in paragraph 5.9.

The std_data_grp operation sets all the pipeline signals for the address calculation, which will be performed by the execution stage in the subsequent clock cycle. The operations can access the memory in user mode or in privileged mode, so the privileged_mode_access_f pipeline flag must be set or reset to allow the memory management unit to perform the corresponding actions. The source or destination register is selected by storing its index in the regd_i pipeline register and the transfer data size is selected by the byte_access_f pipeline flag. By inspecting the load/store bit in the coding, the operation activates the LDR_ex or the STR_ex operation, which belong to the execute pipeline stage. Using the LISA coding inspection mechanism, one of the address decoding operations is activated: the ARM processor accepts three addressing modes (PC-relative, pre-indexed and post-indexed), plus a form with no offset specified.
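The coding bits inspected during this decoding step can be sketched in Python as follows. The bit positions are those of the standard ARM single data transfer encoding; the function names and the dictionary keys are hypothetical and do not correspond to the LISA coding tree of the model.

```python
def decode_single_data_transfer(word):
    """Extract the fields of an ARM LDR/STR instruction word."""
    return {
        "cond": (word >> 28) & 0xF,           # condition field
        "register_offset": (word >> 25) & 1,  # I-bit: 1 = shifted register offset
        "pre_indexed": (word >> 24) & 1,      # P-bit
        "up": (word >> 23) & 1,               # U-bit: 1 = add the offset
        "byte": (word >> 22) & 1,             # B-bit: 1 = byte transfer
        "writeback": (word >> 21) & 1,        # W-bit
        "load": (word >> 20) & 1,             # L-bit: 1 = load, 0 = store
        "rn": (word >> 16) & 0xF,             # base register index
        "rd": (word >> 12) & 0xF,             # source/destination register index
        "offset12": word & 0xFFF,             # immediate offset (when I-bit is 0)
    }

def select_ex_operation(fields):
    """Choose the execute-stage operation from the load/store bit."""
    return "LDR_ex" if fields["load"] else "STR_ex"
```

For example, the word 0xE5910004 (LDR r0, [r1, #4]) decodes as a pre-indexed load with an added offset of 4 and no writeback.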

If the offset is not specified, the zero_offset LISA operation is activated, so that only the base register index is stored in the reg2_i pipeline register; to signal this to the execution unit, the registered_opd pipeline flag is set. By default this addressing mode is considered a pre-indexed access, but the writeback operation is not scheduled, so the pre_npost_indexed_f and writeback_f pipeline flags are set and reset respectively. The actions performed during the first clock cycle aim to transfer the base register content into mem_address_reg, which is connected to the address bus during data load or store operations. To inhibit any barrel shifter operation the BS_LSL value is assigned to the bs_op pipeline signal and the shift amount register is set to zero. The base register value must also cross the ALU without modification, so the alu_op pipeline signal is set to ALU_MOV.

If the assembly instruction does not contain a register indication, the addressing mode is PC-relative and the decoding operation program_relative converts the 12-bit immediate offset by calling the immediate_value_12 operation. The converted value is then assigned to the op2 pipeline register. With respect to the zero_offset operation, the pipeline signal setup differs only in the ALU operation selected, because the fetch address contained in R15 must be used to calculate the memory access address. By inspecting the up/down bit value either ALU_ADD or ALU_SUB is selected, respecting the sign of the immediate offset expressed in the instruction syntax. Since the program counter is used as the base register, the writeback operation is not permitted in this addressing mode, so the relative flag and the pre_npost_indexed_f signal are assigned in the same way as for zero_offset.

Some more details need to be discussed for the other LISA operations, which decode the pre-indexed and post-indexed memory access modes. The pre_indexed operation assigns the base register index to the reg1_i pipeline register and, by inspecting the W-bit, sets the writeback_f pipeline flag if the writeback operation is requested by the assembly syntax. The operation then activates the appropriate operation, either immediate_offset or shifted_reg_offset, in order to resolve the flexible offset encoding [19].

The immediate_offset operation executes the same operations seen for the PC-relative mode: it converts the 12-bit immediate offset and stores its value in a pipeline register. The sign of that value determines the up/down bit in the instruction coding, and this information affects the ALU operation selected through the dedicated pipeline signal: ALU_ADD for a positive offset and ALU_SUB otherwise.

The behavior of the shifted_reg_offset operation is quite different; it is activated when a register shift operation is needed to perform the offset calculation. Here all the barrel shifter operations for the second operand modification, discussed in section 5.5, are allowed, except those with a register-specified shift amount. The up/down bit in the coding depends on the assembly syntax and selects the ALU operation to be performed in the execution step. The operation stores the Rn index in the op2 pipeline register and a bit of the coding is set to select this operation instead of the immediate_offset one. Through the reg_shift label, a barrel shifter operation, either shifted_reg or RRX, is selected and activated. How they set up the barrel shifter signals is discussed in the data processing section 5.5.

The post_indexed operation accepts the flexible offset syntax and coding, like the operation dedicated to pre-indexed accesses, and also the base register indication seen above, but it does not accept the writeback request, because in this mode the base register is always updated implicitly. Some differences in the assembly syntax must be underlined, concerning the position and use of the square brackets, as reported in section 3.5.8. For these reasons the pipeline signal setup for the execution units does not differ from that of the pre_indexed LISA operation.

When the memory access operation enters the execution stage, the scheduled operation takes control. If the instruction register contains a store instruction, STR_ex is activated and, at the first execution cycle, it reschedules itself by using the LISA polling mechanism. Before accessing the memory, in fact, the memory address for the access has to be calculated, so barrel_shifter_ex is activated on exiting the operation. The memory access is performed in the subsequent machine cycle and the cycle_c counter (see footnote 9) is used to discriminate between the cycles. The pipeline is completely stalled to allow the described execution steps to be performed. The barrel_shifter_ex operation activates the alu operation, which in turn activates the write_result operation; this checks the mem_access_cycle_f flag to establish whether the ALU result must be stored into mem_address_reg instead of affecting the register file. At the reactivation, the operation sets up the memory interface signals, assigning BS, nRW and MAS and taking into account the state of the byte_access_f pipeline flag. If the writeback_f pipeline flag is set, the writeback_op operation is activated, so that the mem_address_reg value is written back to the Rn register, by setting the datapath in the usual manner and using the reg1_i pipeline register value for register file indexing. nMREQ and SEQ are defined step by step by the prefetch operation, because they play their role at every clock cycle and at every memory access, i.e. an instruction fetch or a data transfer operation. The address bus (A) and data bus (D or DOUT) value assignments are performed by a specific LISA operation, which also reads the source register value, using the regd_i pipeline register content as register file index.
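The address calculation and the base register writeback for the addressing modes discussed above can be summarized by the following Python sketch. It follows the architectural definition of the ARM addressing modes and is only an illustration: the function name is hypothetical and the internal sequencing of the LISA operations (barrel shifter, ALU, mem_address_reg) is not reproduced.

```python
MASK32 = 0xFFFFFFFF

def transfer_address(base, offset, up, pre_indexed, writeback):
    """Return (access_address, new_base) for a single data transfer."""
    # The up/down bit selects an addition (ALU_ADD) or a subtraction (ALU_SUB).
    indexed = (base + offset if up else base - offset) & MASK32
    if pre_indexed:
        address = indexed                           # offset applied before the access
        new_base = indexed if writeback else base   # base updated only on request
    else:
        address = base & MASK32                     # post-indexed: unmodified base used
        new_base = indexed                          # base always updated afterwards
    return address, new_base
```

A zero offset with pre-indexing and no writeback reduces to a plain access at the base register value, which corresponds to the zero offset form.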

The behavior of the data load instruction LDR is modeled by the LDR_ex operation, which performs a set of operations similar to the STR_ex store operation but takes one more machine cycle to complete. In the first cycle the address is calculated in the same way seen above, during the second cycle the memory interface signals are set up, and in the third cycle the data bus is sampled to store the value supplied by the memory management unit. To discern which step of the multicycle operation is in execution the cycle_c counter is used, and in the first two steps a stall operation is scheduled, in order to allow the reactivation of the LDR_ex operation. At the first reactivation (second execution cycle), the operation activates the writeback_op operation if needed and sets up the memory interface signals, assigning BS, nRW and MAS for the access. The address bus (A) value is assigned by another specific LISA function, designed expressly for memory read operations. In the third step the data bus (D or DIN) is accessed, and the sampled value is saved to the mem_data_reg global register. The same value is written to the Rd register by setting the datapath signals appropriately, so that no data modifications are performed. If R15 is selected, two further clock cycles are needed for the execution: a pipeline flush is necessary and the pipeline refilling needs two machine cycles before the newly addressed instruction can be executed. To ensure this behavior the write_result LISA operation schedules a flush operation on the PF/FE and FE/DC pipeline registers for the subsequent clock cycle.
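The multicycle sequence described for the store and load operations can be listed step by step with a small Python sketch. The function name and the step descriptions are hypothetical labels chosen here; only the number and order of the cycles are taken from the description above.

```python
def ldr_str_steps(is_load, rd_index):
    """List the execute-stage cycles of a single data transfer."""
    steps = [
        "calculate address into mem_address_reg",   # cycle_c = 0
        "set up BS, nRW, MAS and drive the bus",    # cycle_c = 1
    ]
    if is_load:
        steps.append("sample data bus and write Rd")  # cycle_c = 2
        if rd_index == 15:
            # Loading the PC flushes the pipeline: two refill cycles follow.
            steps += ["pipeline refill", "pipeline refill"]
    return steps
```

Under these assumptions a store takes two cycles, an ordinary load three and a load into R15 five.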

9 cycle_c is a 2-bit counter used by all the memory access LISA operations that require more than one clock cycle to complete.

If a load or store instruction involves halfword sized data, or when a load instruction requires the sign extension of the transferred data, the su_data_grp operation is directly activated by the mem_access_grp one. Here different suffixes are used to express the data size and the signed version of the instruction, and two bits of the coding scheme, the S-bit and the H-bit, carry this information. A new pipeline signal, signed_f, is used to set up the execution stage; it is set if the S-bit is active. The coding of these four operations differs little from that of the standard STR/LDR operations and the addressing modes are substantially the same as described above. The immediate offset accepted is only 8 bits wide and no shift operation is allowed for register offset values. The coding of the unsigned immediate value is divided into two nibbles (4-bit groups), but the bit splitting capabilities of the LISA language permit the correct conversion of the information and its storage into a pipeline register reserved for ALU operands. All the other bits, like the up/down bit (U-bit), the writeback flag (W-bit), the load/store bit (L-bit) and the pre/post indexing bit (P-bit), maintain the same position. The behavior of the decoding operation is similar to the analogous section in std_data_grp: it activates one operation among zero_offset, program_relative, pre_indexed and post_indexed in order to recognize the addressing mode and set up the ALU pipeline signals. These operations are the same discussed above but the last two of them, checking signed_f, set the barrel shifter to perform BS_LSL with zero amount, to avoid modifications of the register offset.
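The reconstruction of the split 8-bit immediate and the extraction of the S-bit and H-bit can be sketched in Python as follows. The bit positions follow the standard ARM halfword and signed data transfer encoding; the function name and dictionary keys are hypothetical and the LISA bit splitting syntax is not reproduced.

```python
def decode_halfword_transfer(word):
    """Extract the fields of an LDRH/STRH/LDRSB/LDRSH instruction word."""
    immediate_form = (word >> 22) & 1
    # The 8-bit offset is split into two nibbles: bits [11:8] and bits [3:0].
    offset8 = (((word >> 8) & 0xF) << 4) | (word & 0xF)
    return {
        "pre_indexed": (word >> 24) & 1,    # P-bit
        "up": (word >> 23) & 1,             # U-bit
        "immediate": immediate_form,        # 1 = immediate offset, 0 = register
        "writeback": (word >> 21) & 1,      # W-bit
        "load": (word >> 20) & 1,           # L-bit
        "rn": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,
        "signed": (word >> 6) & 1,          # S-bit
        "halfword": (word >> 5) & 1,        # H-bit
        "offset8": offset8 if immediate_form else None,
    }
```

For example, the word 0xE1D101B2 (LDRH r0, [r1, #0x12]) yields an unsigned halfword load whose offset nibbles 0x1 and 0x2 are joined into 0x12.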

When an instruction among LDRH, STRH, LDRSB and LDRSH enters the execution stage, the su_memory_access operation, activated by su_data_grp in the decoding stage, takes control and performs the same operations described for LDR_ex and STR_ex, with some differences due to the masking and addressing prescriptions for sub-word sized data. Here the STRH store instruction has to drive only the sixteen data bus lines involved by the source value, so the other bits are reset by a masking operation applied directly to the mem_data_reg register. The same holds for the LDRH load operation, where the proper mask is applied to the mem_data_reg register to zero all the bits which are not significant, before the register file update. The signed data load operations, instead, have to check the most significant bit of the value loaded into the mem_data_reg register to choose the right mask to apply, in order to perform the sign extension expected by the instruction for the data size involved.
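The masking and sign extension applied to the loaded value can be sketched in Python as follows. The mask names and the function name are hypothetical (the actual masks are defined in the model header file and are not shown in the text), and the sub-word data is assumed to be already aligned to the least significant bits.

```python
BYTE_MASK = 0x000000FF       # hypothetical name for the single byte mask
HALFWORD_MASK = 0x0000FFFF   # hypothetical name for the half-word mask

def load_subword(mem_data_reg, halfword, signed):
    """Mask a sampled value and sign-extend it to 32 bits when requested."""
    mask = HALFWORD_MASK if halfword else BYTE_MASK
    value = mem_data_reg & mask              # zero all the non-significant bits
    sign_bit = 0x8000 if halfword else 0x80  # most significant bit of the data
    if signed and (value & sign_bit):
        value |= ~mask & 0xFFFFFFFF          # fill the upper bits with ones
    return value
```

The unsigned loads only clear the upper bits, while the signed loads replicate the most significant bit of the sub-word over the upper part of the register.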

When using the memory interface signals there is no need to reset their values, because the nMREQ signal is active only when a memory access is needed and is directly managed by the main operation, which sets its value appropriately cycle by cycle. The MAS signal is always assigned by using values defined in the model header file, i.e. wordsize, halfwordsize and bytesize. When sub-word sized data is transferred, a masking operation is performed to avoid data overwriting. This is necessary for data sampled from the data bus and stored to the register file, and is useful when the data bus is driven by the processor. In some respects this method reduces the system power consumption but, according to the ARM7 specification, the replication of sub-word sized data must be performed. This problem is discussed in paragraph 6.2, which deals with the memory wrapping. All the masks used for single bytes and half-words are defined in the header file, and some tricks are used to provide the memory address expected by the processor specification. These aspects are also discussed in section 6.2.

5.9 Block data transfer instructions

The block data transfer operations load (LDM) and store (STM) a set of the general purpose registers from or to memory; they are designed to create and manage memory stacks in a very flexible manner. The instructions support all the possible stacking modes: starting from the base register value, the memory address can be pre- or post-incremented or decremented, so that the stack can grow up or down in the memory space. The decoding operation (block_data_grp), activated by the mem_access_grp operation, inspects some coding bits to understand which addressing mode is required and assigns the pipeline flags for the execution stage operations. Here only one register index can be expressed, to define the base register for the memory addressing. The base register content must be transferred to mem_address_reg during the first execution cycle, so the datapath setup is performed by assigning the dedicated pipeline signals, particularly the write_result_f and mem_access_cycle_f flags. The writeback operation is also allowed, and a further piece of information, about the banked register access for the transfer operation, is defined within the coding by the S-bit. If the S-bit is set, in fact, both the load and store instructions transfer the user bank registers even if the operating mode is not user mode. This behavior is useful in process switching mechanisms and has the particularity of allowing a processor mode change by also transferring the SPSR_<mode> to the CPSR, if the program counter is in the register list. The decoding of the register list is performed by calling a 16-bit immediate value conversion function and the returned value is stored into an ad hoc global 16-bit register. Some details about the register list conversion are given in section 6.4, where some limits of the LISA language and of the LISATek toolsuite are discussed and an external solution is proposed.
This register contains a flag for every possible register file index and, by setting or resetting a single bit, the corresponding register is included in the transfer or not.

During the execution phase the block_data_transfer polling LISA operation takes control and, setting up the datapath according to the relevant pipeline flags, transfers the base register value into mem_address_reg. If the writeback operation is requested, writeback_op is activated during the first execution cycle, so that the mem_address_reg value is written back to the base register in the following machine cycle. At every execution of the block_data_transfer operation a pipeline stall for the subsequent cycle is scheduled, in order to allow the operation to be reactivated automatically in the following machine cycle. No transfer between the memory and the processor is performed until the second execution cycle; from then on the same operation is repeated identically for every register in the list.

In order to perform the register transfers in increasing numerical order (by index), the reg_list global register is inspected bit by bit, starting from the least significant position (which contains the flag for R0) and moving upwards until a non-zero bit is found. The position found represents the first register affected by the transfer operation and its index is used as the destination register for an LDM instruction or as the source register for an STM instruction. Here a check on the presence of the PC in the register list is performed, by inspecting bit 15 of the reg_list register. Since the byte_access_f pipeline flag is used to store the S-bit value, it is accessed to establish, in combination with the previous information, which register bank is involved in the transfer and whether the CPSR must be overwritten with the banked SPSR. Now the memory interface signal setup is performed, in the same manner discussed in the previous paragraph (5.8); the address used for the memory access is incremented or decremented before or after the access itself, according to the relevant pipeline flag. The store operation executes exactly n cycles, addressing and transferring the n registers that have their flag set in reg_list. The load operation, instead, starts sampling the data bus in the third execution cycle, so the datapath for storing the mem_data_reg content must be configured during the second step. The load operation takes one more cycle than the store operation, because of the required sampling of the values and their writeback into the register file. Neither instruction schedules further stall cycles when no more set bits are found in the reg_list register; if the last register to be transferred is R15, the SPSR_<mode> content is also copied into the CPSR. In this case a pipeline flush is scheduled and two further clock cycles are required for the pipeline refilling.
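The inspection of the register list and the decisions taken from the S-bit and from bit 15 can be sketched in Python as follows. The function names and dictionary keys are hypothetical; the rule used for the S-bit follows the architectural ARM behavior (CPSR restored only by a load with the PC in the list, user bank transfer otherwise) and the memory addressing itself is not modelled.

```python
def registers_in_list(reg_list):
    """Return the register indexes flagged in the 16-bit list, lowest first."""
    return [index for index in range(16) if (reg_list >> index) & 1]

def block_transfer_plan(reg_list, s_bit, is_load):
    """Summarize which registers are moved and how the S-bit is interpreted."""
    pc_in_list = bool(reg_list & (1 << 15))           # bit 15 flags R15
    restore_cpsr = bool(s_bit and is_load and pc_in_list)
    user_bank = bool(s_bit and not restore_cpsr)      # user bank transfer
    return {
        "registers": registers_in_list(reg_list),
        "pc_in_list": pc_in_list,
        "restore_cpsr": restore_cpsr,                 # SPSR_<mode> copied to CPSR
        "user_bank": user_bank,
    }
```

For a list such as 0x8003 the registers are visited in the order R0, R1, R15, and a load with the S-bit set ends by restoring the CPSR.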

5.10 The data swap instruction

The data swap instruction is used to swap a byte or word quantity between a register and the memory. This instruction is implemented as a memory read followed by a memory write operation “locked” together. The processor cannot be interrupted until both operations have completed and, in order to avoid modification of the memory content, the memory management unit is warned to treat them as inseparable, refusing memory access to other peripherals. The execution of a swap operation is signalled to the memory management unit using the LOCK processor output, which remains high during the execution of the operation.

The instruction decoding is performed by the SWP_dc LISA operation, which is activated by the other_grp operation. By inspecting the instruction coding the indexes of the source, destination and base registers are stored into dedicated pipeline registers and the barrel shifter and ALU setup signals are assigned to allow the base register content to be saved into mem_address_reg. The B-bit in the coding is inspected to establish the transfer data size, which can be a word or a single byte; this information is stored into the byte_access_f pipeline flag. The decoding operation activates the SWP_ex operation, where the usual LISA polling capabilities and the cycle_c counter are used to reschedule the operation over four subsequent machine cycles.

During the first execution cycle the pipeline signals for the execution units (ALU and shifter) are transferred to their wires and the stall operation is scheduled for the subsequent machine cycle. At the first reactivation (second execution cycle), the operation sets up the memory interface signals, assigning BS, nRW and MAS for the memory read operation and the relevant data size. The address bus (A) value is assigned by the specific memory access LISA operation, and the LOCK signal is tied high. In the third step the data bus (D or DIN) is accessed, and the sampled value is loaded into the mem_data_reg global register. The same value is written to the Rd register in the subsequent cycle, by setting the datapath signals so that no further data modifications are performed. In the same cycle the store operation is prepared, inverting the value of the nRW signal and keeping the LOCK signal high. Here the Rm register value is assigned to the data bus and the last stall operation is scheduled. During the fourth cycle the Rd register is written back and the LOCK control signal is tied low. When byte wide data is transferred, the masking operations already discussed for the ordinary load and store operations are also performed.
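The locked read and write sequence can be illustrated with a minimal Python sketch. The memory is modelled here as a plain dictionary indexed by address, which is an assumption made only for the example; the function name and the trace format are hypothetical, and the per-cycle timing of the model is reduced to three logical phases.

```python
def swap(memory, address, rm_value, byte=False):
    """Swap a register value with a memory location, tracing the LOCK output."""
    mask = 0xFF if byte else 0xFFFFFFFF
    trace = []
    lock = 1                                   # LOCK tied high for the read
    trace.append(("read", lock))
    loaded = memory.get(address, 0) & mask     # value destined to Rd
    trace.append(("write", lock))              # LOCK kept high, nRW inverted
    memory[address] = rm_value & mask          # Rm driven on the data bus
    lock = 0                                   # LOCK released at the end
    trace.append(("done", lock))
    return loaded, trace
```

The trace shows that LOCK stays high across both bus accesses, so that no other access can fall between the read and the write.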

5.11 Software interrupt and undefined instructions

The software interrupt instruction (SWI) and the undefined instruction (UND) share the behavior of changing the program counter content to execute a jump to the exception vector (section 3.4), allowing exception handling. Both instructions have to evaluate the condition field and the CPSR flags in the usual manner, and if the condition is true the expected operations are performed. The difference between the two instructions is that the software interrupt is executed without checking any other signal, whereas the undefined instruction is not. This is because, when the UND instruction is executed, the same instruction is passed to the coprocessors connected to the data bus and only if the dedicated handshaking lines are tied low is the instruction really considered undefined and the undefined instruction trap taken.

On entering the decoding stage, the other_grp operation is activated and, by inspecting the instruction coding, one operation among SWI_dc, UND_dc and SWP_dc (see section 5.10) is selected and activated. The SWI_dc operation ignores the 24-bit comment field, so it does not pass anything to the supervisor mode being entered. In order to change the program counter value, assigning the handler address (0x08), the datapath must be properly set to execute a MOV operation and the barrel shifter has to execute an LSL #0 operation, so that any data modification is avoided. The destination register index (15) is stored into the relevant pipeline register and the write_result flag is set. Then the operation activates the SWI_ex operation, which is executed in the subsequent clock cycle. When the SWI instruction enters the execute pipeline stage the PC is saved, by assigning its value to the supervisor mode link register (R14_svc), and the CPSR is also saved into the SPSR_svc register. The processor_mode variable is set to the supervisor value (defined in the header file) and this information is transferred to the mode bits by the main LISA operation. Because of the jump operation, a pipeline flush is scheduled, so the refilling takes two further clock cycles to complete the instruction execution.

The undefined instruction is decoded by the UND_dc LISA operation, which does nothing more than activate the EX stage UND_ex operation. When the instruction enters the EX pipeline stage, the LISA polling mechanism is exploited to re-execute the operation for two consecutive cycles and the cycle_c pipeline signal is used for step counting; the whole pipeline is obviously stalled. In the first execution cycle the coprocessor interface signal nCPI is tied low to start the handshaking (section 3.8) and in the subsequent clock cycle the CPA and CPB signals are evaluated to establish whether the instruction is accepted by a coprocessor. If both signals are tied low the instruction is really undefined, so the exception vector for the undefined instruction trap (0x04) must be stored into the program counter. To do so, the ALU and barrel shifter pipeline signals are assigned in the same way seen for the previous instruction and a pipeline flush is also scheduled because of the jump operation. To perform the operating mode change, the program counter value is assigned to the undefined mode link register (R14_und) and the CPSR is saved into the SPSR_und register. The processor_mode content is set to undefined, so that the main operation can properly assign the mode bits.
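The state changes common to the two exception entries can be sketched in Python as follows. The vector addresses (0x04 for the undefined instruction trap, 0x08 for the software interrupt) and the mode bit patterns are the standard ARM values; the dictionary-based processor state and the function name are hypothetical, and the adjustment of the saved return address is deliberately not modelled.

```python
VECTORS = {"und": 0x04, "svc": 0x08}            # ARM exception vector addresses
MODE_BITS = {"und": 0b11011, "svc": 0b10011}    # CPSR mode field M[4:0]

def enter_exception(state, mode):
    """Save PC and CPSR in the banked registers and jump to the vector."""
    state["r14_" + mode] = state["pc"]          # link register of the new mode
    state["spsr_" + mode] = state["cpsr"]       # saved program status register
    state["cpsr"] = (state["cpsr"] & ~0x1F) | MODE_BITS[mode]  # new mode bits
    state["pc"] = VECTORS[mode]                 # jump: pipeline flush follows
    return state
```

The two instructions differ only in the entries selected from the two tables, which matches the shared structure of the SWI_ex and UND_ex operations.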


Chapter 6

LISARM support tools

This chapter describes the most important aspects of the tools used, which were generated with the LISATek development environment and then adapted to other tools already available for ARM. Since the LISARM model and the generated toolchain are not completely compatible with the ARM7TDMI specifications, some solutions are proposed here, in order to allow the memory interfacing and to make commercial ARM family tools usable with the model obtained. Some further information about the generated VHDL and the related tests and simulations is also reported in the last paragraph.

6.1 The ARM LISA simulator

In order to check the behavior of the modelled processor, the C++ simulator generated by LISATek has been used intensively. The Processor Debugger GUI accepts an object file as input and allows the monitoring of various model resources like registers, memories, internal signals, pipeline behavior and events which occur during the program execution. By inspecting the windows reporting the assembly, the disassembly and the LISA microcode, the behavior of the LISA operations can be followed in depth, step by step, checking their effects on the processor state.

The model development has undergone two main phases:

• The instruction accurate model description.

• The cycle accurate model description.

In the first phase the pipeline behavior has been ignored, and the instruction set coding and syntax for every instruction have been described. In this early model all the operations were executed in a single clock cycle, in order to check the functionality of the various parts. In the second phase the behavior of the pipelined structure has been modelled and the code of the LISA operations has been distributed over the respective pipeline stages. In all these steps the Processor Debugger played a fundamental role, and a vast set of assembly files allowed many processor functionalities to be tested.

All the instructions described in chapter 5 have been tested by using the various operand syntax expressions, addressing modes and conditional execution mnemonics in the written assembly code. The assembler, linker, disassembler and LISATek simulator allowed an intensive verification work in which, step by step, the LISARM model has grown while respecting the behavior of the original ARM7TDMI core.

6.2 The memory wrapping

The LISARM model memory management differs from the ARM processor behavior in some aspects, because the LISATek toolsuite and the LISA language do not allow the sub-word addressing features of the original processor to be implemented. The LISA model resource section (ref. paragraph 5.1.1) defines a 32-bit memory organization, with 8-bit sub-blocks and a 32-bit address bus for the memory interfacing. Since every increment of the value which drives the address bus produces a 32-bit displacement, the address bus alone does not give the memory management unit enough information to access sub-word sized data types. The Memory Access Size (MAS) output expresses only the data size, so the position of a byte or halfword which is not on a 32-bit boundary has to be communicated by using an additional signal, the 2-bit BS line cited in paragraph 5.1.2. This signal is assigned cycle by cycle as the two least significant bits of the program counter (for an instruction fetch) or of the mem_address_reg (for a data access). To obtain an ARM-compliant memory interface, the memory wrapper has to consider the thirty least significant bits of the address bus as the thirty most significant bits of the real address, adding the BS value as its two least significant bits. In any case, the BS value tells the real position of the data wanted by the system, hence other approaches are allowed, depending on the memory system specifications.
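The address reconstruction described above can be sketched in a few lines of Python. This is only an illustrative model, not the wrapper actually used with LISARM: the function name `real_address` and its parameter names are hypothetical, and the sketch assumes that the address bus carries a word index while BS carries the byte offset inside that word.

```python
def real_address(address_bus: int, bs: int) -> int:
    """Rebuild an ARM-style byte address from a word-indexed bus value.

    address_bus: value driven by the model (assumed to count 32-bit words).
    bs:          2-bit byte-select value (assumed byte offset within the word).
    """
    if not 0 <= bs <= 3:
        raise ValueError("BS is a 2-bit signal")
    # The 30 least significant bus bits become the 30 most significant
    # bits of the real address; BS supplies the two least significant bits.
    return ((address_bus & 0x3FFFFFFF) << 2) | bs
```

For example, under these assumptions word index 1 with BS = 3 maps to byte address 7.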

The other problem presented by the LISATek model is the replication of sub-word sized data during the execution of a store instruction (STRB or STRH). The ARM processor provides the address of the exact location to access and places the same data byte on all the byte lanes, or the same data halfword on both halfword lanes. The byte-by-byte access can also be performed with LISARM, resolving the real address by using the BS output value as described above. Moreover, if the memory system requires this feature, the byte or halfword data can be connected to the other data bus lines dynamically, by observing the MAS output value. This choice, however, also has to take power consumption into account, which can advise against its implementation.
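A minimal sketch of the optional data replication follows. The MAS encoding used here (00 byte, 01 halfword, 10 word) is the one given in the ARM7TDMI data sheet; that LISARM keeps the same encoding is an assumption, and the function name `replicate_store_data` is hypothetical.

```python
# MAS[1:0] encoding as in the ARM7TDMI data sheet (assumed unchanged in LISARM).
MAS_BYTE, MAS_HALFWORD, MAS_WORD = 0b00, 0b01, 0b10


def replicate_store_data(data: int, mas: int) -> int:
    """Return the 32-bit bus word for a store, replicating sub-word data."""
    if mas == MAS_BYTE:
        # Same byte on all four byte lanes.
        return (data & 0xFF) * 0x01010101
    if mas == MAS_HALFWORD:
        # Same halfword on both halfword lanes.
        return (data & 0xFFFF) * 0x00010001
    if mas == MAS_WORD:
        return data & 0xFFFFFFFF
    raise ValueError("reserved MAS value")
```

A wrapper that skips the replication for power reasons would instead drive only the lane selected by BS and leave the others unchanged.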



The sub-word load operation executed by the ARM processor also has particular features that cannot be ignored, because the processor expects to sample the data in the right position with respect to the addressed memory location. The memory management unit has to evaluate the MAS signal to establish the size of the required data, and the wrapper must perform the real address calculation. The replication of the byte or halfword value on all the other lines is optional and can be avoided for power saving reasons.
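The lane selection implied by a sub-word load can be illustrated as follows. This is a hedged sketch for the little-endian organization only: the helper name `select_lane` is hypothetical, the MAS encoding is assumed to be the ARM7TDMI one (00 byte, 01 halfword, 10 word), and BS is assumed to hold the byte offset inside the word.

```python
def select_lane(bus_word: int, mas: int, bs: int) -> int:
    """Pick the byte or halfword addressed by BS out of a 32-bit bus word.

    Little-endian lanes are assumed: byte n sits on bits [8n+7:8n].
    """
    if mas == 0b00:                      # byte access
        return (bus_word >> (8 * bs)) & 0xFF
    if mas == 0b01:                      # halfword access, BS bit 1 picks the half
        return (bus_word >> (16 * (bs >> 1))) & 0xFFFF
    if mas == 0b10:                      # word access
        return bus_word & 0xFFFFFFFF
    raise ValueError("reserved MAS value")
```

The sketch shows where the data must appear on the bus for the processor to sample it correctly; a real wrapper would do the equivalent with multiplexers.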

All the other memory interface signals, such as nMREQ, SEQ and nRW, and also the main clock input (MCLK), can be connected directly to the memory management unit. The nMREQ signal can be monitored in order to deactivate the wrapper when memory accesses are not required by the processor. The memory wrapper scheme is reported in figure 6.1.

Figure 6.1. Memory wrapper scheme (blocks: RAM, Memory Wrapper; signals: data_bus, address_bus, 2-bit BS, 2-bit MAS, nMREQ, SEQ, nRW, MCLK)

The ARM processor allows the selection of a big-endian or little-endian configuration for data reads and writes in memory by means of a dedicated input (BIGEND). This signal can also be changed during program execution, and the processor has to manage the data transfer appropriately, cycle by cycle, every time it accesses the memory resource. The LISARM model does not allow the endianness configuration to be changed dynamically, because the LISATek tools and the LISA language do not support this feature. The model described supports only the little-endian memory organization, as selected in the resource section of the main.lisa file. This is another problem that a memory wrapper can solve, by crossing the single bytes with a set of multiplexers when the big-endian configuration is selected.
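The byte crossing mentioned above can be modelled, for word accesses only, by the sketch below. The function names are hypothetical, and the treatment of byte and halfword accesses in big-endian mode (which also depends on the access size) is deliberately left out.

```python
def swap_bytes32(word: int) -> int:
    """Reverse the four bytes of a 32-bit word (the multiplexer crossing)."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "little"), "big")


def wrapper_word(word: int, bigend: bool) -> int:
    """Pass the word through unchanged, or byte-swapped when BIGEND is set."""
    return swap_bytes32(word) if bigend else word & 0xFFFFFFFF
```

With BIGEND high, the word 0x11223344 written by the little-endian model would be presented to the memory as 0x44332211.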

A final consideration has to be made: the insertion of a memory wrapper and its internal structure influence the access timings of the whole system, hitting a well known weak point such as memory access.

6.3 ARM commercial toolchains

The spread of the ARM7TDMI processor in the mobile device market and its embedding in micro-controllers and complex systems have led to the diffusion of toolsuites and development environments for the creation of software applications. Since ARM Ltd. does not sell microprocessors but processor cores, many manufacturers have produced proprietary chips based on these cores, together with the related tools for software application development and optimization. Most of these tools accept C/C++ code and use a compiler whose target is the ARM7 instruction set, and many commercial cross-compilers (note 1) are also available. Cross-compiling tools are generally used to generate executable code for embedded systems or for platforms where it is inconvenient or impossible to compile, e.g. micro-controllers that run with a minimal amount of memory. During the model development some of these tools were used, such as the GNU/Linux ARM Toolchain and WinARM. Both implement cross-compiling toolchains and use the GNU gcc C/C++ compiler to generate ARM7 (note 2) executable binary files. The compilers accept many of the common gcc compiling options, and a complete set of tools such as assembler, disassembler and linker is also provided.

In order to explore the ARM processor functionalities and to perform some comparisons between the original ARM and the LISA model behavior, other instruments have been used, such as the SimIt-ARM tools. The SimIt-ARM suite contains an instruction-set emulator and a cycle-accurate simulator for the StrongARM architecture (note 3), which is a 32-bit predecessor of the ARM7TDMI processor. The StrongARM processor is based on the ARMv4 architecture, the base version that ARMv4T extends with the additional 16-bit Thumb ISA included in the ARM7TDMI processor. For these reasons the SimIt-ARM tools are fully compatible with the 32-bit ARM instruction set modelled by LISARM, and both the instruction-set emulator and the cycle-accurate simulator represented a valid alternative to relying solely on the analysis of the data sheet [16] specification.

Note 1: a cross-compiler is a compiler capable of creating executable code for a platform other than the one on which the cross-compiler is run.

Note 2: other members of the ARM processor family can also be selected as the target architecture.

Note 3: the StrongARM processor was a collaborative project between Digital Equipment Corporation (DEC) and ARM Ltd. to create a faster CPU based on the existing ARM line; the core was later sold to Intel, who continued to manufacture it before replacing it with the XScale.



6.4 ARM model toolchain adaption

For some aspects introduced in the LISARM model description (chapter 5), the LISA language cannot describe complex mechanisms for code assembling, and so the generated assembler does not implement all the capabilities of the original ARM assembler. The conversion of immediate values into the immed8_r format, used in single data transfer and data processing instructions, and the conversion of the register list used in block data transfer instructions present some aspects which cannot be implemented in the LISATek generated assembler. The obtained model accepts the same binary codings for the two sets of instructions, so that an application compiled with a commercial toolchain is compatible with the LISARM internal representation of immediate values and register lists, but assembly source code for a standard ARM processor cannot be assembled with the LISATek generated assembler. These problems can be overcome by using a pre-assembler tool which parses the ARM assembly code, retrieves the instructions which use the unsupported arguments and transforms them into a format suitable for the LISARM processor (note 4).

Note 4: the assembly syntax of these instructions has a closer correlation with the binary coding fields.

As described in section 5.3, the ARM7 assembler accepts only particular immediate values in the assembly syntax, i.e. all those numbers representable as a 32-bit unsigned integer obtained by zero extending an 8-bit unsigned value to 32 bits and rotating the result right by a number of bits equal to twice a 4-bit amount. Clearly only some integer values can be transformed into the so-called immed8_r format; all powers of two, for example. The ARM assembler is able to check whether an immediate value can be transformed as described, and it returns an error if the conversion is not feasible. If the assembly instruction is an arithmetic, move or compare operation, an alternative form can be exploited: since each of these operations has an opposite operation, the 1's complement form of the operand is used and the operation mnemonic contained in the assembly code is changed. The negation or logical inversion ensures the equivalence of the ALU operation, and the data processing instructions which allow this so-called instruction substitution are: ADD and SUB, ADC and SBC, AND and BIC, MOV and MVN, CMP and CMN. A particular decoding operation described within the model is dedicated to the immed8_r conversion: it decodes the 8-bit wide immediate value and the 4-bit wide right rotation semi-amount and drives the barrel shifter to recreate the original value, which is then supplied to the ALU.

The pre-assembling tool takes the immediate value expressed in the instructions cited above and stores it in a 32-bit internal variable. By using an 8-bit mask and executing a number of steps in which the mask is moved along the binary representation of the value, groups of 8 bits are selected, to establish whether the other bits are all zeros or all ones. Only if the other bits are all zeros is the value representable in the immed8_r format, and the instruction substitution is not required. Otherwise, if the other bits are all ones, the instruction substitution is needed and a bitwise NOT operation must be performed to obtain the 1's complement of the value. The selected 8-bit group is then converted back into a numeric value and the amount for the right rotation is also saved. These values are then converted into an assembly string and, if an instruction substitution is required, the appropriate mnemonic is used. The instruction in the assembly file is replaced by the supported format, and this operation is executed for every other instruction contained in the source file before control returns to the user. If an immediate value cannot be transformed as wanted, an error message reporting the code line number is displayed and the program exits.
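The conversion described above can be sketched in Python as follows. This is not the pre-assembler written for LISARM, whose source is not reproduced here: all function names are hypothetical, the search tries the sixteen even rotations instead of sliding a mask, and the substitution is shown only for the pairs whose equivalence relies on bitwise inversion (MOV/MVN and AND/BIC). Pairs that rely on arithmetic negation are left out of the sketch.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def encode_immed8_r(value: int):
    """Return (imm8, rot) with value == ror32(imm8, 2 * rot), or None."""
    for rot in range(16):
        # Rotating left by 2*rot undoes a right rotation by 2*rot.
        imm8 = ror32(value, 32 - 2 * rot)
        if imm8 <= 0xFF:
            return imm8, rot
    return None


def decode_immed8_r(imm8: int, rot: int) -> int:
    """Recreate the 32-bit operand as the barrel shifter would."""
    return ror32(imm8 & 0xFF, 2 * (rot & 0xF))


# Pairs for which inverting the operand preserves the result (illustrative subset).
INVERTING_PAIRS = {"MOV": "MVN", "MVN": "MOV", "AND": "BIC", "BIC": "AND"}


def substitute(mnemonic: str, value: int):
    """Return (mnemonic, imm8, rot), swapping the mnemonic if needed."""
    encoded = encode_immed8_r(value)
    if encoded is not None:
        return (mnemonic, *encoded)
    if mnemonic in INVERTING_PAIRS:
        encoded = encode_immed8_r(~value & MASK32)
        if encoded is not None:
            return (INVERTING_PAIRS[mnemonic], *encoded)
    raise ValueError(f"immediate {value:#x} is not representable")
```

Under these assumptions, `substitute("MOV", 0xFFFFFF00)` yields an MVN with the 8-bit value 0xFF and no rotation, while a value such as 0x101 is rejected because its set bits do not fit in any rotated 8-bit window.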

The LISARM disassembler produces code with the same syntax as the accepted format, because the same decoding LISA operation is used to generate both the assembler and the disassembler. To obtain an ARM compliant disassembly, the LISARM disassembled file has to be modified by another ad hoc tool, the post-disassembler, which calculates the immediate value as the processor datapath does, starting from the 8-bit and 4-bit values reported in the disassembly syntax and substituting the result in the file.

The block data transfer instructions (section 5.9) accept a list of general purpose registers to be transferred, which can be expressed in various ways: the complete set or a subset of the sixteen general purpose registers can be selected, and all the combinations are allowed. Single registers can be expressed by separating their names with commas, and the numerical order has to be respected. Consecutive registers can be grouped by using the "-" symbol, and the notation implies that all the registers included between the two outer identifiers must be transferred. The pre-assembling tool finds the LDM and STM opcodes in the assembly file and parses the register list in their syntax. For every register index included in the list, the corresponding bit of an internal 16-bit variable is set high, and the variable bits are then inspected to recreate an explicit list, i.e. a list in which the single register identifiers appear in numerical order, separated by commas and without group notation. The block data transfer decoding LISA operation is built to accept the list of registers defined in this way, so the syntax generated by the tool replaces the instructions retrieved in the assembly file. The decoding LISA operation also influences the behavior of the generated disassembler, which produces the same list of registers by checking which flags are high in the corresponding binary format, so it provides the explicit list without grouping symbols.
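The register list expansion can be illustrated with the sketch below. It is only an approximation of the tool described above: the function name is hypothetical, only the R0 to R15 names are recognised (aliases such as SP, LR and PC are ignored), and the exact explicit syntax expected by the LISARM assembler is assumed to be a brace-enclosed, comma-separated list.

```python
import re


def expand_register_list(text: str):
    """Return (16-bit mask, explicit list) for an ARM-style register list."""
    mask = 0
    for item in text.strip("{} ").split(","):
        match = re.fullmatch(r"[Rr](\d+)(?:\s*-\s*[Rr](\d+))?", item.strip())
        if match is None:
            raise ValueError(f"unsupported register item: {item!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if not 0 <= first <= last <= 15:
            raise ValueError(f"invalid register range: {item!r}")
        # Set one bit of the 16-bit variable for every register in the range.
        for index in range(first, last + 1):
            mask |= 1 << index
    # Inspect the bits in numerical order to rebuild the explicit list.
    names = [f"R{index}" for index in range(16) if (mask >> index) & 1]
    return mask, "{" + ", ".join(names) + "}"
```

With this sketch, the list `{R0-R3, R5, R14}` produces the mask 0x402F and the explicit form `{R0, R1, R2, R3, R5, R14}`.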

A variant of the described tool can also be used to obtain the converted format of an immediate value to be expressed within the assembly code, in order to execute simulations and tests on the model or to check whether the internal format corresponds to the expected value. The same can be done with the register list for block data transfer operations.



The diagram of the complete toolchain is reported in figure 6.2.

Figure 6.2. Complete toolchain diagram (blocks and files recoverable from the figure: C files, C libraries, C-compiler, assembly files, ARM assembler, pre-assembler, LISARM assembler, binary file, LISARM disassembler, post-disassembler, disassembly file)

6.5 HDL generation and tests

The hardware description of the LISARM model has been generated with the LISATek HDL Generator, selecting VHDL as the target language. Inspecting the generated files, a single description file can be found for each unit described in the model; hence the instruction decoder, condition checker, pipeline registers and pipeline controller, ALU, multiplier and barrel shifter have been implemented as separate VHDL entities. The execution stage also contains the various units which implement instruction groups such as data processing, memory access and branch, while all the other operations are grouped in the fetch, prefetch, decode and execute entities.

A LISA processor description, as discussed in chapter 4, can be used as a universal source for the generation of both software tools (assembler, linker, simulator) and RTL code (using HDL languages). While this universality is a fundamental strength of the LISA model, the software tools and the RTL code represent different abstractions of the processor: a software model is executed sequentially, whereas an RTL simulator must emulate the parallelism inherent in hardware. As a consequence, a LISA description that is optimal for software tool creation may not be as optimized as needed for hardware description generation. Since HDL generation is only the last development phase, while the processor behavior is analysed step by step using the software tools, the LISA description style must consider target hardware aspects from the very beginning.

During the model development the HDL compiler has been used many times to evaluate the correctness of the written LISA code. The first step towards feasible hardware is the selection of the right LISA resources: the local and global variables used to communicate values between LISA operations have been chosen keeping in mind the mapping between LISA and RTL resources. An appropriate selection of TClocked data types has allowed the wanted processor behavior to be obtained, particularly in the cycle-accurate modeling phase. The scope of the signals and their initialization were another important aspect to take into account. As discussed in chapter 4, LISA operation reuse allows noteworthy hardware optimizations; here the right selection of signals plays a fundamental role for every read and assignment operation, because an RTL signal can be driven by only one unit at a time and the other components can only read its value. In order to understand which modeling style was the most appropriate, many LISA code compilations and simulations were executed, so the HDL Generator was another fundamental tool for the model development.

The LISATek HDL Generator, besides the VHDL and Verilog descriptions, is able to provide the ModelSim configuration files for RTL code simulation. The LISATek GUI allows the selection of a machine language file for the architecture simulation and, during the tools generation flow, its memory dump is automatically stored in a file. The generation flow also creates a ModelSim configuration file, which sets up the simulation environment for testing the generated HDL code. When ModelSim is launched with this configuration file, the selected memory dump file is loaded into the processor resources and a step by step verification of the architecture behavior can be executed. The ModelSim simulator provides many valuable instruments for in-depth analysis of hardware behavior, particularly through its wave viewer. Since the first steps of LISARM development, hardware simulations allowed model behavior inconsistencies to be discovered and the corresponding parts of the LISA description to be corrected.



Figure 6.3. A ModelSim simulator screenshot


Chapter 7

Conclusions and possible future applications

This chapter contains some concluding considerations about the thesis work presented in this volume and sketches out some future applications of the resulting LISARM model.

7.1 Conclusions

The LISATek toolsuite, and particularly the LISA description language, has proved to be an optimal environment for the LISARM architecture development, thanks first of all to its approach to tackling complexity. The choice of beginning the model development with an instruction-accurate model, instead of starting with its cycle-accurate description, has allowed the efforts to be concentrated on the assembly syntax and the coding scheme imposed by the ARM instruction set. Here the LISA language features have simplified the description of the ARM coding format, allowing sparse bits to be grouped into blocks to be decoded separately, and the LISA decoding mechanisms were exploited to spread the operation complexity over many sub-steps. The subsequent phase, which concerned the shift from the instruction-accurate to the cycle-accurate model, has allowed the attention to be focused on instruction cycle timing aspects, exploiting in a very simple manner the LISA language capabilities for describing the pipeline structure and for managing and scheduling its events.

Some of the LISA language mechanisms for multi-cycle instruction description were not accepted by early LISATek HDL Generator versions, and the real processor behavior was achievable only for simulation purposes, by using a scheduling mechanism that has since been improved. Since the latest LISATek toolsuite versions provide LISA operation polling, operations can be rescheduled and the ARM processor behavior for multi-cycle operations can be respected. A big challenge with LISATek was the adaptation of the description style to the structural aspects of the hardware. Here a good knowledge of HDL is required, in order to obtain an optimized architecture description and to preserve the hardware hierarchy, which allows a subsequent optimization by exploiting the HDL architectures and libraries used by synthesis tools.

The LISATek toolsuite is a collection of software applications in continuous growth, and even if some capabilities can be expressed in the LISA language description, they are not completely supported by all the tools belonging to the suite. The fundamental concept of obtaining a simulator, an HDL description and a toolchain (compiler, assembler, linker, disassembler) from a single description, together with the possibility of co-simulating the hardware and software models, makes LISATek an ideal development approach, in which architecture refinements or wide modifications can be made while saving time and money, since the separate maintenance of all the model parts and tools is avoided.

7.2 Possible future applications

Starting from the LISARM model, some future applications can be obtained. Some of the model limits can be overcome by using a more complete and efficient version of the LISATek toolsuite, while maintaining most of the LISA description. Some features supported by the Processor Generator, such as intercommunication bus definition and full dynamic memory interfacing, are not yet supported by the HDL Generator, so they cannot be used in the model description. The big-endian memory organization is also not supported by the current versions of the tools, although external hardware solutions can be adopted. The memory wrapper described in chapter 6 resolves another LISATek problem concerning memory management, but an efficient use of sub-word sized data cannot be left to external units.

The processor model which is the object of this work is not intended to be a fully ARM compliant clone, hence some aspects of the ARM7TDMI architecture were intentionally ignored since the first platform exploration steps. The coprocessor communication capabilities and the related ARM instructions are not implemented in the model, because the LISA description allows the present instruction set to be extended without requiring external dedicated cores, so that a similar approach appeared quite superfluous. Besides ARM instruction set extensions for specific applications, the removal of unneeded instructions is also allowed, so that an optimal instruction set architecture can be obtained.

The Thumb micro-architecture and the related instruction set implemented by ARM, which allow 16-bit instructions to be executed on a full featured 32-bit architecture, seem to be a valid strategy to reach an increased code density, so they can be considered as an enhancement of the obtained model. The dynamic translation method, which transforms the Thumb 16-bit instructions into the ARM 32-bit ones, can also be exploited in the LISARM model; in fact the LISA language allows the implementation of complex mechanisms, which are also used for VLIW processor description. Here the problem of sub-word data addressing arises once more but, for the considerations explained above, some solutions could be found in future toolsuite releases.

Another deep revision of the obtained model can target the realisation of a Harvard architecture, with the instruction throughput advantages which can derive from it. By adding the appropriate pipeline stage and maintaining most of the LISA code which describes the behavior of the single instructions, it is possible to move the memory access capabilities to the added stage, without the burden of re-engineering the whole system and the pipeline controller. Before the model is used in an embedded application, an intensive testing phase has to be executed, so that the LISARM model specification can be validated in depth and all the parts of the processor architecture can ensure full ARM instruction set compliance. Besides the simulator, the generated HDL description must also be verified, and then its synthesis can be performed with commercial tools to allow timing and area parameters to be extracted. A subsequent functional verification phase can also be executed using the LISATek co-simulation tools, so that the real hardware description can be tested with the same patterns used for the simulator refinement, checking the response of both models at the same time.


Appendix A

Model LISA operations summary

This appendix contains a summary of all the LISA operations contained in the model. For each operation, the pipeline stage, the decoding group (see figure 5.2) and the file to which it is assigned are indicated.

Table A.1. LISA operations summary

LISA operation            Stage  Decoding group   File (.lisa)
ADC_dc                    DC     arith_logic_grp  data_proc_instructions
ADD_dc                    DC     arith_logic_grp  data_proc_instructions
add_PSR_update            EX     none             alu_operations
AL_dc                     DC     condition        conditionfield
AL_ex                     EX     none             conditionfield
alu_adc                   EX     none             alu_operations
alu_add                   EX     none             alu_operations
alu_and                   EX     none             alu_operations
alu_bic                   EX     none             alu_operations
alu_eor                   EX     none             alu_operations
alu_mov                   EX     none             alu_operations
alu_mvn                   EX     none             alu_operations
alu_operation             EX     none             alu_operations
alu_orr                   EX     none             alu_operations
alu_rsb                   EX     none             alu_operations
alu_rsc                   EX     none             alu_operations
alu_sbc                   EX     none             alu_operations
alu_sub                   EX     none             alu_operations
AND_dc                    DC     arith_logic_grp  data_proc_instructions
arith_logic_grp           DC     data_proc_grp    data_proc_instructions
arith_logic_grp_setup     EX     none             data_proc_instructions
ASL_dc_IN                 DC     bs_op            barrel_shifter
ASR_dc_IN                 DC     bs_op            barrel_shifter
B_dc                      DC     branch_grp       branch_instructions
B_ex                      EX     none             branch_instructions
barrel_shifter_op         EX     none             barrel_shifter
barrel_shifter_op_dc      DC     none             barrel_shifter
BIC_dc                    DC     arith_logic_grp  data_proc_instructions
BL_dc                     DC     branch_grp       branch_instructions
BL_ex_poll                EX     none             branch_instructions
block_data_grp            DC     mem_access_grp   mem_access_instructions
block_mem_access_ex_poll  EX     none             mem_access_instructions
branch_grp                DC     none             branch_instructions
branching_flag_reset      FE     none             branch_instructions
BX_dc                     DC     branch_grp       branch_instructions
BX_ex                     EX     none             branch_instructions
CC_dc                     DC     condition        conditionfield
CC_ex                     EX     none             conditionfield
CMN_dc                    DC     cmp_grp          data_proc_instructions
CMP_dc                    DC     cmp_grp          data_proc_instructions
cmp_grp                   DC     data_proc_grp    data_proc_instructions
CPSR                      DC     none             other_instructions
CPSR_all                  DC     none             other_instructions
CPSR_selection            DC     none             other_instructions
CS_dc                     DC     condition        conditionfield
CS_ex                     EX     none             conditionfield
data_proc_grp             DC     none             data_proc_instructions
data_sample_op            EX     none             mem_access_instructions
decode                    DC     none             main
EOR_dc                    DC     arith_logic_grp  data_proc_instructions
EQ_dc                     DC     condition        conditionfield
EQ_ex                     EX     none             conditionfield
fetch                     FE     none             main
GE_dc                     DC     condition        conditionfield
GE_ex                     EX     none             conditionfield
GT_dc                     DC     condition        conditionfield
GT_ex                     EX     none             conditionfield
HI_dc                     DC     condition        conditionfield
HI_ex                     EX     none             conditionfield
imm_branch_offset         DC     none             branch_instructions
immed_8r                  DC     none             data_proc_instructions
immediate_offset          DC     none             mem_access_instructions
immediate_operand         DC     none             data_proc_instructions
immediate_value_12        DC     none             misc_ops
immediate_value_4         DC     none             misc_ops
immediate_value_5         DC     none             misc_ops
immediate_value_8         DC     none             misc_ops
implicit_PSR_update_req   DC     none             misc_ops
LDM_dc                    DC     block_data_grp   mem_access_instructions
LDR_dc                    DC     std_data_grp     mem_access_instructions
LDRH_dc                   DC     su_data_grp      mem_access_instructions
LDRSB_dc                  DC     su_data_grp      mem_access_instructions
LDRSH_dc                  DC     su_data_grp      mem_access_instructions
LE_dc                     DC     condition        conditionfield
LE_ex                     EX     none             conditionfield
logop_PSR_update          EX     none             alu_operations
LS_dc                     DC     condition        conditionfield
LS_ex                     EX     none             conditionfield
LSL_dc_IN                 DC     bs_op            barrel_shifter
LSR_dc_IN                 DC     bs_op            barrel_shifter
LT_dc                     DC     condition        conditionfield
LT_ex                     EX     none             conditionfield
main                      none   none             main
mem_access_ex_poll        EX     none             mem_access_instructions
mem_access_grp            DC     none             mem_access_instructions
mem_access_preset_dc      DC     none             mem_access_instructions
MI_dc                     DC     condition        conditionfield
MI_ex                     EX     none             conditionfield
MLA_ex_poll               EX     none             multiply_instructions
MLAL_ex_poll              EX     none             multiply_instructions
MOV_dc                    DC     mov_grp          data_proc_instructions
mov_grp                   DC     data_proc_grp    data_proc_instructions
MRS_dc                    DC     none             other_instructions
MRS_ex                    EX     none             other_instructions
MSR_dc                    DC     none             other_instructions
MSR_ex                    EX     none             other_instructions
MSR_flg_dc                DC     none             other_instructions
MSR_flg_ex                EX     none             other_instructions
MSR_immed_8r              EX     none             other_instructions
MSR_reg_op                DC     none             other_instructions
MUL_dc                    DC     multiply_grp     multiply_instructions
MUL_ex_poll               EX     none             multiply_instructions
MULL_dc                   DC     multiply_grp     multiply_instructions
MULL_ex_poll              EX     none             multiply_instructions
multiply_grp              DC     none             multiply_instructions
MVN_dc                    DC     mov_grp          data_proc_instructions
NE_dc                     DC     condition        conditionfield
NE_ex                     EX     none             conditionfield
non_shifted_reg           DC     none             data_proc_instructions
NOP_dc                    DC     none             misc_ops
NOP_ex                    EX     none             misc_ops
ORR_dc                    DC     arith_logic_grp  data_proc_instructions
other_grp                 DC     none             other_instructions
PL_dc                     DC     condition        conditionfield
PL_ex                     EX     none             conditionfield
post_indexed              DC     none             mem_access_instructions
post_store_op             EX     none             mem_access_instructions
pre_indexed               DC     none             mem_access_instructions
prefetch                  PF     none             main
program_relative          DC     none             mem_access_instructions
PSR_access_grp            DC     none             other_instructions
PSR_no_update_req         DC     none             misc_ops
PSR_update_req            DC     none             misc_ops
reg_amount_shifted_reg    DC     none             data_proc_instructions
reg_index                 DC     none             misc_ops
register_list_conv        DC     none             misc_ops
reset                     none   none             main
ROR_dc_IN                 DC     bs_op            barrel_shifter
RRX_dc_IN                 DC     bs_op            barrel_shifter
RSB_dc                    DC     arith_logic_grp  data_proc_instructions
rsb_PSR_update            EX     none             alu_operations
RSC_dc                    DC     arith_logic_grp  data_proc_instructions
SBC_dc                    DC     arith_logic_grp  data_proc_instructions
shift_amount_reg_access   EX     none             data_proc_instructions
shifted_reg               DC     none             data_proc_instructions
shifted_reg_offset        DC     none             mem_access_instructions
shifted_reg_operand       DC     none             data_proc_instructions
SPSR                      DC     none             other_instructions
SPSR_all                  DC     none             other_instructions
SPSR_selection            DC     none             other_instructions
std_data_grp              DC     mem_access_grp   mem_access_instructions
STM_dc                    DC     block_data_grp   mem_access_instructions
STR_dc                    DC     std_data_grp     mem_access_instructions
STRH_dc                   DC     su_data_grp      mem_access_instructions
su_data_grp               DC     mem_access_grp   mem_access_instructions
SUB_dc                    DC     arith_logic_grp  data_proc_instructions
sub_PSR_update            EX     none             alu_operations
SWI_dc                    DC     other_grp        other_instructions
SWI_ex                    EX     other_grp        other_instructions
SWP_dc                    DC     other_grp        other_instructions
SWP_ex_poll               EX     other_grp        other_instructions
symb_branch_offset        DC     none             branch_instructions
TEQ_dc                    DC     cmp_grp          data_proc_instructions
TST_dc                    DC     cmp_grp          data_proc_instructions
unc_dc                    DC     condition        conditionfield
unc_ex                    EX     none             conditionfield
VS_dc                     DC     condition        conditionfield
VS_ex                     EX     none             conditionfield
write_result              EX     none             alu_operations
writeback_op              EX     none             mem_access_instructions
zero_offset               DC     none             mem_access_instructions


Bibliography

[1] Von Neumann architecture: http://en.wikipedia.org/wiki/Von_Neumann_architecture, Wikipedia, the free encyclopedia.

[2] D. A. Patterson, D. R. Ditzel, "The case for the Reduced Instruction Set Computer", ACM SIGARCH Computer Architecture News, Vol. 8, Issue 6, October 1980, pp. 25–33.

[3] CPU design: http://en.wikipedia.org/wiki/CPU_design, Wikipedia, the free encyclopedia.

[4] Reduced Instruction Set Computer: http://en.wikipedia.org/wiki/RISC, Wikipedia, the free encyclopedia.

[5] L. Xing, G. Fernandes, P. Kulkarni, S. R. Marupudi, S. P. Melacheruvu, M. Pragada, "RISC versus CISC - Project report for computer architecture", University of Massachusetts - Dartmouth.

[6] ARM architecture: http://en.wikipedia.org/wiki/ARM_architecture, Wikipedia, the free encyclopedia.

[7] I. J. Huang, Y. Liang Hung, C. S. Lai, "Cost-effective microarchitecture optimization for ARM7TDMI microprocessor", National Sun Yat-Sen University - Taiwan.

[8] CoWare website: www.coware.com.

[9] LISA Language Reference Manual - CoWare - Product Version V2005.2.1 - Feb 2006.

[10] Tensilica website: www.tensilica.com.

[11] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, H. Meyr, "A novel methodology for the design of Application-Specific Instruction-Set Processors (ASIPs) using a machine description language", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 20, No. 11, November 2001, pp. 1338–1354.

[12] O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, H. Meyr, "Architecture implementation using machine description language LISA", Proceedings of the 15th International Conference on VLSI Design (VLSID) 2002, IEEE Computer Society, 2002.

[13] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, H. Meyr, "A methodology for the design of Application-Specific Instruction-Set Processors (ASIP) using the machine description language LISA", IEEE Computer Society, 2001.

[14] LISATek Release Informations - CoWare - Product Version V2005.2.1 - Feb 2006.

[15] LISATek Methodology Guidelines for the Processor Generator - Product Version V2005.2.1 - Feb 2006.

[16] ARM7TDMI Data Sheet, Advanced RISC Machines Ltd. (ARM), August 1995.

[17] ARM Developer Suite - Assembler Guide, ARM Ltd., 2001, version 1.2.

[18] S. B. Furber, ARM System-on-Chip Architecture, Addison Wesley Longman, March 2000, 2nd edition.

[19] P. Knaggs, S. Welsh, ARM: assembly language programming, Bournemouth University School of Design, Engineering and Computing, August 2004.

[20] LISATek Processor Designer Manual - CoWare - Product Version V2005.2.0 - Dec 2005.

[21] LISATek Creation Manual - CoWare - Product Version V2005.2.0 - Dec 2005.

[22] LISATek Processor Debugger Manual - CoWare - Product Version V2005.2.0 - Dec 2005.
