College of Computer Science: Research Capacity and R&D Results


Architecture and Systems Research Group
Presenters: 單智君, 陳昌居, 鍾崇斌

November 30, 2006 (ROC year 95)


College of Computer Science research groups: research groups of the Institute of Computer Science and Engineering

Architecture and Systems: 鍾崇斌, 單智君, 陳昌居


Architecture and Systems: Research Directions

- Embedded processor and SoC
- Java processor, JIT compilation & VM
- DSP designs and compilation
- Low-power systems
- Graphic processor
- Superscalar ARM processor
- Reconfigurable computing
- Asynchronous circuits


Architecture and Systems: R&D Results

- ARM9-compatible processor with video/audio capabilities
- Java stack operations folding
- Memory Constrained Java Just-in-time Compiler
- Asynchronous 8051 for low-power SOC applications
- DSP: instruction set extensions
- Low-power Branch-Target-Buffer
- Low-power bus encodings
- Low-power cache memory
- Graphic processor design techniques
- Superscalar ARM
- Reconfigurable computing


ARM9-compatible Processor with Audio/Video Capabilities

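ARMAVP (ARM Audio Video Processor) is a 32-bit microprocessor with a well-balanced five-stage pipeline: Fetch Unit, Decoder Unit, Execution Unit, Memory Access Unit, and Write Back Unit. Each stage is optimized for performance to raise the clock frequency, and efficient mechanisms reduce the performance impact of slow memory.

Features
- Support for Conditional Execution
- ABP buffer design that shortens instruction fetch time
- Precise interrupt control structure
- Asynchronous memory access
- Dynamic register bank mapping
- Fast handling of branch instructions
- Versatile, efficient execution path
- Distributed instruction control encoding

Functional verification and evaluation
All functions have been verified on an Altera EP20K600EBC652-1. According to simulation results for the Decode Stage, the design runs at 45 MHz on the FPGA and is expected to reach 210 MHz when implemented as a chip.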


Java Stack Operations Folding

- JVM: Stack Based Machine
- JVM Performance Bottleneck: Stack Operation Dependency

[Figure: Before Folding vs. After Folding. Bytecodes are classified as Producer (P), Operator (O), or Consumer (C) and move data among the Constant Register (CR), Local Variable (LV), Operand Stack, Execution Unit, Branch Unit, and Complex Instr. blocks. Before folding, five stack operations (1 to 5) are shown; after folding, three remain: 1' = 1 fold 2, 3, and 5' = 4 fold 5.]
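The folding idea can be illustrated with a minimal Python sketch. It is not the processor's actual folding logic: it assumes a tiny hypothetical bytecode subset (`iload`, `iconst`, `iadd`, `isub`, `imul`, `istore`) and folds only one pattern, Producer, Producer, Operator, Consumer, into a single register-style operation that no longer touches the operand stack. The function names `fold` and `source` are illustrative.

```python
from typing import List, Tuple

Instr = Tuple[str, int]  # (opcode, operand); the operand is unused for operators

# Assumed classification for this sketch only.
PRODUCERS = {"iload", "iconst"}                      # push a local variable or a constant
OPERATORS = {"iadd": "+", "isub": "-", "imul": "*"}  # pop two values, push one
CONSUMERS = {"istore"}                               # pop a value into a local variable


def source(instr: Instr) -> str:
    """Name the value a producer would have pushed onto the operand stack."""
    op, arg = instr
    return f"LV{arg}" if op == "iload" else str(arg)


def fold(code: List[Instr]) -> List[str]:
    """Fold each P, P, O, C group into one stack-free operation."""
    out: List[str] = []
    i = 0
    while i < len(code):
        w = code[i:i + 4]
        if (len(w) == 4 and w[0][0] in PRODUCERS and w[1][0] in PRODUCERS
                and w[2][0] in OPERATORS and w[3][0] in CONSUMERS):
            out.append(f"LV{w[3][1]} = {source(w[0])} {OPERATORS[w[2][0]]} {source(w[1])}")
            i += 4
        else:
            # Not a foldable group: keep the original stack operation.
            op, arg = code[i]
            out.append(op if op in OPERATORS else f"{op} {arg}")
            i += 1
    return out


program = [("iload", 1), ("iload", 2), ("iadd", 0), ("istore", 3)]
folded = fold(program)  # four stack operations become one: "LV3 = LV1 + LV2"
```

Folding removes the dependency chain through the operand stack: the four original bytecodes must execute in order because each reads what the previous one pushed, while the folded form reads and writes local variables directly.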


Memory Constrained Java Just-in-time Compiler

- Mixed-mode execution: complex bytecode is executed by the interpreter
- Fast compilation: two-pass compilation; simple but effective optimizations; about 300 cycles per bytecode
- Small memory usage: about 23 KB static footprint; a 4 KB code buffer is sufficient for common usage


Asynchronous 8051 for Low-Power SOC Applications

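SA8051 (Balsa Asynchronous 8051) is an 8-bit low-power microcontroller compatible with the Intel MCS-51. It is designed with asynchronous circuits, and its dynamic power consumption is about one third that of the synchronous version.

Features
- No central clock
- 4-phase handshake design
- Soft-core processor
- Low power consumption
- Integrates with synchronous IP through a handshake interface
- Data and control paths optimized

Functional verification and evaluation
All functions have been verified on a Xilinx FPGA Spartan IIE 300 ft256. According to XPower simulation results, dynamic power consumption is about one third that of the synchronous version.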

[Figure: SA8051 block diagram. The CPU connects to ROM and RAM through handshake interfaces. CPU-side channel signals carry request, acknowledge, and data suffixes (for example rom__addr_0r, rom__addr_0a, rom_addr_0d, and the corresponding ram address, rNw, indata, and outdata signals), plus activate, reset, and ports P0 to P3 (in and out). Memory-side signals: rom_en, rom_addr, rom_data, Rom_rfd, ram_en, ram_wr, ram_addr, indata, outdata, Ram_rfd. The figure is labeled 5 MHz.]
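The 4-phase handshake named in the feature list can be sketched in Python as follows. This is a generic return-to-zero handshake model, not the SA8051 design itself; the function name `four_phase_transfer` and the single-channel setup are assumptions made for illustration.

```python
def four_phase_transfer(values):
    """Send each value over one channel using a 4-phase (return-to-zero) handshake.

    Returns the values the receiver latched and the (req, ack) level after
    every phase, so the protocol order can be inspected.
    """
    req, ack = 0, 0
    received, trace = [], []
    for value in values:
        data = value              # sender drives the data wires
        req = 1                   # phase 1: sender raises request, data is valid
        trace.append((req, ack))
        received.append(data)     # receiver latches the data
        ack = 1                   # phase 2: receiver raises acknowledge
        trace.append((req, ack))
        req = 0                   # phase 3: sender withdraws request
        trace.append((req, ack))
        ack = 0                   # phase 4: receiver withdraws acknowledge
        trace.append((req, ack))
    return received, trace


received, trace = four_phase_transfer([0x12, 0x34])
```

Each transfer is paced by the request and acknowledge wires rather than by a clock edge, which is why a design of this kind needs no central clock and can wait for a slower synchronous block at the interface.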


DSP: Instruction Set Extensions

Current directions
- Application-specific instruction set extensions (ISE) generation

Why ISE?
- Improve performance
- Keep the flexibility and efficiency of the original processor

What is ISE?
- Group frequently executed instruction patterns into an extended instruction
- Executed in extra hardware, the "Application Specific Functional Unit (ASFU)"

[Figure: datapath with a Register File, functional units (ALU, MUL, LD/ST, ASFU, ...), and Main Memory.]
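As a rough illustration of "grouping frequently executed instruction patterns", the sketch below counts adjacent opcode sequences in an execution trace and reports the most frequent one as an ISE candidate. This is a deliberate simplification: practical ISE generation usually works on dataflow graphs and hardware constraints, and the function name `candidate_patterns` and the sample trace are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def candidate_patterns(trace: List[str], length: int = 2, top: int = 1) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the `top` most frequent opcode sequences of the given length."""
    windows = (tuple(trace[i:i + length]) for i in range(len(trace) - length + 1))
    return Counter(windows).most_common(top)


# Hypothetical dynamic trace of executed opcodes.
trace = ["mul", "add", "ld", "mul", "add", "st", "mul", "add"]
best = candidate_patterns(trace)  # the mul-then-add pair occurs three times
```

A pattern such as multiply followed by add, if it dominates the trace, is the kind of group that could be implemented as one extended instruction in an ASFU.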


DSP: Instruction Set Extensions (cont.)

Current research topics
- Multiple-issue architecture: exploring ISE in a multiple-issue architecture, such as superscalar or Very Long Instruction Word (VLIW)
- Hardware reusability: reuse the same or similar hardware resources in different ASFUs while keeping the same performance
- Overcoming the register file read/write port constraint: try to schedule the inputs and outputs of an ASFU in different time slots


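Low-power Branch Target Buffer

BTB lookup operations for non-branch instructions are useless and only waste power.

- Branch Distance Generation and Collection: collect and record the number of non-branch instructions between two adjacent branch instructions.
- Next Upcoming Branch Instruction Location: obtain the location of the next branch instruction and stop all BTB lookups until it arrives.

[Figure: Branch Distance Table]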
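A minimal Python sketch of the branch-distance idea follows. It is an assumption-laden model, not the group's hardware: the table is keyed by the first PC executed after a branch (so the distance along straight-line code is deterministic), it has unlimited capacity, and the name `count_btb_lookups` is illustrative.

```python
def count_btb_lookups(trace, branch_pcs):
    """Count BTB lookups when known non-branch instructions skip the lookup.

    trace:      list of executed PCs in order
    branch_pcs: set of PCs that hold branch instructions
    """
    distance_table = {}  # run start PC -> number of non-branch instructions before the next branch
    lookups = 0
    new_run = True       # True when the next instruction starts a run after a branch
    for pc in trace:
        if new_run:
            run_start, run_len = pc, 0
            known = pc in distance_table
            skip = distance_table.get(pc, 0)
            new_run = False
        if known and skip > 0:
            skip -= 1     # known non-branch instruction: BTB stays idle
        else:
            lookups += 1  # unknown distance, or the branch itself: look up the BTB
        if pc in branch_pcs:
            if not known:
                distance_table[run_start] = run_len  # record the collected distance
            new_run = True
        else:
            run_len += 1
    return lookups


loop = [0, 1, 2, 3] * 3  # a four-instruction loop executed three times, branch at PC 3
saved = len(loop) - count_btb_lookups(loop, {3})
```

In this example the first iteration performs four lookups while the distance is being collected; each later iteration needs only the one lookup for the branch itself.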


Low-power Bus Encodings

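We propose different low-power bus encoding systems tailored to the characteristics of different bus architectures. These systems combine several encoding methods so that data crossing the bus is transmitted in the most power-efficient form, which saves power.

Low-power bus encoding systems

Processor with separate instruction memory and data memory
- Instruction address bus: T0 + Discontinuous Address Table
- Instruction bus: BIBITS with Register Relabeling
- Data address bus: T0_BI_1, Variable-Stride, SRWEC
- Data bus: Leading-bytes encoding

Processor with a single memory (mixed instruction and data buses)
- Mixed address bus: I/D Selector, T0 DAT + Stride-Table
- Mixed bus: I/D Selector, BIBITS_RR + Leading-bytes

Bus encoding architecture
- Sender: an encoder converts the original data into encoded data
- Bus: carries the encoded data together with extra control lines
- Receiver: a decoder recovers the original data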
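To show how an encoding with an extra control line can reduce bus switching, here is a small Python sketch of plain T0 address encoding, the basic scheme that the T0-based entries above build on. It does not model the Discontinuous Address Table or the other schemes named on the slide; the stride of 4, the 32-bit bus width, and the function names are assumptions for illustration.

```python
def t0_encode(addresses, stride=4):
    """T0 encoding: for a sequential address, freeze the bus and assert INC."""
    encoded, prev_addr, bus = [], None, 0
    for addr in addresses:
        if prev_addr is not None and addr == prev_addr + stride:
            inc = 1      # receiver computes prev + stride itself; bus lines do not toggle
        else:
            inc = 0
            bus = addr   # non-sequential address: send it as-is
        encoded.append((inc, bus))
        prev_addr = addr
    return encoded


def t0_decode(encoded, stride=4):
    """Recover the original address stream from (INC, bus) pairs."""
    out, prev = [], None
    for inc, bus in encoded:
        prev = prev + stride if inc else bus
        out.append(prev)
    return out


def transitions(words):
    """Total number of bit toggles between consecutive bus words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


addrs = [0x100, 0x104, 0x108, 0x10C, 0x200]
enc = t0_encode(addrs)
raw_toggles = transitions(addrs)
# Treat the INC control line as one more wire above an assumed 32-bit bus.
enc_toggles = transitions([(inc << 32) | bus for inc, bus in enc])
```

For this stream the unencoded bus toggles 8 bits in total, while the encoded bus plus the INC line toggles 4, and `t0_decode` returns the original addresses.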


Low-power Cache Memory

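Cache memory accounts for more than 50% of total processor power consumption.

Low-power cache memory design
- Loop Buffer: place loop code in a loop buffer with low-power access to save instruction fetch power.
- Power Manager: put infrequently used cache blocks into low-power mode to save cache static power.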

[Figure: the CPU fetches through a Loop Buffer (low-power accesses) and the cache (normal accesses), with 70% and 30% labels; a Power Manager switches cache blocks between Normal mode and Low-power mode.]
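The loop buffer idea can be sketched with a simple Python fetch model. This is a hypothetical model rather than the design on the slide: it assumes the buffer always holds the most recently fetched instructions, so a loop body becomes available as soon as a short backward jump is detected, and the capacity of 8 instructions and the name `fetch_sources` are illustrative.

```python
def fetch_sources(trace, capacity=8):
    """Split instruction fetches into loop-buffer hits and normal cache fetches."""
    lo = hi = None  # PC range of the loop body currently served by the loop buffer
    buffer_hits = cache_fetches = 0
    prev = None
    for pc in trace:
        if lo is not None and lo <= pc <= hi:
            buffer_hits += 1    # low-power access
        else:
            cache_fetches += 1  # normal cache access
            lo = hi = None      # execution left the loop: stop using the buffer
        if prev is not None and pc <= prev and prev - pc < capacity:
            lo, hi = pc, prev   # short backward jump: [pc, prev] is a loop body
        prev = pc
    return buffer_hits, cache_fetches


# A four-instruction loop executed three times, then one instruction after the loop.
hits, misses = fetch_sources([0, 1, 2, 3] * 3 + [4])
```

The longer a small loop runs, the larger the share of fetches served by the low-power buffer instead of the cache.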


Graphic Processor

[Figure: rendering pipeline block diagram, shown twice. Labeled blocks: Vertex, Vertex Shader, Clip, Triangle Setup, Vertex Processing, Pixel Shader, Color Shader, Texture Shader, Pixel Processing, Depth Processing, Rendering, Final Pixel, V.S. Prog., P.S. Prog. Numbered markers place the results listed below on the pipeline.]

Research goal: study next-generation graphics processor architectures, proposing hardware architectures and a software verification environment in three main directions: pixel shader, texture, and depth processing. Current results are itemized below:

1. A dynamically reconfigurable graphics hardware for resource reallocatable rendering pipeline
2. A Reconfigurable Texture Mapping Architecture
3. Implementation of texture Compression by GPU Driver
4. Register Renaming for Pixel Shaders data/value management
5. Instruction scheduling mechanism for 3D GPU pixel shader
6. An Efficient Texture Memory System Designs
7. Alpha Blending without Z Sort


Superscalar ARM

Goal: a superscalar embedded processor featuring
- 800 MHz clock rate @ 0.13 um
- 1.8 DMIPS/MHz: superscalar performance under tough pipeline latency
- 800K gate count: cost-effective design

Directions and achievements
- Micro-architecture: a 12-stage dual-issue superscalar processor with good instruction fetch rate, issue rate, and efficient forwarding
- Simulator: a cycle-accurate simulator modeling more details than the well-known SimpleScalar simulator
- Compiler: working on the GCC machine description to optimize performance


Reconfigurable Computing

Motivations:
- Improving the design methodology of embedded system hardware
- Providing better performance at low development cost
- Shortening the time-to-market of SoC products

Research issues:
- Hardware/software partitioning
- Synthesis technology
- Reconfigurable processing element design

[Figure: Reconfigurable Architecture. Blocks shown: Processor (ARM7 / MIPS), On-Chip Mem / Cache Mem, Data Engine, Reconfigurable Logic, Configuration Controller, Main bus, Data bus, Memory Management Unit, External bus, Off-Chip Memory, Memory-mapped IO.]


Reconfigurable Computing (cont.)

[Figure: Detailed Design of Reconfigurable Architecture. An array of CM/PE cells sits between a Data Load Unit and a Data Store Unit, with a Configuration Interface, Input Dispatcher, Output Connector, and Configuration Memory (HighByte / LowByte configuration). Scalable design of PE: each PE is built from Computation Blocks (CB) with ports A, B, CI, CO, Y0, Y1, ShI1, ShI2, ShO1, ShO2.]

Published research results:
- Run-time Reconfigurable Scheduling of 3D-Rendering on a Reconfigurable System (CCCT'05)
- Design and Implementation of a Reconfigurable Hardware for Secure Embedded Systems (ASIACCS'06)
