What Is a Compiler When the Architecture Is Not Hard(ware)?
DESCRIPTION
What Is a Compiler When the Architecture Is Not Hard(ware)? Krishna V Palem. This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944.

TRANSCRIPT
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
What Is a Compiler When the Architecture Is Not Hard(ware)?
Krishna V Palem
This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944.
Portions of this presentation were given by the speaker as a keynote at ACM LCTES 2001 and as an invited talk at EMSOFT 2001.
Embedded Computing
Why?
What?
How?
The Nature of Embedded Systems
Visible computing: the view is of the end application
(hidden computing element)
Favorable Trends
Supported by Moore’s (second) law
– Computing power doubles every eighteen months
– Corollary: cost per unit of computing halves every eighteen months
From hundreds of millions to billions of units
Projected by market research firms (VDC) to be a 50 billion+ space over the next five years
High volume, relatively low per-unit $ margin
Embedded Systems Desiderata
Low power
– Long battery life
Small size or footprint
Real-time constraints
Performance comparable to or surpassing leading-edge COTS technology
Rapid time-to-market
Timing Example
Predictable timing behavior
Unpredictable timing behavior
Video-On-Demand
Current Art
Meet desiderata while overcoming NRE cost hurdles through volume
High migration inertia across applications
Long time to market
Vertical application domains
Industrial Automation
Telecom
Select computational kernels → ASIC
Subtle but Sure Hurdles
For Moore’s corollary to hold
– Non-recurring engineering (NRE) cost must be amortized over high volume
– Else prohibitively high per-unit costs
Implies “uniform designs” over large workload classes
– (E.g.) numerical, integer, signal processing
Demands of embedded systems
– “Non-uniform” or application-specific designs
– Per-application volume might not be high
– High NRE costs → infeasible cost/unit
– Time-to-market pressure
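The amortization arithmetic behind these hurdles is easy to make concrete. A minimal sketch — the dollar figures below are invented for illustration, not from the talk:

```python
def per_unit_cost(nre, marginal, volume):
    """Per-unit cost = amortized NRE + marginal manufacturing cost."""
    return nre / volume + marginal

# Illustrative figures: $10M of NRE, $5 marginal cost per chip.
high_volume = per_unit_cost(10_000_000, 5.0, 10_000_000)  # 10M units
low_volume = per_unit_cost(10_000_000, 5.0, 100_000)      # 100K units

print(high_volume)  # 6.0  -> NRE nearly vanishes per unit
print(low_volume)   # 105.0 -> NRE dominates, "infeasible cost/unit"
```

The same NRE that adds a dollar per unit at high volume adds a hundred dollars per unit at embedded-style volumes, which is the slide's point.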
The Embedded Systems Challenge
Sustain Moore’s corollary– Keep NRE costs down
Multiple application domains
3D Graphics
Industrial Automation
Medtronics E-Textiles Telecom
Rapidly changing application kernels in moderate volume
Custom computing solution meeting constraints
Responding Via Automation
Multiple application domains
3D Graphics
Industrial Automation
Medtronics E-Textiles Telecom
Rapidly changing application kernels in low volume
Automatic tools
Constraints: power, timing, size
Application-specific design
Three Active Approaches
Custom microprocessors
Architecture exploration and synthesis
Architecture assembly for reconfigurable computing
Custom Processor Implementation
High-performance implementation
Customized in silicon for a particular application domain
O(months) of design time
Once designed, programmable like standard processors
(Figure: design flow — application analysis → proprietary ISA / architecture specification → proprietary tools → fabricate processor → custom processor implementation; the application, in a language with custom extensions, goes through a compiler to a binary in the proprietary ISA. Time intensive.)
Tensilica, HP-ST Microelectronics approach
Architecture Exploration and Synthesis
The PICO Vision: Program In … Chip Out
Automatic synthesis of application-specific parallel / VLIW ULSI microprocessors and their compilers for embedded computing
“Computer design for the masses”
“A custom system architecture in 1 week, tape-out in 4 weeks”
B. Ramakrishna Rau, “The Era of Embedded Computing”, invited talk, CASES 2000.
Custom Microprocessors
(Figure: application(s) define the workload → analyze with an optimizing compiler → define an ISA extension (e.g. IA-64+) and compiler optimizations → design implementation → microprocessor (e.g. Itanium+).)
Application Specific Design
(Figure: a single application → program analysis → explore and synthesize implementations from a library of possible implementations, bypassing the ISA → VLIW core + non-programmable extension. The application-specific processor runs that single application, using extended EPIC compiler technology.)
The Compiler Optimization Trajectory
(Figure: the phases frontend and optimizer → determine dependences → determine independences → bind operations to function units → bind transports to busses → execute, with the compiler/hardware boundary falling progressively later for each architecture: superscalar, dataflow, independence architectures, VLIW, and TTA, where the compiler binds transports to busses.)
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9–50, May 1993.
What Is the Compiler’s Target “ISA”?
Target is a range of architectures and their building blocks
Compiler reaches into a constrained space of silicon
Explores architectural implementations
O(days – weeks) of design time
Exploration sensitive to application specific hardware modules
Fixed function silicon is the result
Verification NRE costs still there
One approach to overcoming time to market
Choices of Silicon
High-level design/synthesis → (EDIF) netlist
– Fixed silicon implementation (standard-cell design etc.)
– Emulated, i.e., reconfigurable target
Reconfigurable Computing
FPGAs As an Alternative Choice for Customization
Frequent (re)configuration and hence frequent recustomization
Fabrication process is steadily improving
Gate densities are going up
Performance levels are acceptable
Amortize large NRE investments by using COTS platform
Major Motivations
Poor compilation times
– Lack of correspondence between standard IR and final configurations
– Place and route inherently complex
Would like to have compile times of the order of seconds to minutes
Provide customization of data-path by automatic analysis and optimization in software
Adaptive EPIC
Adaptive Explicitly Parallel Instruction Computing. Surendranath Talla, Department of Computer Science, New York University. PhD thesis, May 2001. Janet Fabri award for outstanding dissertation.
Adaptive Explicitly Parallel Instruction Computing. Krishna V. Palem and Surendranath Talla, Courant Institute of Mathematical Sciences; Patrick W. Devaney, Panasonic AVC American Laboratories Inc. Proceedings of the 4th Australasian Computer Architecture Conference, Auckland, NZ, January 1999.
Compiler-Processor Interface
(Figure: source program → compiler → executable. The ISA fixes each instruction’s format and semantics — e.g. ADD: format, semantics; LD: format, semantics — plus registers, exceptions, ….)
Redefining the Processor-Compiler Interface
Let the compiler determine the instruction sets (and their realization on chip)
(Figure: source program → compiler → executable; the ISA now carries compiler-defined instructions — e.g. mysub: format, semantics; myxor: format, semantics.)
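As a toy illustration of what letting the compiler define instructions might look like, here is a sketch of a pass that folds a recurring opcode pattern into a single custom instruction (in the spirit of the slide’s “mysub”/“myxor”). The three-address IR shape, the pattern, and the fused name are all invented for this example, not the AEPIC compiler’s actual representation:

```python
# Toy pass: fold each back-to-back occurrence of a given opcode pattern
# into one custom instruction covering the same operands.

def fold_custom_ops(ir, pattern, custom_name):
    """ir: list of (opcode, *operands) tuples; pattern: tuple of opcodes."""
    out, i = [], 0
    while i < len(ir):
        window = tuple(op for op, *_ in ir[i:i + len(pattern)])
        if window == pattern:
            # Gather the operands of the fused region into one instruction.
            operands = [arg for _, *args in ir[i:i + len(pattern)] for arg in args]
            out.append((custom_name, *operands))
            i += len(pattern)
        else:
            out.append(ir[i])
            i += 1
    return out

ir = [("xor", "t1", "a", "b"), ("sub", "t2", "t1", "c"), ("add", "t3", "t2", "d")]
print(fold_custom_ops(ir, ("xor", "sub"), "myxorsub"))
# [('myxorsub', 't1', 'a', 'b', 't2', 't1', 'c'), ('add', 't3', 't2', 'd')]
```

A real compiler would match dataflow subgraphs rather than adjacent opcodes, and would also emit the configuration that realizes the new instruction on chip.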
EPIC Execution Model
(Figure: record of execution — ILP on fixed functional units.)
Adaptive EPIC Execution Model
(Figure: record of execution — ILP-1, reconfigure datapath, ILP-2 — on configured functional units.)
Adaptive EPIC
(Figure: design space plotting parallelism against approximate instruction packet size (0, 4, 16, 32, 64, 128–512, 1K–10K, 100K–1M, >1M), spanning early x86, simple pipelined/embedded, superscalar, SMT, vector, dataflow, TRACE (Multiscalar), SuperSpeculative, RAW, MultiChip CVH, EPIC/VLIW, TTA, DPGA, GARP, FPGA, RaPiD, and simple through complex ASICs — asking: what can be efficiently compiled for today?)
Adaptive EPIC
AEPIC = EPIC processing + Datapath reconfiguration
Reconfiguration through three basic operations:
– add a functional unit to the datapath
– remove a functional unit from the datapath
– execute an operation on resident functional unit
(Figure: RLA holding configured functional units F1 and F2.)
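The three operations above can be modeled abstractly. A minimal sketch, with an invented capacity limit and plain callables standing in for configured functional units (not the actual AEPIC interface):

```python
# Minimal model of the three AEPIC reconfiguration operations on a
# reconfigurable logic array (RLA). Capacity and unit names are invented.

class RLA:
    def __init__(self, capacity):
        self.capacity = capacity   # how many functional units fit at once
        self.resident = {}         # name -> callable implementing the FU

    def add_fu(self, name, fn):
        """Configure a functional unit into the datapath."""
        if len(self.resident) >= self.capacity:
            raise RuntimeError("RLA full: remove a unit first")
        self.resident[name] = fn

    def remove_fu(self, name):
        """Free the area occupied by a resident functional unit."""
        del self.resident[name]

    def execute(self, name, *args):
        """Run an operation on a resident functional unit."""
        return self.resident[name](*args)

rla = RLA(capacity=2)
rla.add_fu("F1", lambda a, b: a ^ b)           # custom XOR-style unit
rla.add_fu("F2", lambda a, b: (a - b) & 0xFF)  # custom SUB-style unit
print(rla.execute("F1", 0b1100, 0b1010))       # 6
rla.remove_fu("F1")                            # make room for another configuration
```

The compiler’s job, covered in the following slides, is deciding when to issue the add/remove operations so that the right units are resident when execute reaches them.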
AEPIC Machine Organization
(Figure: multi-context reconfigurable logic array with an array register file; a configuration cache hierarchy — level-1 configuration cache (L1) and L2; CRF; GPR/FPR/…; I-cache; fetch/decode (F/D).)
AEPIC Compilation
Key Tasks For AEPIC Compiler
Partitioning
– Identifying code sections that convert to custom instructions
Operation synthesis
– Efficient CFU implementations for identified partitions
Instruction selection
– Multiple CFU choices for code partitions
Resource allocation
– Allocation of C-cache and MRLA resources for CFUs
Scheduling
– When and where to schedule adaptive component instructions
Compiler Modules
(Figure: source program → front-end processing → partitioning → mapping (adaptive) → high-level optimizations (EPIC core) → pre-pass scheduling → resource allocation → post-pass scheduling → code generation → simulation; a machine description and a configuration library drive the flow, which yields performance statistics.)
Resource Allocation for Configurations
(Figure: the processor’s record of execution with configurations R, G, B whose live ranges overlap; the reconfigurable logic (RL) can accommodate only two configurations simultaneously.)
Managing the Reconfigurable Resource (contd.)
A simpler version relates to the register allocation problem
New parameters
– No need to save to memory on spill
– Configurations are immutable
– Load costs differ across configurations
– C-cache + MRLA = multi-level register file
Adapted register allocation techniques to configuration allocation
– Non-uniform sizes of configurations: graph multi-coloring
– Adapted Chaitin’s coloring-based techniques
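The multi-coloring view can be sketched greedily: each configuration needs several units (“colors”) of reconfigurable area, and configurations with overlapping live ranges may not share units. This is a simplification for illustration — a Chaitin-style allocator adds interference-graph simplification and spill/eviction cost ordering:

```python
# Greedy sketch of configuration allocation as graph multi-coloring.

def allocate(sizes, conflicts, total_units):
    """sizes: config -> units of area needed; conflicts: pairs of configs
    that are live simultaneously. Returns config -> set of unit indices,
    or None for configurations that must stay out (in the C-cache)."""
    assignment = {}
    for cfg in sorted(sizes, key=sizes.get, reverse=True):   # biggest first
        taken = set()
        for a, b in conflicts:
            other = b if a == cfg else a if b == cfg else None
            if other in assignment and assignment[other]:
                taken |= assignment[other]                   # units in use by a conflicting config
        free = [u for u in range(total_units) if u not in taken]
        assignment[cfg] = set(free[:sizes[cfg]]) if len(free) >= sizes[cfg] else None
    return assignment

# B and R never overlap, so they can reuse the same area; G conflicts with both.
sizes = {"B": 2, "G": 1, "R": 2}
conflicts = {("B", "G"), ("G", "R")}
print(allocate(sizes, conflicts, total_units=3))
```

Non-conflicting configurations B and R end up sharing units 0–1 while G takes unit 2, which is exactly the reuse the slide’s spill-free, immutable-configuration setting makes cheap.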
The Problem With Long Reconfiguration Times
Configuration load time
Configuration execution time
Speculating Configuration Loads
Mask reconfiguration times!
Need to know when and where to speculate
Configuration execution time
Speculated configuration loads: f1, f2
If f1 >> f2, do not speculate into the “red” empty load slot
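The when-and-where decision can be sketched as an expected-value test. The cost model below is invented for illustration, under one plausible reading of the slide’s f1/f2 as execution frequencies of the paths that do and do not need the configuration:

```python
# Hedged sketch: hoist a configuration load above a branch only if the
# expected latency hidden on the path that uses the configuration
# outweighs the expected wasted load on the paths that do not.

def should_speculate(f_use, f_skip, load_time, slack):
    """f_use/f_skip: frequencies of the using/skipping paths;
    slack: cycles of independent work available to overlap the load."""
    hidden = min(load_time, slack)          # latency we can actually mask
    return f_use * hidden > f_skip * load_time

# Heavily biased toward the using path: speculation pays off.
print(should_speculate(f_use=0.9, f_skip=0.1, load_time=100, slack=80))  # True
# Biased away from the using path: do not speculate into that slot.
print(should_speculate(f_use=0.1, f_skip=0.9, load_time=100, slack=80))  # False
```

The real compiler would draw f_use/f_skip from profile data in the record of execution and would also weigh contention for the load slot itself.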
Sample Compiler Topics
Configuration cache management
Power/clock rate vs. Performance tradeoff
Bit-width optimizations
More Generally: Architecture Assembly
An ISA view
Synthesis and other hardware design are done off-line
Much closer to compiler optimizations, implying faster compile times
(Figure: the “compiler” selects, assembles, and optimizes the program against a dynamically variable ISA; the architecture implementation is assembled from prebuilt implementations — datapath, storage, interconnect — built off-line via synthesis, place and route.)
Also applicable to yield fixed implementations in silicon