What Is a Compiler When the Architecture Is Not Hard(ware)?
DESCRIPTION
What Is a Compiler When the Architecture Is Not Hard(ware)? Krishna V Palem. This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944.

TRANSCRIPT
UCB – November 8, 2001
Krishna V Palem
Georgia Tech
What Is a Compiler When the Architecture Is Not Hard(ware)?
Krishna V Palem
This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944.
Portions of this presentation were given by the speaker as a keynote at ACM LCTES 2001 and as an invited talk at EMSOFT 2001.
Embedded Computing
Why?
What?
How?
The Nature of Embedded Systems
Visible computing: the view is of the end application
(hidden computing element)
Favorable Trends
Supported by Moore’s (second) law
– Computing power doubles every eighteen months
– Corollary: cost per unit of computing halves every eighteen months
From hundreds of millions to billions of units
Projected by market research firms (VDC) to be a 50 billion+ space over the next five years
High volume, relatively low per-unit $ margin
Embedded Systems Desiderata
Low power
– Long battery life
Small size or footprint
Real-time constraints
Performance comparable to or surpassing leading-edge COTS technology
Rapid time-to-market
Timing Example
Predictable timing behavior
Unpredictable timing behavior
Video-On-Demand
Current Art
Meet desiderata while overcoming NRE cost hurdles through volume
High migration inertia across applications
Long time to market
Vertical application domains
Industrial Automation
Telecom
Select computational kernels → ASIC
Subtle but Sure Hurdles
For Moore’s corollary to hold
– Non-recurring engineering (NRE) cost must be amortized over high volume
– Else prohibitively high per-unit costs
Implies “uniform designs” over large workload classes
– (E.g.) numerical, integer, signal processing
Demands of embedded systems
– “Non-uniform” or application-specific designs
– Per-application volume might not be high
– High NRE costs → infeasible cost/unit
– Time-to-market pressure
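The amortization arithmetic behind these hurdles is easy to make concrete. A minimal sketch — the dollar figures below are invented for illustration, not from the talk:

```python
def per_unit_cost(nre, marginal, volume):
    """Per-unit cost = amortized NRE + marginal manufacturing cost."""
    return nre / volume + marginal

# Illustrative figures: $10M of NRE, $5 marginal cost per chip.
high_volume = per_unit_cost(10_000_000, 5.0, 10_000_000)  # 10M units
low_volume = per_unit_cost(10_000_000, 5.0, 100_000)      # 100K units

print(high_volume)  # 6.0  -> NRE nearly vanishes per unit
print(low_volume)   # 105.0 -> NRE dominates, "infeasible cost/unit"
```

The same NRE that adds a dollar per unit at high volume adds a hundred dollars per unit at embedded-style volumes, which is the slide's point.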
The Embedded Systems Challenge
Sustain Moore’s corollary– Keep NRE costs down
Multiple application domains
3D Graphics
Industrial Automation
Medtronics E-Textiles Telecom
Rapidly changing application kernels in moderate volume
Custom computing solution meeting constraints
Responding Via Automation
Multiple application domains
3D Graphics
Industrial Automation
Medtronics E-Textiles Telecom
Rapidly changing application kernels in low volume
Automatic tools
Constraints: power, timing, size
Application-specific design
Three Active Approaches
Custom microprocessors
Architecture exploration and synthesis
Architecture assembly for reconfigurable computing
Custom Processor Implementation
High-performance implementation
Customized in silicon for a particular application domain
O(months) of design time
Once designed, programmable like standard processors
(Figure: design flow — application analysis → proprietary ISA / architecture specification → proprietary tools → fabricate processor → custom processor implementation; the application, in a language with custom extensions, goes through a compiler to a binary in the proprietary ISA. Time intensive.)
Tensilica, HP-ST Microelectronics approach
Architecture Exploration and Synthesis
The PICO Vision: Program In … Chip Out
Automatic synthesis of application-specific parallel / VLIW ULSI microprocessors and their compilers for embedded computing
“Computer design for the masses”
“A custom system architecture in 1 week, tape-out in 4 weeks”
B. Ramakrishna Rau, “The Era of Embedded Computing”, invited talk, CASES 2000.
Custom Microprocessors
(Figure: application(s) define the workload → analyze with an optimizing compiler → define an ISA extension (e.g. IA-64+) and compiler optimizations → design implementation → microprocessor (e.g. Itanium+).)
Application Specific Design
(Figure: a single application → program analysis → explore and synthesize implementations from a library of possible implementations, bypassing the ISA → VLIW core + non-programmable extension. The application-specific processor runs that single application, using extended EPIC compiler technology.)
The Compiler Optimization Trajectory
(Figure: the phases frontend and optimizer → determine dependences → determine independences → bind operations to function units → bind transports to busses → execute, with the compiler/hardware boundary falling progressively later for each architecture: superscalar, dataflow, independence architectures, VLIW, and TTA, where the compiler binds transports to busses.)
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9–50, May 1993.
What Is the Compiler’s Target “ISA”?
Target is a range of architectures and their building blocks
Compiler reaches into a constrained space of silicon
Explores architectural implementations
O(days – weeks) of design time
Exploration sensitive to application specific hardware modules
Fixed function silicon is the result
Verification NRE costs still there
One approach to overcoming time to market
Choices of Silicon
High-level design/synthesis → (EDIF) netlist
– Fixed silicon implementation (standard-cell design etc.)
– Emulated, i.e., reconfigurable target
Reconfigurable Computing
FPGAs As an Alternative Choice for Customization
Frequent (re)configuration and hence frequent recustomization
Fabrication process is steadily improving
Gate densities are going up
Performance levels are acceptable
Amortize large NRE investments by using COTS platform
Major Motivations
Poor compilation times
– Lack of correspondence between standard IR and final configurations
– Place and route inherently complex
Would like to have compile times of the order of seconds to minutes
Provide customization of data-path by automatic analysis and optimization in software
Adaptive EPIC
Adaptive Explicitly Parallel Instruction Computing. Surendranath Talla, Department of Computer Science, New York University. PhD thesis, May 2001. Janet Fabri award for outstanding dissertation.
Adaptive Explicitly Parallel Instruction Computing. Krishna V. Palem and Surendranath Talla, Courant Institute of Mathematical Sciences; Patrick W. Devaney, Panasonic AVC American Laboratories Inc. Proceedings of the 4th Australasian Computer Architecture Conference, Auckland, NZ, January 1999.
Compiler-Processor Interface
(Figure: source program → compiler → executable. The ISA fixes each instruction’s format and semantics — e.g. ADD: format, semantics; LD: format, semantics — plus registers, exceptions, ….)
Redefining the Processor-Compiler Interface
Let the compiler determine the instruction sets (and their realization on chip)
(Figure: source program → compiler → executable; the ISA now carries compiler-defined instructions — e.g. mysub: format, semantics; myxor: format, semantics.)
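As a toy illustration of what letting the compiler define instructions might look like, here is a sketch of a pass that folds a recurring opcode pattern into a single custom instruction (in the spirit of the slide’s “mysub”/“myxor”). The three-address IR shape, the pattern, and the fused name are all invented for this example, not the AEPIC compiler’s actual representation:

```python
# Toy pass: fold each back-to-back occurrence of a given opcode pattern
# into one custom instruction covering the same operands.

def fold_custom_ops(ir, pattern, custom_name):
    """ir: list of (opcode, *operands) tuples; pattern: tuple of opcodes."""
    out, i = [], 0
    while i < len(ir):
        window = tuple(op for op, *_ in ir[i:i + len(pattern)])
        if window == pattern:
            # Gather the operands of the fused region into one instruction.
            operands = [arg for _, *args in ir[i:i + len(pattern)] for arg in args]
            out.append((custom_name, *operands))
            i += len(pattern)
        else:
            out.append(ir[i])
            i += 1
    return out

ir = [("xor", "t1", "a", "b"), ("sub", "t2", "t1", "c"), ("add", "t3", "t2", "d")]
print(fold_custom_ops(ir, ("xor", "sub"), "myxorsub"))
# [('myxorsub', 't1', 'a', 'b', 't2', 't1', 'c'), ('add', 't3', 't2', 'd')]
```

A real compiler would match dataflow subgraphs rather than adjacent opcodes, and would also emit the configuration that realizes the new instruction on chip.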
EPIC Execution Model
(Figure: record of execution — ILP on fixed functional units.)
Adaptive EPIC Execution Model
(Figure: record of execution — ILP-1, reconfigure datapath, ILP-2 — on configured functional units.)
Adaptive EPIC
(Figure: design space plotting parallelism against approximate instruction packet size (0, 4, 16, 32, 64, 128–512, 1K–10K, 100K–1M, >1M), spanning early x86, simple pipelined/embedded, superscalar, SMT, vector, dataflow, TRACE (Multiscalar), SuperSpeculative, RAW, MultiChip CVH, EPIC/VLIW, TTA, DPGA, GARP, FPGA, RaPiD, and simple through complex ASICs — asking: what can be efficiently compiled for today?)
Adaptive EPIC
AEPIC = EPIC processing + Datapath reconfiguration
Reconfiguration through three basic operations:
– add a functional unit to the datapath
– remove a functional unit from the datapath
– execute an operation on resident functional unit
(Figure: RLA holding configured functional units F1 and F2.)
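The three operations above can be modeled abstractly. A minimal sketch, with an invented capacity limit and plain callables standing in for configured functional units (not the actual AEPIC interface):

```python
# Minimal model of the three AEPIC reconfiguration operations on a
# reconfigurable logic array (RLA). Capacity and unit names are invented.

class RLA:
    def __init__(self, capacity):
        self.capacity = capacity   # how many functional units fit at once
        self.resident = {}         # name -> callable implementing the FU

    def add_fu(self, name, fn):
        """Configure a functional unit into the datapath."""
        if len(self.resident) >= self.capacity:
            raise RuntimeError("RLA full: remove a unit first")
        self.resident[name] = fn

    def remove_fu(self, name):
        """Free the area occupied by a resident functional unit."""
        del self.resident[name]

    def execute(self, name, *args):
        """Run an operation on a resident functional unit."""
        return self.resident[name](*args)

rla = RLA(capacity=2)
rla.add_fu("F1", lambda a, b: a ^ b)           # custom XOR-style unit
rla.add_fu("F2", lambda a, b: (a - b) & 0xFF)  # custom SUB-style unit
print(rla.execute("F1", 0b1100, 0b1010))       # 6
rla.remove_fu("F1")                            # make room for another configuration
```

The compiler’s job, covered in the following slides, is deciding when to issue the add/remove operations so that the right units are resident when execute reaches them.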
AEPIC Machine Organization
(Figure: multi-context reconfigurable logic array with an array register file; a configuration cache hierarchy — level-1 configuration cache (L1) and L2; CRF; GPR/FPR/…; I-cache; fetch/decode (F/D).)
AEPIC Compilation
Key Tasks For AEPIC Compiler
Partitioning
– Identifying code sections that convert to custom instructions
Operation synthesis
– Efficient CFU implementations for identified partitions
Instruction selection
– Multiple CFU choices for code partitions
Resource allocation
– Allocation of C-cache and MRLA resources for CFUs
Scheduling
– When and where to schedule adaptive component instructions
Compiler Modules
(Figure: source program → front-end processing → partitioning → mapping (adaptive) → high-level optimizations (EPIC core) → pre-pass scheduling → resource allocation → post-pass scheduling → code generation → simulation; a machine description and a configuration library drive the flow, which yields performance statistics.)
Resource Allocation for Configurations
(Figure: the processor’s record of execution with configurations R, G, B whose live ranges overlap; the reconfigurable logic (RL) can accommodate only two configurations simultaneously.)
Managing the Reconfigurable Resource (contd.)
A simpler version relates to the register allocation problem
New parameters
– No need to save to memory on spill
– Configurations are immutable
– Load costs differ across configurations
– C-cache + MRLA = multi-level register file
Adapted register allocation techniques to configuration allocation
– Non-uniform sizes of configurations: graph multi-coloring
– Adapted Chaitin’s coloring-based techniques
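The multi-coloring view can be sketched greedily: each configuration needs several units (“colors”) of reconfigurable area, and configurations with overlapping live ranges may not share units. This is a simplification for illustration — a Chaitin-style allocator adds interference-graph simplification and spill/eviction cost ordering:

```python
# Greedy sketch of configuration allocation as graph multi-coloring.

def allocate(sizes, conflicts, total_units):
    """sizes: config -> units of area needed; conflicts: pairs of configs
    that are live simultaneously. Returns config -> set of unit indices,
    or None for configurations that must stay out (in the C-cache)."""
    assignment = {}
    for cfg in sorted(sizes, key=sizes.get, reverse=True):   # biggest first
        taken = set()
        for a, b in conflicts:
            other = b if a == cfg else a if b == cfg else None
            if other in assignment and assignment[other]:
                taken |= assignment[other]                   # units in use by a conflicting config
        free = [u for u in range(total_units) if u not in taken]
        assignment[cfg] = set(free[:sizes[cfg]]) if len(free) >= sizes[cfg] else None
    return assignment

# B and R never overlap, so they can reuse the same area; G conflicts with both.
sizes = {"B": 2, "G": 1, "R": 2}
conflicts = {("B", "G"), ("G", "R")}
print(allocate(sizes, conflicts, total_units=3))
```

Non-conflicting configurations B and R end up sharing units 0–1 while G takes unit 2, which is exactly the reuse the slide’s spill-free, immutable-configuration setting makes cheap.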
The Problem With Long Reconfiguration Times
Configuration load time
Configuration execution time
Speculating Configuration Loads
Mask reconfiguration times!
Need to know when and where to speculate
Configuration execution time
Speculated configuration loads: f1, f2
If f1 >> f2, do not speculate into the “red” empty load slot
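The when-and-where decision can be sketched as an expected-value test. The cost model below is invented for illustration, under one plausible reading of the slide’s f1/f2 as execution frequencies of the paths that do and do not need the configuration:

```python
# Hedged sketch: hoist a configuration load above a branch only if the
# expected latency hidden on the path that uses the configuration
# outweighs the expected wasted load on the paths that do not.

def should_speculate(f_use, f_skip, load_time, slack):
    """f_use/f_skip: frequencies of the using/skipping paths;
    slack: cycles of independent work available to overlap the load."""
    hidden = min(load_time, slack)          # latency we can actually mask
    return f_use * hidden > f_skip * load_time

# Heavily biased toward the using path: speculation pays off.
print(should_speculate(f_use=0.9, f_skip=0.1, load_time=100, slack=80))  # True
# Biased away from the using path: do not speculate into that slot.
print(should_speculate(f_use=0.1, f_skip=0.9, load_time=100, slack=80))  # False
```

The real compiler would draw f_use/f_skip from profile data in the record of execution and would also weigh contention for the load slot itself.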
Sample Compiler Topics
Configuration cache management
Power/clock rate vs. Performance tradeoff
Bit-width optimizations
More Generally: Architecture Assembly
An ISA view
Synthesis and other hardware design are done off-line
Much closer to compiler optimizations, implying faster compile times
(Figure: the “compiler” selects, assembles, and optimizes the program against a dynamically variable ISA; the architecture implementation is assembled from prebuilt implementations — datapath, storage, interconnect — built off-line via synthesis, place and route.)
Also applicable to yield fixed implementations in silicon