silicon architectures for wireless systems – part 2

66
Tutorial Tutorial HotChips HotChips 01 01 Jan M. Rabaey Jan M. Rabaey BWRC University of California @ Berkeley http://www.eecs.berkeley.edu/~jan Silicon Architectures for Silicon Architectures for Wireless Systems Wireless Systems Part 2 Part 2 Configurable Processors Configurable Processors With contributions from J. Wawrzynek and A. Dehon

Upload: others

Post on 01-Oct-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Silicon Architectures for Wireless Systems – Part 2

Tutorial Tutorial HotChipsHotChips 01 01

Jan M. RabaeyJan M. Rabaey

BWRC

University of California @ Berkeley

http://www.eecs.berkeley.edu/~jan

Silicon Architectures forSilicon Architectures for

Wireless Systems Wireless Systems –– Part 2 Part 2

Configurable ProcessorsConfigurable Processors

With contributions from J. Wawrzynek and A. Dehon

Page 2: Silicon Architectures for Wireless Systems – Part 2

The Energy-Flexibility GapThe Energy-Flexibility Gap

Embedded ProcessorsSA110

0.4 MIPS/mW

ASIPs

DSPs 2 V DSP: 3 MOPS/mW

Dedicated

HW

Flexibility (Coverage)

En

erg

y E

ffic

ien

cy

MO

PS

/mW

(or

MIP

S/m

W)

0.1

1

10

100

1000

Reconfigurable

Processor/Logic

Pleiades

10-80 MOPS/mW

Page 3: Silicon Architectures for Wireless Systems – Part 2

Session: DAC 2001Session: DAC 2001

Page 4: Silicon Architectures for Wireless Systems – Part 2

The Growth of ReconfigurableThe Growth of Reconfigurable

Source: Schaumont et al., DAC 2001

Page 5: Silicon Architectures for Wireless Systems – Part 2

(Re)configurable Computing:(Re)configurable Computing:

Merging Efficiency and VersatilityMerging Efficiency and Versatility

“Hardware” customized to

specifics of problem.Direct map of problem

specific dataflow, control.

Circuits “adapted” as

problem requirements

change.

Spatially programmed connection of processing elements.Spatially programmed connection of processing elements.

Page 6: Silicon Architectures for Wireless Systems – Part 2

Spatial vs. Temporal ComputingSpatial vs. Temporal Computing

Spatial Temporal

Source: A. Dehon and J. Wawrzynek

Page 7: Silicon Architectures for Wireless Systems – Part 2

Benefits of ProgrammableBenefits of Programmable

� Non-permanent customization andapplication development after fabrication– “Late Binding”

� economies of scale (amortize large, fixeddesign costs)

� time-to-market (evolving requirements andstandards, new ideas)

Disadvantages� Efficiency penalty (area, performance, power)

� Correctness Verification

Page 8: Silicon Architectures for Wireless Systems – Part 2

Spatial/Configurable BenefitsSpatial/Configurable Benefits

� 10x raw density advantage over processors(and increasing)

� Energy efficiency (potentially)

� Locality, regularity, and predictability

� Ultimate distributed architecture

� Scalable with technology– Relies mostly on increase in computational density

– Avoids most of the physics pitfalls threatening high-performance computing

Page 9: Silicon Architectures for Wireless Systems – Part 2

Spatial/Configurable Drawbacks

� Resource management

– Each compute/interconnect resource dedicated to singlefunction

– Must dedicate resources for every computational subtask

– Infrequently needed portions of a computation sit idle -->inefficient use of resources

– But … not a real issue when transistors are abundant

� Potential mismatch between operations andoperators

� Interconnect plays dominant role

Page 10: Silicon Architectures for Wireless Systems – Part 2

Density ComparisonDensity Comparison

Page 11: Silicon Architectures for Wireless Systems – Part 2

Processor vs. FPGA AreaProcessor vs. FPGA Area

Page 12: Silicon Architectures for Wireless Systems – Part 2

Processors and FPGAsProcessors and FPGAs

Page 13: Silicon Architectures for Wireless Systems – Part 2

Issues in Configurable DesignIssues in Configurable Design

� Choice and Granularity ofComputational Elements

� Choice and Granularity of InterconnectNetwork

� (Re)configuration Time and Rate– Fabrication time --> Fixed function devices

– Beginning of product use --> Actel/QuicklogicFPGAs

– Beginning of usage epoch -->(Re)configurable FPGAs

– Every cycle --> traditional Instruction SetProcessors

Page 14: Silicon Architectures for Wireless Systems – Part 2

Granularity of Computational ElementsGranularity of Computational Elements

In Out

00 0

01 1

10 1

11 0

2-LUT

Mem

In1 In2

Out

The FPGA Approach: The Logic Level

Page 15: Silicon Architectures for Wireless Systems – Part 2

Granularity of Computational ElementsGranularity of Computational Elements

ReconfigurableReconfigurable

LogicLogic

ReconfigurableReconfigurable

DatapathsDatapaths

adder

buffer

reg0

reg1

mux

CLB CLB

CLBCLB

DataMemory

InstructionDecoder

&Controller

DataMemory

ProgramMemory

Datapath

MAC

In

AddrGen

Memory

AddrGen

Memory

ReconfigurableReconfigurable

ArithmeticArithmetic

ReconfigurableReconfigurable

ControlControl

Bit-Level Operations

e.g. encoding

Dedicated data paths

e.g. Filters, AGU

Arithmetic kernels

e.g. Convolution

RTOS

Process management

Page 16: Silicon Architectures for Wireless Systems – Part 2

For Spatial ArchitecturesFor Spatial Architectures

� Interconnect dominant

– area

– power

– time

� …so need to understand in order to

optimize architectures

Page 17: Silicon Architectures for Wireless Systems – Part 2

Dominant in AreaDominant in Area

Page 18: Silicon Architectures for Wireless Systems – Part 2

Dominant in TimeDominant in Time

Page 19: Silicon Architectures for Wireless Systems – Part 2

Dominant in PowerDominant in Power

65%

21%

9%5%

Interconnect

Clock

IO

CLB

XC4003A data from Eric Kusse (UCB MS 1997)

Page 20: Silicon Architectures for Wireless Systems – Part 2

Interconnect Design IssuesInterconnect Design Issues

� Flexibility -- route “anything”

– (within reason?)

� Area -- wires, switches

� Delay -- switches in path, stubs, wire

length

� Power -- switch, wire capacitance

� Routability -- computational difficulty

finding routes

Page 21: Silicon Architectures for Wireless Systems – Part 2

A NaA Naïïve Approach: Crossbarve Approach: Crossbar

� Any operator may

consume output

from any other

operator

Page 22: Silicon Architectures for Wireless Systems – Part 2

Avoiding Crossbar CostsAvoiding Crossbar Costs

� Good architectural design

– Optimize for the common case

� Designs have spatial locality

� We have freedom in operator

placement

� Thus: Place connected components

“close” together

– don’t need full interconnect?

Page 23: Silicon Architectures for Wireless Systems – Part 2

Exploiting Locality Exploiting Locality –– The Mesh The MeshSwitch Box

Connect Box

Page 24: Silicon Architectures for Wireless Systems – Part 2

Meshes donMeshes don’’t scalet scale

Typical Extensions

� Local neighbor-to-neighbor Interconnections

� Segmented Interconnect

� Hierarchical Network (tree, mesh)

Page 25: Silicon Architectures for Wireless Systems – Part 2

Example 1:Example 1:

The Pleiades Reconfigurable ArchitectureThe Pleiades Reconfigurable Architecture

A Satellite ProcessorA Satellite Processor

Configuration

Dedicated

Arithmetic

Configuration Bus

Reconfigurable Interconnect Network

Embedded

Processor

FPGA MemoryAddress

Generator

Arithmetic

Processor

Arithmetic

Processor

..

..

Network Interface

� Computational kernels are “spawned” to satellite processors

� Control processor supports RTOS and reconfiguration

� Order(s) of magnitude energy-reduction over traditional

programmable architectures

Page 26: Silicon Architectures for Wireless Systems – Part 2

Matching Computation and ArchitectureMatching Computation and Architecture

AddressGen AddressGen

Memory Memory

MAC MAC

Control

Processor

L CG

Convolution

Two models of computation:communicating processes + data-flow

Two architectural models:sequential control+ data-driven

Page 27: Silicon Architectures for Wireless Systems – Part 2

Execution Model of a Data-FlowExecution Model of a Data-Flow

KernelKernel

for(i=1;i<=L;i++)

for(k=i;k<=L;k++)

phi[i][k]= phi[i-1][k-1]

+in[NP-i]*in[NP-k]

-in[NA-1-i]*in[NA-1-k];

endstart

Embedded processor

AddrGen

MEM: in

ALU

ALU

AddrGen

MEM: phi

MPY MPY

• Distributed control and memory

Code seg

Code seg

Page 28: Silicon Architectures for Wireless Systems – Part 2

Reconfigurable Kernels for W-Reconfigurable Kernels for W-

CDMACDMA

� Dominant kernel M(MTX)

requires array of MACs and

segmented memories

� Additional operations such as

sqrt(x), 1/x, and Trellis decoding

may be implemented using

FPGA or cordic satellite

Page 29: Silicon Architectures for Wireless Systems – Part 2

Impact of Architectural ChoiceImpact of Architectural Choice

1870

StrongARM

131

Normalized Energy / stage [nJ]

TMS320C2xx

Energy/stage

49

TMS320LC54x

1000

100

10000

10

21u

StrongARM

10u

Normalized Delay/stage [s]

TMS320C2xx

Delay/stage

3.8u

TMS320LC54x

10u

1u

100u

100n

18.5

TMS320LC54x

Normalized Energy*Delay / stage [Js*e-14]

10

1

100

1000 Energy*Delay/stage

137

TMS320C2xx

0.1

3970

StrongARM

10000

Example: 16 point Complex

Radix-2 FFT (Final Stage)

13

570n 0.75

Ple

iad

es

Ple

iad

es

Ple

iad

es

Page 30: Silicon Architectures for Wireless Systems – Part 2

Architecture ComparisonArchitecture Comparison

LMS LMS Correlator Correlator at 1.67 at 1.67 MSymbolsMSymbols Data Rate Data Rate

Complexity: 300 Complexity: 300 MmultMmult/sec and 357 /sec and 357 MaccMacc/sec/sec

Note: TMS implementation requires 36 parallel processors to meet data rate -

validity questionable

16 Mmacs/mW!

Page 31: Silicon Architectures for Wireless Systems – Part 2

Inter-Satellite CommunicationInter-Satellite Communication� Data-driven execution

– A satellite processor is enabled only when input data is ready

� Data sources generate data of different types: scalars,vectors, matrices

� Data computing processors handle data inputs of differenttypes end-of-vector token

MPY

MPY1

nn

MACn

n1

AddrGen Memory

Embedded

processor

1

11

Data sources Data computing processors

Page 32: Silicon Architectures for Wireless Systems – Part 2

AGP SatelliteAGP Satellite� Address generator for the SRAM satellite

� Generates data streams of different types; all other satellites

process data streams

� Uses loop counters and stride counters to support 2 levels of

nesting

� Control information sent in parallel with the data using 2

additional control bits

for n=1 to 3 {

for k=1 to 3 {

addr_read(n,k);

}

} End of vector

End of matrix

Page 33: Silicon Architectures for Wireless Systems – Part 2

Satellite Processors: FPGASatellite Processors: FPGA

� Reconfigurable for both logic

function and interface control

� 4 x 9 CLB array in total

� 5-input 3-output CLBs

� 3 levels of interconnect hierarchy

� Mapped to various arithmetic

functions and control

� Programmable clock generator

4 x 8

FPGA Array

4 x 1

Interface Control

IN1

IN2

OUT

REQ

ACK

REQ

ACK

Programmable delay

Page 34: Silicon Architectures for Wireless Systems – Part 2

Low-Energy Embedded FPGALow-Energy Embedded FPGA

� Test chip

– 8x8 CLB array

– 5 in - 3 out CLB

– 3-level interconnect hierarchy

– 4 mm2 in 0.25 µm ST CMOS

– 0.8 and 1.5 V supply

� Simulation Results

– 125 MHz Toggle Frequency

– 50 MHz 8-bit adder

– energy 70 times lower than

comparable Xilinx

� Parameterized module

generator available

Page 35: Silicon Architectures for Wireless Systems – Part 2

Reconfigurable Interconnect NetworkReconfigurable Interconnect Network

Universal Switchbox

Cluster

Cluster

Level-1 Mesh Level-2 Mesh

Irregular mesh for Heterogeneous blocksIrregular mesh for Heterogeneous blocks

� A channel along every side of each block

� A switch box at every cross-point

Hierarchical Switchbox

Building hierarchy by clusteringBuilding hierarchy by clustering

� Intra-cluster: mesh structure

� Inter-cluster: larger-granularity mesh

Saves energy by a factor of 7 comparedSaves energy by a factor of 7 compared

to straightforward crossbar network!to straightforward crossbar network!

Page 36: Silicon Architectures for Wireless Systems – Part 2

Fast Design Space ExplorationFast Design Space Exploration

Interconnect ModelsInterconnect Models

N Inputs

B Buses

M Outputs

Multi-Bus

cluster

cluster

cluster

Hierarchical MeshMesh

Module

Model:Model:

�� Interconnect energy and delay model Interconnect energy and delay model

�� Algorithm mapping Algorithm mapping

�� Graph-based place and route Graph-based place and route

Page 37: Silicon Architectures for Wireless Systems – Part 2

Reconfiguration ModelReconfiguration Model

� Configuration codes are created statically at compile time

� Every configuration memory is reset and rewritten with new

configuration code before each kernel

� The core processor uses memory read/write instructions to

perform the reconfiguration

mem

Hardware modules

Reconfiguration

Interface Unit

Configuration

codes

Distributed

configuration memories

mem mem

mem

Core processor

Page 38: Silicon Architectures for Wireless Systems – Part 2

Kernel Execution & ConfigurationKernel Execution & Configuration

ARM8

SRAM

Interface

AGP1 MEM1

AGP2 MEM2

MAC1

OPORT1

IPORT1Interconnect

Page 39: Silicon Architectures for Wireless Systems – Part 2

MaiaMaia: Reconfigurable : Reconfigurable BasebandBaseband

Processor for WirelessProcessor for Wireless

� 0.25um tech: 4.5mm x 6mm

� 1.2 Million transistors

� 40 MHz at 1V

� 1 mW VCELP voice coder

� Hardware

� 1 ARM-8

� 8 SRAMs & 8 AGPs

� 2 MACs

� 2 ALUs

� 2 In-Ports and 2 Out-Ports

� 14x8 FPGA

Page 40: Silicon Architectures for Wireless Systems – Part 2

Results of VCELP Voice CoderResults of VCELP Voice Coder

79.7% of VCELP Code maps

onto Reconfigurable Datapath

VCELP code breakdown VCELP Energy breakdown

Compared to state-of-art 17mW DSP

Functionality Energy (mJ) for 1 sec

of VCELP speech

processing

Dot product 0.738

FIR filter 0.131

IIIR filter 0.021

Vector sum with

scalar multiply

0.042

Compute code 0.011

Kernels

Covariance matrix

compute

0.006

Program control 0.838

Total 1.787

Page 41: Silicon Architectures for Wireless Systems – Part 2

Design Methodology and FlowDesign Methodology and Flow

� Requires architecture exploration overheterogeneous implementation fabrics

� Should support refinement and co-designof hardware and software, as well asbehavior and architecture

� Should consider all important metrics, andpresent PDA (Power-Delay-Area)perspective

Page 42: Silicon Architectures for Wireless Systems – Part 2

Software Methodology FlowSoftware Methodology Flow

Algorithms

Kernel Detection

Estimation/Exploration

Partitioning

Software CompilationReconfig. Hardware Mapping

Interface Code Generation

Power & Timing Estimation

of Various Kernel Implementations

PDA Models

PremappedKernels

Acceleratorµproc &

Behavioraln

C++ Module Libraries

C++

SUIF+ C-IF

Page 43: Silicon Architectures for Wireless Systems – Part 2

Hardware-Software ExplorationHardware-Software Exploration

Macromodel call

Page 44: Silicon Architectures for Wireless Systems – Part 2

Industrial Example 1:Industrial Example 1:

XtensaXtensa Configurable Processor Configurable Processor

Source: Tensilica, Inc

Combines spatial and

temporal processing

Small core: 0.7 mm2 in 0.18 mm

~ 3 MIPS/mW

Core processor with extendible instruction set

Page 45: Silicon Architectures for Wireless Systems – Part 2

Design Methodology Design Methodology –– a Crucial Component a Crucial Component

Source: Tensilica, Inc

Page 46: Silicon Architectures for Wireless Systems – Part 2

Example: A DES Encryption ExtensionExample: A DES Encryption Extension

4 extra instructions

1700 additional gates

No cycle time impact

Code size reduction

Source: Tensilica, Inc

Page 47: Silicon Architectures for Wireless Systems – Part 2

Improvement over GP 32-bit processorImprovement over GP 32-bit processor

Source: Tensilica, Inc

Page 48: Silicon Architectures for Wireless Systems – Part 2

Industrial Example 2: Chameleon RCPIndustrial Example 2: Chameleon RCP(Reconfigurable Communications Processor)(Reconfigurable Communications Processor)

24 multipliers

128 DPUsSource: Chameleon, Inc

Page 49: Silicon Architectures for Wireless Systems – Part 2

Reconfigurable Processing FabricReconfigurable Processing Fabric

Source: Chameleon, Inc

Page 50: Silicon Architectures for Wireless Systems – Part 2

CS2000 Performance NumbersCS2000 Performance Numbers

Page 51: Silicon Architectures for Wireless Systems – Part 2

CS2000 PerformanceCS2000 Performance

Source: Chameleon, Inc

Power efficiency?

Page 52: Silicon Architectures for Wireless Systems – Part 2

Design MethodologyDesign Methodology

Source: Chameleon, Inc

Page 53: Silicon Architectures for Wireless Systems – Part 2

Industrial Example 3:Industrial Example 3:The The MorphICs MorphICs Dynamically Reconfigurable Architecture (DRA)Dynamically Reconfigurable Architecture (DRA)

DSP

Core

Memory

MCU

Core

WCDMA

CDMA

IS-136

GSM

Fixed logic…

DRA ProcessorDRA Processor

Software programmable

Hardware reconfigurable

Software

Download

WCDMA (mode, param)

CDMA (mode, param)

WTDMA (mode, param)

TDMA (mode, param)

� SIM CardSIM Card

�� Handset Memory Handset Memory

�� POS Programming POS Programming

�� Network Download Network Download

�� OTA Download OTA Download

Realizes Realizes cost, size and power targets similar to traditional core+hardwiredcost, size and power targets similar to traditional core+hardwired

Source: Morphics Technology

Page 54: Silicon Architectures for Wireless Systems – Part 2

Basestation Basestation of the Next Generation Wirelessof the Next Generation Wireless

Wideband

RF

Data

Networks

10/

100

or

Gbit

ATM

800 MHz

A B A B

RF/IF

Tuner

Block-Spectrum

A/D

RF/IF

Tuner

Block-Spectrum

A/D

RF/IF

Tuner

Block-Spectrum

A/D

Antenna

System

multiple sectors

multi-band

1900 MHz

A D B CFE

RF/IF

Tuner

Block-Spectrum

A/D

RF/IF

Tuner

Block-Spectrum

A/D

RF/IF

Tuner

Block-Spectrum

A/D

Antenna

System

multiple sectors

multi-band

Page 55: Silicon Architectures for Wireless Systems – Part 2

The common approach to hardware designinvolves:

multiple ASIC’s to support each standard.

� Hardwired implementation is not scalable or upgradeable to new standards.

� This approach costs time in a time-to-market dominated world.

� Creating new chipsets for every technology combination critically challenges

available design resources!

HW HW MultistandardMultistandard Solutions Solutions

Digital

Hardwired

ASICIF RF

Digital

Hardwired

ASICIF RF

Digital

Hardwired

ASICIF RF

AnalogProgrammable Unique

Combinations

DSP

Control Processor

Page 56: Silicon Architectures for Wireless Systems – Part 2

SW SW MultistandardMultistandard Solution SolutionApplying instruction-set processor architectures to

all baseband processing would be desireable...

…but is simply not an good implementation for base stations:-Unacceptably high cost per channel

-Unacceptably large power per channel

This is definitely not a viable implementation for terminals

AnalogProgrammable

IF RF

IF RF

IF RF

DSP

Control Processor

Page 57: Silicon Architectures for Wireless Systems – Part 2

FPGA the Solution?FPGA the Solution?

Cellular Handset Using Current FPGACellular Handset Using Current FPGA

Source: Morphics Technology

Page 58: Silicon Architectures for Wireless Systems – Part 2

Define optimal architecture for efficient

implementation

Successfully Using ReconfigurabilitySuccessfully Using Reconfigurability

Application-Specific Leverage

Focus on first on applications and constituent algorithms, not the

silicon architecture !

Wireless Communications Transceiver Signal Processing

Minimize the hardware reconfigurability to constrained set

Maximize the software parameterizability and ease of use of the

programmer’s model for flexibility

Source: Morphics Technology

Page 59: Silicon Architectures for Wireless Systems – Part 2

Application-Specific MOPS in DigitalApplication-Specific MOPS in Digital

CommunicationsCommunications

RF/IF

TDMA

Wideband Signal

Processing Engine

CDMA

Wideband Signal

Processing Engine

Wideband Channel

Decoder Engine

Programmable

DSP

Microprocessor

Digital

Downconversion

and

Channelization

Source: Morphics Technology

Page 60: Silicon Architectures for Wireless Systems – Part 2

MorphicsMorphics’’ DRL Architecture DRL ArchitectureHeterogeneous Multiprocessing Engine Using Application-

Specific Reconfigurable Logic

m

R

m

O

Clk

Enableoutput

input

input

m

R

m

O

Clk

Enableoutput

input

input

m

R

m

O

Clk

Enable

output

input

input

Small Granularity Kernel

Large Granularity Kernel

DA

TA

FL

OW

Source: Morphics Technology

Page 61: Silicon Architectures for Wireless Systems – Part 2

DRL KernelsDRL Kernels

DATA SEQUENCER

DATA MEMORY

PARAMETERIZABLE

CONFIGURABLE

ALU

Source: Morphics Technology

Page 62: Silicon Architectures for Wireless Systems – Part 2

Key Pieces of Design MethodologyKey Pieces of Design MethodologySystem-level Profiling� Analyze sequences of operations (arithmetic, memory access, etc)

� Analyze communication bottlenecks

� Key flexible parameters (algorithm v architecture parameters)

Architecture-level Profiling� ALU/kernel definition (sequences of operators)

� Memory profile

� Type of configurability required for flexibility

� Macro-sequencer development

Implementation� SW- programmer’s model developed at architecture specification stage

� SW- API proven out via behavioral models & demonstrator hardware

� VLSI-focus on regular predictable timing and routability

� VLSI- embedded reconfigurability in an ASIC flow

Source: Morphics Technology

Page 63: Silicon Architectures for Wireless Systems – Part 2

Architectural ImplementationArchitectural Implementation

Comparison: Reconfigurable FFTComparison: Reconfigurable FFT

Energy per Transform

vs. FFT size

Transforms per Second per mm2

vs. FFT size

* All results are scaled to 0.18µm

101

102

103

103

104

105

106

107

108

Function-specific reconfigurable hardware

Data-path reconfigurable processor

FPGA

Low-power DSP

High-performance DSP

101

102

103

10-10

10-9

10-8

10-7

10-6

10-5

10-4

10-3

Lower limit

Function-specific reconfigurable hardware

Data-path reconfigurable processor

FPGA

Low-power DSP

High-performance DSP

Page 64: Silicon Architectures for Wireless Systems – Part 2

Architectural Implementation Comparison:Architectural Implementation Comparison:

ReconfigurableReconfigurable Viterbi Viterbi Decoder Decoder

Energy per Decoded Bit

vs. Number of States

Decoding Rate per mm2

vs. Number of States

* All results are scaled to 0.18µm

101

102

10-11

10-10

10-9

10-8

10-7

10-6

10-5

10-4

Lower limit

Function-specific reconfigurable hardware

Data-path reconfigurable processor

FPGA

Low-power DSP

High-performance DSP

101

102

104

105

106

107

108

109

1010

Function-specific reconfigurable hardware

Data-path reconfigurable processor

FPGA

Low-power DSP

High-performance DSP

Page 65: Silicon Architectures for Wireless Systems – Part 2

Source: Source: SchaumontSchaumont et al, DAC 2001 et al, DAC 2001

Page 66: Silicon Architectures for Wireless Systems – Part 2

SummarySummary

� Configurable computing is finding its way into the

embedded processor space

� Best suited (so far) for

– Flexible I/O and Interface functionality

– providing task-level acceleration of “parametizable” functions

� Software flow still subject to improvement

DO NOT FORGET CONFIGURATION OVERHEAD