TRANSCRIPT
Jessica Vandebon, José G. F. Coutinho, Wayne Luk, Thomas Chau
Transparent Heterogeneous Cloud Acceleration
➔ Goal: make heterogeneous compute resources accessible to resource-oblivious cloud applications
➔ ORIAN extends the homogeneous PaaS execution model with:
  1. transparent acceleration: automatic runtime mapping of jobs to the best worker
  2. heterogeneous elasticity: automatic vertical (type) and horizontal (quantity) scaling to support QoS while minimising cost
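The two mechanisms above can be pictured as a small scheduling policy. This is a hypothetical sketch, not ORIAN's actual interface: the worker types, prices, estimated runtimes, and the 4-jobs-per-worker scaling rule are all invented for illustration.

```python
# Hypothetical sketch of ORIAN-style transparent mapping and heterogeneous
# elasticity. Worker types, runtimes, and prices below are made up.
WORKERS = {
    "cpu":  {"est_runtime_s": 40.0, "price_per_s": 0.001},
    "gpu":  {"est_runtime_s": 8.0,  "price_per_s": 0.006},
    "fpga": {"est_runtime_s": 12.0, "price_per_s": 0.003},
}

def map_job(deadline_s):
    """Transparent acceleration: pick the cheapest worker type that meets QoS."""
    feasible = {t: w for t, w in WORKERS.items()
                if w["est_runtime_s"] <= deadline_s}
    if feasible:
        # cheapest feasible worker: minimise estimated runtime * price
        return min(feasible, key=lambda t: feasible[t]["est_runtime_s"]
                                           * feasible[t]["price_per_s"])
    # nothing meets the deadline: fall back to the fastest worker type
    return min(WORKERS, key=lambda t: WORKERS[t]["est_runtime_s"])

def scale_out(queue_len, active, deadline_s):
    """Heterogeneous elasticity: choose the best type (vertical) and grow the
    pool as load increases (horizontal), under a toy 4-jobs-per-worker rule."""
    best = map_job(deadline_s)
    target = max(1, queue_len // 4)
    return best, max(active, target)
```

With a tight deadline only the GPU qualifies; with a loose one the cheaper FPGA wins, which is the vertical-scaling decision in miniature.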
CRbS: A Code Reordering Based Speeding-up Method of Irregular Loops on CMP
Li Yuancheng, Shi Jiaqi
• CMP (chip multiprocessor), used to improve throughput and speed up multithreaded applications, is becoming more and more commonplace.
• It is difficult for compilers to automatically parallelize irregular single-threaded applications, which have complex data dependences caused by non-linear subscripts, pointers, or function calls within code sections.
Fig. 1 shows the execution model. From Fig. 1(d) we can see that when a RAW (read-after-write, detectable by hardware mechanisms) violation occurs, the speculative thread is restarted. We therefore adopt a code reordering method to reduce RAW violations.
By means of code reordering, we cut down the data dependences by moving the key statements that cause them. Fig. 2 shows the experimental results, which indicate that the proposed method is effective.
Fig. 1 Sequential vs. SpMT parallel execution
Fig. 2 Speedup with the code-reordering method
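The effect of reordering on RAW violations can be sketched with a toy timing model; the statement positions, spawn gap, and restart rule below are illustrative assumptions, not the paper's hardware model.

```python
# Toy model of SpMT (speculative multithreading) RAW squashes: iteration i
# writes a shared value at statement index `write_pos`, iteration i+1 reads
# it at `read_pos`, and thread i+1 starts `spawn_gap` steps after thread i.
def raw_squashes(write_pos, read_pos, n_iters, spawn_gap=1):
    """Count restarts: the read executes at global time spawn_gap + read_pos,
    the write at time write_pos; if the read comes first, hardware detects a
    RAW violation and the speculative thread is restarted."""
    squashes = 0
    for _ in range(n_iters - 1):
        if spawn_gap + read_pos < write_pos:
            squashes += 1
    return squashes

# Original loop body: the key write sits late (index 8), the read early.
before = raw_squashes(write_pos=8, read_pos=0, n_iters=100)
# After reordering, the key write is hoisted to index 1: no violation remains.
after = raw_squashes(write_pos=1, read_pos=0, n_iters=100)
```

Hoisting producer statements earlier (or sinking consumer statements later) shortens the window in which a speculative read can race ahead of the write, which is the intuition behind reordering the "key codes".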
Impact of Structural Faults on Neural Network Performance
Krishna Teja Chitty-Venkata and Arun Somani
Iowa State University, Ames, IA, USA
Dependable Computing and Networking Laboratory
Fig. Row and column faults in a systolic-array weight matrix
Row faults: DNNs are resistant to row faults to an extent.
Column faults: they have a very high impact on network accuracy (if the faulty column is used).
Mitigation strategies: matrix transpose, array reduction
Impact of Faults
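One intuition behind the row/column asymmetry can be shown by injecting stuck-at-zero faults into a toy weight matrix; the mapping of array faults to matrix rows and columns here is a simplifying assumption, not the paper's systolic-array model.

```python
# Toy illustration (not the paper's experiment): inject a stuck-at-zero row
# or column fault into a weight matrix W and see which outputs of y = W x change.
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def inject_row_fault(W, i):
    return [[0.0] * len(row) if r == i else row for r, row in enumerate(W)]

def inject_col_fault(W, j):
    return [[0.0 if c == j else w for c, w in enumerate(row)] for row in W]

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
x = [1.0, 1.0, 1.0]

y      = matvec(W, x)
y_rowf = matvec(inject_row_fault(W, 1), x)  # kills only output neuron 1
y_colf = matvec(inject_col_fault(W, 1), x)  # perturbs every output using col 1
row_hit = [i for i, (a, b) in enumerate(zip(y, y_rowf)) if a != b]
col_hit = [i for i, (a, b) in enumerate(zip(y, y_colf)) if a != b]
```

A row fault silences a single neuron, which a trained network can often absorb; a column fault corrupts every output that consumes that input. The transpose mitigation follows directly: storing W transposed turns a harmful column fault into a more tolerable row fault.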
Energy-Efficient Near Sensor Convolution using Pulsed Unary Processing
M. Hassan Najafi, S. Rasoul Faraji, Kia Bazargan, and David Lilja
• Near-sensor convolution has many applications in IoT
• Conventional fixed-point designs: fast and accurate, but complex and costly
• SC-based designs: low-cost, but low accuracy, long latency, high energy consumption
• Analog-to-digital (ADC) and analog-to-stochastic (ASC) converters are also costly
❖ Traditional designs are inefficient for near-sensor processing
• Pulsed Unary Processing
  • Introduced recently for high-performance processing using stochastic logic
  • Inputs are represented using PWM signals; the duty cycle determines the value
• Proposed Design
  • Input data are converted to inharmonic PWM signals
  • AND gates are used to perform the multiplications
  • Outputs of the AND gates are accumulated using an active integrator
• Evaluation
  • Significant improvement in hardware area cost, power, and energy consumption
  • Avoids costly ADCs
ASAP 2019
Example of multiplication using PWM signals
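The PWM multiplication idea can be checked with a discrete-time sketch; the periods and duty cycles below are illustrative, not the paper's circuit parameters.

```python
# Sketch of PWM-based multiplication: two periodic pulse trains with
# relatively prime ("inharmonic") periods are ANDed; over a full common
# period, the overlap fraction equals the product of the duty cycles.
from math import gcd

def pwm(duty_high, period, length):
    """Discrete PWM signal: high for the first `duty_high` slots of each period."""
    return [1 if t % period < duty_high else 0 for t in range(length)]

p1, p2 = 5, 8                 # relatively prime periods
assert gcd(p1, p2) == 1
a = pwm(3, p1, p1 * p2)       # duty cycle 3/5 = 0.6
b = pwm(2, p2, p1 * p2)       # duty cycle 2/8 = 0.25
anded = [x & y for x, y in zip(a, b)]     # one AND gate per time slot
product = sum(anded) / (p1 * p2)          # the integrator's accumulation
```

Because the periods are coprime, every phase combination of the two signals occurs exactly once over the common period, so the AND overlap gives the product 0.6 × 0.25 = 0.15 exactly rather than statistically.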
An Efficient Application Specific Instruction Set Processor (ASIP) for Tensor Computation
Huang Weipei, CHEUNG Chak-Chung Ray, YAN Hong
• Definition - a processor whose instruction set is augmented and tailored towards a specific application.
• Advantage - programmable, more flexible, lower cost and shortened implementation time; higher power efficiency
• Current application-specific processors - the TPU invented by Google; Cambricon (China) released the Cambrian processor for computer vision and AI
Fig. Typical design flow on an ASIP system: users handle computation partition, program design, and the algorithm flow in software, keeping non-critical computations there, while tensor-matrix and tensor-vector operations are issued to the hardware as orders, data, and instructions.
Fig. ASIP architecture for tensor processing: a CPU interface supplies input instructions and size information (start/finish handshake); a control unit with instruction fetch, decoder, instruction buffer, size buffer, and size register issues the current instruction and control signals; an address generator and memory interface drive the processing modules over the address and data buses.
Running at 111 MHz; power consumption: 1.87 W; ~14 ms for CP decomposition of a 40×40×1200 tensor
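The tensor kernel being accelerated is the CP (CANDECOMP/PARAFAC) model; the sketch below reconstructs a tensor from its factor matrices at toy sizes, as a generic illustration rather than the paper's implementation.

```python
# Sketch of the CP (CANDECOMP/PARAFAC) model: a rank-R decomposition writes
# a 3-way tensor T as a sum of R outer products of factor-matrix columns.
# Sizes here are tiny (the paper reports a 40x40x1200 tensor on the ASIP).
def cp_reconstruct(A, B, C):
    """T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r]."""
    I, J, K = len(A), len(B), len(C)
    R = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(R))
              for k in range(K)] for j in range(J)] for i in range(I)]

# Rank-1 example: T is the outer product of three vectors.
A = [[1.0], [2.0]]            # 2 x 1 factor matrix
B = [[3.0], [4.0]]            # 2 x 1
C = [[5.0], [6.0]]            # 2 x 1
T = cp_reconstruct(A, B, C)   # e.g. T[1][0][1] = 2 * 3 * 6
```

The inner tensor-matrix and tensor-vector products this loop nest hides are exactly the operations the ASIP's custom instructions target.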
MITRACA: Manycore Interlinked Torus Reconfigurable Accelerator Architecture
The 30th IEEE International Conference on
Application-specific Systems, Architectures and Processors
Riadh Ben Abdelhamid, Yoshiki Yamaguchi and Taisuke Boku
University of Tsukuba, Japan
Graduate School of Systems and Information Engineering
Department of Computer Science
July 15, 2019
Overview of the proposed overlay architecture
DeltaNet: Differential Binary Neural Network
Yuka Oba*, Kota Ando*, Tetsuya Asai*, Masato Motomura†, and Shinya Takamaeda-Yamazaki*‡
(*Hokkaido University, †Tokyo Institute of Technology, ‡JST PRESTO)
BNN's disadvantage: information loss
Idea: binarization of the difference
Make effective use of the information before binarization; discarding it would be MOTTAINAI!! ("what a waste!")
Proposed method
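One way to picture "binarization of the difference" is a delta-modulation-style loop; this is a loose illustration of the motivation, with an invented step size, and not DeltaNet's actual training or inference scheme.

```python
# Loose illustration (not DeltaNet's exact scheme): instead of binarizing
# each value outright and discarding its magnitude, binarize the *difference*
# against a running reconstruction, so pre-binarization information keeps
# influencing later bits.
def binarize_plain(xs):
    return [1.0 if x >= 0 else -1.0 for x in xs]

def binarize_diff(xs, step=0.25):
    bits, recon, y = [], [], 0.0
    for x in xs:
        b = 1.0 if x - y >= 0 else -1.0   # binarize the difference, not x
        y += step * b                      # reconstruction tracks the signal
        bits.append(b)
        recon.append(y)
    return bits, recon

xs = [0.3, 0.35, 0.4, 0.45, 0.5, 0.55]
plain = binarize_plain(xs)                 # all +1: magnitudes are lost
bits, recon = binarize_diff(xs)            # recon stays close to the signal
```

Plain binarization maps the whole slowly-rising sequence to a constant +1, while the difference bits let the receiver reconstruct an approximation of the original values.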
Using Residue Number Systems to Accelerate Deterministic Bit-stream Multiplication
Kamyar Givaki, Reza Hojabr, M. Hassan Najafi, Ahmad Khonsari, M. H. Gholamrezayi, Saeid Gorgin, Dara Rahmati
• Deterministic methods of SC
  • Conventional SC: low accuracy, long latency, and high energy consumption
  • Recently proposed deterministic methods of SC: fully accurate, but still long latency and high energy consumption
  • These methods guarantee that each bit of the first bit-stream meets each bit of the second bit-stream exactly once
❖ Conventional SC designs are not suitable for applications that require accurate computation
• Three deterministic methods of SC
  • Rotation of bit-streams
  • Clock-dividing bit-streams
  • Using bit-streams with relatively prime lengths
• Proposed Design
  • Combines the idea of deterministic SC with the Residue Number System (RNS)
  • Residues have lower bit-widths -> exponentially shorter bit-streams
  • The final results are in RNS format -> they need to be converted back to binary
  • Number generators can be shared between all of the computation lanes
• Evaluation
  • Smaller bit-stream generators
  • Significant improvement in computation time and energy consumption compared to previously proposed deterministic methods of SC
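The RNS side of the idea can be sketched in a few lines; the moduli below are illustrative, and the residue multiplications (done here directly) stand in for the operations the design would carry out with short unary bit-streams.

```python
# Sketch of the RNS idea: multiply small residues independently (each would
# use an exponentially shorter bit-stream in the proposed design), then
# convert the RNS result back to binary via the Chinese Remainder Theorem.
MODULI = (7, 9, 11)                    # pairwise coprime; range = 693

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(xs, ys):
    # each lane multiplies a small residue pair, independently of the others
    return tuple((a * b) % m for a, b, m in zip(xs, ys, MODULI))

def from_rns(rs):
    """CRT reconstruction of the binary result (requires Python 3.8+ pow)."""
    M = 1
    for m in MODULI:
        M *= m
    total = 0
    for r, m in zip(rs, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return total % M

x, y = 23, 19
product = from_rns(rns_mul(to_rns(x), to_rns(y)))   # 23 * 19, within range
```

Each lane only ever handles values below its modulus, which is why the unary bit-streams (whose length grows with the represented range) shrink so dramatically.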
Nicolas Brunie, Christoph Lauter, Guillaume Revy
Precision Adaptation for Fast and Accurate Polynomial Evaluation Generation
ASAP 2019, Cornell University, July 15th 2019
Application of “computing just right” to polynomial evaluation
Summary:
• Composing various tools: CGPE, Metalibm-Lutetia, Metalibm-Lugdunum
• Exploring the implementation space
• Generating implementations
• Generating error certificates
Target | sine | zeros | exp  | sinh
10     | 0.70 | 0.73  | 1.07 | 1.08
55     | 1.10 | -     | 1.42 | 0.98
85     | -    | -     | 2.73 | 3.48
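The kind of kernel these tools generate and tune can be pictured with a plain Horner scheme; the degree-9 sine approximant below is a generic Taylor example, not an implementation produced by CGPE or Metalibm.

```python
# Generic Horner-scheme sketch of a tuned polynomial kernel: the coefficients
# are the degree-9 Taylor approximant of sin(x), not a generated kernel.
import math

# sin(x) ~= x * P(x^2), P(t) = 1 - t/6 + t^2/120 - t^3/5040 + t^4/362880
COEFFS = [1 / 362880, -1 / 5040, 1 / 120, -1 / 6, 1.0]  # highest degree first

def horner(coeffs, t):
    acc = 0.0
    for c in coeffs:
        acc = acc * t + c
    return acc

def sin_poly(x):
    return x * horner(COEFFS, x * x)

# worst observed error over [-1, 1] on a coarse grid
err = max(abs(sin_poly(k / 100) - math.sin(k / 100)) for k in range(-100, 101))
```

"Computing just right" then means choosing, per target accuracy, the cheapest degree and intermediate precision for which a machine-checkable error bound like this one still holds.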