TRANSCRIPT
Jessica Vandebon, José G. F. Coutinho, Wayne Luk, Thomas Chau
Transparent Heterogeneous Cloud Acceleration
➔ Goal: make heterogeneous compute resources accessible to resource-oblivious cloud applications
➔ ORIAN extends the homogeneous PaaS execution model with:
  1. transparent acceleration: automatic runtime mapping of jobs to the best worker
  2. heterogeneous elasticity: automatic vertical (type) and horizontal (quantity) scaling to support QoS while minimising cost
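The two mechanisms above can be pictured as a small scheduling policy. This is a hypothetical sketch, not ORIAN's actual interface: the worker types, prices, estimated runtimes, and the 4-jobs-per-worker scaling rule are all invented for illustration.

```python
# Hypothetical sketch of ORIAN-style transparent mapping and heterogeneous
# elasticity. Worker types, runtimes, and prices below are made up.
WORKERS = {
    "cpu":  {"est_runtime_s": 40.0, "price_per_s": 0.001},
    "gpu":  {"est_runtime_s": 8.0,  "price_per_s": 0.006},
    "fpga": {"est_runtime_s": 12.0, "price_per_s": 0.003},
}

def map_job(deadline_s):
    """Transparent acceleration: pick the cheapest worker type that meets QoS."""
    feasible = {t: w for t, w in WORKERS.items()
                if w["est_runtime_s"] <= deadline_s}
    if feasible:
        # cheapest feasible worker: minimise estimated runtime * price
        return min(feasible, key=lambda t: feasible[t]["est_runtime_s"]
                                           * feasible[t]["price_per_s"])
    # nothing meets the deadline: fall back to the fastest worker type
    return min(WORKERS, key=lambda t: WORKERS[t]["est_runtime_s"])

def scale_out(queue_len, active, deadline_s):
    """Heterogeneous elasticity: choose the best type (vertical) and grow the
    pool as load increases (horizontal), under a toy 4-jobs-per-worker rule."""
    best = map_job(deadline_s)
    target = max(1, queue_len // 4)
    return best, max(active, target)
```

With a tight deadline only the GPU qualifies; with a loose one the cheaper FPGA wins, which is the vertical-scaling decision in miniature.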
CRbS: A Code Reordering Based Speeding-up Method of Irregular Loops on CMP
Li Yuancheng, Shi Jiaqi
• CMP (chip multiprocessor), used to improve throughput and speed up multithreaded applications, is becoming more and more commonplace.
• It is difficult for compilers to automatically parallelize irregular single-threaded applications, which have complex data dependences caused by non-linear subscripts, pointers, or function calls within code sections.
Fig. 1 shows the execution model. From Fig. 1(d) we can see that when a RAW (read-after-write, detectable by hardware mechanisms) violation occurs, the speculative thread is restarted. We therefore adopt a code reordering method to reduce RAW violations.
By means of code reordering, we cut down the data dependences by moving the key statements that cause them. Fig. 2 shows the experimental results, which indicate that the proposed method is effective.
Fig. 1 Sequential vs. SpMT parallel execution
Fig. 2 Speedup with the code-reordering method
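The effect of reordering on RAW violations can be sketched with a toy timing model; the statement positions, spawn gap, and restart rule below are illustrative assumptions, not the paper's hardware model.

```python
# Toy model of SpMT (speculative multithreading) RAW squashes: iteration i
# writes a shared value at statement index `write_pos`, iteration i+1 reads
# it at `read_pos`, and thread i+1 starts `spawn_gap` steps after thread i.
def raw_squashes(write_pos, read_pos, n_iters, spawn_gap=1):
    """Count restarts: the read executes at global time spawn_gap + read_pos,
    the write at time write_pos; if the read comes first, hardware detects a
    RAW violation and the speculative thread is restarted."""
    squashes = 0
    for _ in range(n_iters - 1):
        if spawn_gap + read_pos < write_pos:
            squashes += 1
    return squashes

# Original loop body: the key write sits late (index 8), the read early.
before = raw_squashes(write_pos=8, read_pos=0, n_iters=100)
# After reordering, the key write is hoisted to index 1: no violation remains.
after = raw_squashes(write_pos=1, read_pos=0, n_iters=100)
```

Hoisting producer statements earlier (or sinking consumer statements later) shortens the window in which a speculative read can race ahead of the write, which is the intuition behind reordering the "key codes".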
Impact of Structural Faults on Neural Network Performance
Krishna Teja Chitty-Venkata and Arun Somani
Iowa State University, Ames, IA, USA
Dependable Computing and Networking Laboratory
Fig. Row and column faults in a systolic-array weight matrix
Row faults: DNNs are resistant to row faults to an extent.
Column faults: they have a very high impact on network accuracy (if the faulty column is used).
Mitigation strategies: matrix transpose, array reduction
Impact of Faults
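One intuition behind the row/column asymmetry can be shown by injecting stuck-at-zero faults into a toy weight matrix; the mapping of array faults to matrix rows and columns here is a simplifying assumption, not the paper's systolic-array model.

```python
# Toy illustration (not the paper's experiment): inject a stuck-at-zero row
# or column fault into a weight matrix W and see which outputs of y = W x change.
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def inject_row_fault(W, i):
    return [[0.0] * len(row) if r == i else row for r, row in enumerate(W)]

def inject_col_fault(W, j):
    return [[0.0 if c == j else w for c, w in enumerate(row)] for row in W]

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
x = [1.0, 1.0, 1.0]

y      = matvec(W, x)
y_rowf = matvec(inject_row_fault(W, 1), x)  # kills only output neuron 1
y_colf = matvec(inject_col_fault(W, 1), x)  # perturbs every output using col 1
row_hit = [i for i, (a, b) in enumerate(zip(y, y_rowf)) if a != b]
col_hit = [i for i, (a, b) in enumerate(zip(y, y_colf)) if a != b]
```

A row fault silences a single neuron, which a trained network can often absorb; a column fault corrupts every output that consumes that input. The transpose mitigation follows directly: storing W transposed turns a harmful column fault into a more tolerable row fault.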
Energy-Efficient Near Sensor Convolution using Pulsed Unary Processing
M. Hassan Najafi, S. Rasoul Faraji, Kia Bazargan, and David Lilja
• Near-sensor convolution has many applications in IoT
• Conventional fixed-point designs: fast and accurate, but complex and costly
• SC-based designs: low-cost, but low accuracy, long latency, high energy consumption
• Analog-to-digital (ADC) and analog-to-stochastic (ASC) converters are also costly
❖ Traditional designs are inefficient for near-sensor processing
• Pulsed Unary Processing
  • Introduced recently for high-performance processing using stochastic logic
  • Inputs are represented using PWM signals; the duty cycle determines the value
• Proposed Design
  • Input data are converted to inharmonic PWM signals
  • AND gates are used to perform the multiplications
  • Outputs of the AND gates are accumulated using an active integrator
• Evaluation
  • Significant improvement in hardware area cost, power, and energy consumption
  • Avoids costly ADCs
ASAP 2019
Example of multiplication using PWM signals
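The PWM multiplication idea can be checked with a discrete-time sketch; the periods and duty cycles below are illustrative, not the paper's circuit parameters.

```python
# Sketch of PWM-based multiplication: two periodic pulse trains with
# relatively prime ("inharmonic") periods are ANDed; over a full common
# period, the overlap fraction equals the product of the duty cycles.
from math import gcd

def pwm(duty_high, period, length):
    """Discrete PWM signal: high for the first `duty_high` slots of each period."""
    return [1 if t % period < duty_high else 0 for t in range(length)]

p1, p2 = 5, 8                 # relatively prime periods
assert gcd(p1, p2) == 1
a = pwm(3, p1, p1 * p2)       # duty cycle 3/5 = 0.6
b = pwm(2, p2, p1 * p2)       # duty cycle 2/8 = 0.25
anded = [x & y for x, y in zip(a, b)]     # one AND gate per time slot
product = sum(anded) / (p1 * p2)          # the integrator's accumulation
```

Because the periods are coprime, every phase combination of the two signals occurs exactly once over the common period, so the AND overlap gives the product 0.6 × 0.25 = 0.15 exactly rather than statistically.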
An Efficient Application Specific Instruction Set Processor (ASIP) for Tensor Computation
Huang Weipei, CHEUNG Chak-Chung Ray, YAN Hong
• Definition - a processor whose instruction set is augmented and tailored towards a specific application.
• Advantage - programmable, more flexible, lower cost and shortened implementation time; higher power efficiency
• Current application-specific processors - the TPU invented by Google; Cambricon (China) released the Cambrian processor for computer vision and AI
Fig. Typical design flow on an ASIP system: users handle computation partition, program design, and the algorithm flow in software, keeping non-critical computations there, while tensor-matrix and tensor-vector operations are issued to the hardware as orders, data, and instructions.
Fig. ASIP architecture for tensor processing: a CPU interface supplies input instructions and size information (start/finish handshake); a control unit with instruction fetch, decoder, instruction buffer, size buffer, and size register issues the current instruction and control signals; an address generator and memory interface drive the processing modules over the address and data buses.
Running at 111 MHz; power consumption: 1.87 W; ~14 ms for CP decomposition of a 40×40×1200 tensor
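The tensor kernel being accelerated is the CP (CANDECOMP/PARAFAC) model; the sketch below reconstructs a tensor from its factor matrices at toy sizes, as a generic illustration rather than the paper's implementation.

```python
# Sketch of the CP (CANDECOMP/PARAFAC) model: a rank-R decomposition writes
# a 3-way tensor T as a sum of R outer products of factor-matrix columns.
# Sizes here are tiny (the paper reports a 40x40x1200 tensor on the ASIP).
def cp_reconstruct(A, B, C):
    """T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r]."""
    I, J, K = len(A), len(B), len(C)
    R = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(R))
              for k in range(K)] for j in range(J)] for i in range(I)]

# Rank-1 example: T is the outer product of three vectors.
A = [[1.0], [2.0]]            # 2 x 1 factor matrix
B = [[3.0], [4.0]]            # 2 x 1
C = [[5.0], [6.0]]            # 2 x 1
T = cp_reconstruct(A, B, C)   # e.g. T[1][0][1] = 2 * 3 * 6
```

The inner tensor-matrix and tensor-vector products this loop nest hides are exactly the operations the ASIP's custom instructions target.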
MITRACA: Manycore Interlinked Torus Reconfigurable Accelerator Architecture
The 30th IEEE International Conference on
Application-specific Systems, Architectures and Processors
Riadh Ben Abdelhamid, Yoshiki Yamaguchi and Taisuke Boku
University of Tsukuba, Japan
Graduate School of Systems and Information Engineering
Department of Computer Science
July 15, 2019
Overview of the proposed overlay architecture
DeltaNet: Differential Binary Neural Network
Yuka Oba*, Kota Ando*, Tetsuya Asai*, Masato Motomura†, and Shinya Takamaeda-Yamazaki*‡
(*Hokkaido University, †Tokyo Institute of Technology, ‡JST PRESTO)
BNN's disadvantage: information loss
Idea: binarization of the difference
Make effective use of the information before binarization; discarding it would be MOTTAINAI!! ("what a waste!")
Proposed method
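One way to picture "binarization of the difference" is a delta-modulation-style loop; this is a loose illustration of the motivation, with an invented step size, and not DeltaNet's actual training or inference scheme.

```python
# Loose illustration (not DeltaNet's exact scheme): instead of binarizing
# each value outright and discarding its magnitude, binarize the *difference*
# against a running reconstruction, so pre-binarization information keeps
# influencing later bits.
def binarize_plain(xs):
    return [1.0 if x >= 0 else -1.0 for x in xs]

def binarize_diff(xs, step=0.25):
    bits, recon, y = [], [], 0.0
    for x in xs:
        b = 1.0 if x - y >= 0 else -1.0   # binarize the difference, not x
        y += step * b                      # reconstruction tracks the signal
        bits.append(b)
        recon.append(y)
    return bits, recon

xs = [0.3, 0.35, 0.4, 0.45, 0.5, 0.55]
plain = binarize_plain(xs)                 # all +1: magnitudes are lost
bits, recon = binarize_diff(xs)            # recon stays close to the signal
```

Plain binarization maps the whole slowly-rising sequence to a constant +1, while the difference bits let the receiver reconstruct an approximation of the original values.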
Using Residue Number Systems to Accelerate Deterministic Bit-stream Multiplication
Kamyar Givaki, Reza Hojabr, M. Hassan Najafi, Ahmad Khonsari, M. H. Gholamrezayi, Saeid Gorgin, Dara Rahmati
• Deterministic methods of SC
  • Conventional SC: low accuracy, long latency, and high energy consumption
  • Recently proposed deterministic methods of SC: fully accurate, but still long latency and high energy consumption
  • These methods guarantee that each bit of the first bit-stream meets each bit of the second bit-stream exactly once
❖ Conventional SC designs are not suitable for applications that require accurate computation
• Three deterministic methods of SC
  • Rotation of bit-streams
  • Clock-dividing bit-streams
  • Using bit-streams with relatively prime lengths
• Proposed Design
  • Combines the idea of deterministic SC with the Residue Number System (RNS)
  • Residues have lower bit-widths -> exponentially shorter bit-streams
  • The final results are in RNS format -> they need to be converted back to binary
  • Number generators can be shared between all of the computation lanes
• Evaluation
  • Smaller bit-stream generators
  • Significant improvement in computation time and energy consumption compared to previously proposed deterministic methods of SC
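The RNS side of the idea can be sketched in a few lines; the moduli below are illustrative, and the residue multiplications (done here directly) stand in for the operations the design would carry out with short unary bit-streams.

```python
# Sketch of the RNS idea: multiply small residues independently (each would
# use an exponentially shorter bit-stream in the proposed design), then
# convert the RNS result back to binary via the Chinese Remainder Theorem.
MODULI = (7, 9, 11)                    # pairwise coprime; range = 693

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(xs, ys):
    # each lane multiplies a small residue pair, independently of the others
    return tuple((a * b) % m for a, b, m in zip(xs, ys, MODULI))

def from_rns(rs):
    """CRT reconstruction of the binary result (requires Python 3.8+ pow)."""
    M = 1
    for m in MODULI:
        M *= m
    total = 0
    for r, m in zip(rs, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return total % M

x, y = 23, 19
product = from_rns(rns_mul(to_rns(x), to_rns(y)))   # 23 * 19, within range
```

Each lane only ever handles values below its modulus, which is why the unary bit-streams (whose length grows with the represented range) shrink so dramatically.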
Nicolas Brunie, Christoph Lauter, Guillaume Revy
Precision Adaptation for Fast and Accurate Polynomial Evaluation Generation
ASAP 2019, Cornell University, July 15th 2019
Application of “computing just right” to polynomial evaluation
Summary:
• Composing various tools: CGPE, Metalibm-Lutetia, Metalibm-Lugdunum
• Exploring the implementation space
• Generating implementations
• Generating error certificates
Target | sine | zeros | exp  | sinh
10     | 0.70 | 0.73  | 1.07 | 1.08
55     | 1.10 | -     | 1.42 | 0.98
85     | -    | -     | 2.73 | 3.48
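The kind of kernel these tools generate and tune can be pictured with a plain Horner scheme; the degree-9 sine approximant below is a generic Taylor example, not an implementation produced by CGPE or Metalibm.

```python
# Generic Horner-scheme sketch of a tuned polynomial kernel: the coefficients
# are the degree-9 Taylor approximant of sin(x), not a generated kernel.
import math

# sin(x) ~= x * P(x^2), P(t) = 1 - t/6 + t^2/120 - t^3/5040 + t^4/362880
COEFFS = [1 / 362880, -1 / 5040, 1 / 120, -1 / 6, 1.0]  # highest degree first

def horner(coeffs, t):
    acc = 0.0
    for c in coeffs:
        acc = acc * t + c
    return acc

def sin_poly(x):
    return x * horner(COEFFS, x * x)

# worst observed error over [-1, 1] on a coarse grid
err = max(abs(sin_poly(k / 100) - math.sin(k / 100)) for k in range(-100, 101))
```

"Computing just right" then means choosing, per target accuracy, the cheapest degree and intermediate precision for which a machine-checkable error bound like this one still holds.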