
nnMAX™ Reconfigurable Tensor Processor
Fast, Modular, Low Power, Low Cost

Edge Inference Acceleration
✓ 2 to >100 TOPS modular, scalable architecture
✓ Optimized for tough, megapixel models
✓ Higher throughput from less hardware/$/power
✓ Int8, Int16, BFloat16 - can mix between layers
✓ Run any NN or multiple NNs
✓ Programmed by TensorFlow Lite/ONNX
✓ Available now for TSMC 16/12 & soon for GF12
✓ Silicon proven soon in InferX™ X1 AI Inference Accelerator

Real World Applications Process Megapixel Images

nnMAX performs inference faster on real-world models
Our new InferX X1 chip uses a 2x2 array of nnMAX 1-D TPUs with 13MB of SRAM (together about 30 mm²) in a chip that in total is 54 mm². To the right we compare its performance on 2 real customer models and on YOLOv3 for 2 image sizes to the Nvidia Xavier NX, which is what almost all customers tell us is their alternative.

The Nvidia Xavier NX is 350 mm², 7 times larger than InferX X1. But InferX X1 can perform an inference of YOLOv3 at a similar latency and is 2-10x faster on real customer models (for applications different from object detection and recognition). We can benchmark your neural network model in a few days if you are interested.

We use YOLOv3 at 2 megapixels to compare inference architectures.

Why not use ResNet-50? Because the default image size for ResNet-50 is just 224x224 pixels. Many people assume that if one accelerator is faster on ResNet-50, it will also be faster on any other neural network model. What we have found is that neural network model performance depends on many factors, especially the robustness of the memory system for megapixel images.

The chart to the right shows that for ResNet-50 the size of the weights is invariant with image size, but the size of the largest activation layer grows dramatically as the image size goes from 224x224 to megapixels. At 224x224 the largest activation (~1MB) can probably be stored in on-chip SRAM, but with larger activations off-chip memory will be required.
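As a rough illustration of this scaling (assumed shapes, not benchmark data): ResNet-50's first convolution has stride 2 and 64 output channels, so its output activation grows with the pixel count while the roughly 25 million weights stay fixed.

```python
# Rough illustration only (assumes INT8 activations and the standard
# ResNet-50 conv1 shape: stride 2, 64 output channels). The weight count
# stays ~25M parameters regardless of input size; the first large
# activation scales with the number of pixels.
def resnet50_conv1_activation_mb(height, width, channels=64, bytes_per_value=1):
    return (height // 2) * (width // 2) * channels * bytes_per_value / 1e6

print(resnet50_conv1_activation_mb(224, 224))     # ~0.8 MB: fits in on-chip SRAM
print(resnet50_conv1_activation_mb(1080, 1920))   # ~33 MB: forces off-chip DRAM
```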

AI + eFPGA ®


nnMAX Reconfigurable Tensor Processor 1K Tile
Modular AI Inference Building Block

nnMAX 1K Tile Specifications
TSMC16FFC, 4.5 mm² (6.6 mm² with 2MB of L2 SRAM attached).
1.067GHz operation: 0.8V +/-10%, -40 to +125C Tj.
2.1 TOPS/Tile (TOPS is an indicator of peak throughput, not actual throughput).

1024 MACs in clusters of 64 with weights stored in local L0 SRAM.
Signed multiplier, configurable for INT8, INT16, BFloat16.
INT8x8 and INT16x8 operate at full 933MHz; BFloat16x16 and INT16x16 operate at half rate.
Numerics can be mixed layer by layer to maximize precision.
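A quick back-of-envelope check of those figures, assuming the usual convention of counting 2 operations (multiply and accumulate) per MAC per cycle:

```python
# Assumes 2 ops per MAC per cycle, the usual TOPS convention; 1024 MACs at
# ~1.067 GHz lands close to the quoted 2.1 TOPS per tile.
macs_per_tile = 1024
clock_ghz = 1.067
peak_tops = macs_per_tile * 2 * clock_ghz / 1000
print(round(peak_tops, 2))   # ~2.19 TOPS peak per tile
```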

1920 LUT6 (3.1K LUT4 equivalent) and 736 EFLX IO.
XFLX interconnect allows LUTs, IO and nnMAX 1-D TPUs to be connected to implement the data path and state machines required for high-performance throughput layer by layer.

L1 SRAM is used to bring in weights for the next layer. There is also a configuration memory buffer that brings in the next configuration. Between layers, weights are shifted into L0 and configuration bits are updated in 1000 cycles or less; then the next layer starts.
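A minimal sketch of that double-buffered layer switch (illustrative Python, not Flex Logix firmware; the fetch and execute callbacks are hypothetical, and in hardware the prefetch overlaps execution rather than running sequentially as shown):

```python
def run_layers(layers, fetch_weights, fetch_config, execute):
    # Stage layer 0's weights (into L1) and configuration (into the shadow
    # configuration buffer) before starting.
    staged_w, staged_c = fetch_weights(layers[0]), fetch_config(layers[0])
    for i, layer in enumerate(layers):
        active_w, active_c = staged_w, staged_c   # fast swap: <1000 cycles
        if i + 1 < len(layers):
            # Prefetch the next layer's weights and configuration while the
            # current layer runs (shown sequentially here for simplicity).
            staged_w = fetch_weights(layers[i + 1])
            staged_c = fetch_config(layers[i + 1])
        execute(layer, active_w, active_c)
```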

Test vectors give 99% stuck-at fault coverage and >80% AC coverage. The diagram above is an architectural representation, not a physical layout and is not to scale.

An eFPGA Optimized for Inference
FPGAs are already used in volume for inference, for example by Microsoft Azure. Our silicon-proven eFPGA is available with all-logic or with 40 DSP MACs of 22x22. For nnMAX we increased the MACs to 1024, made the MAC size optimal for inference, and reduced the area for logic.

Winograd Acceleration
nnMAX Reconfigurable Tensor Processor implements special hardware for Winograd acceleration for INT8x8. 3x3 convolutions with a stride of 1 run >2 times faster. Weights expand to 1.8x larger after the Winograd transform. To minimize DRAM bandwidth and L2 SRAM requirements, nnMAX stores weights in DRAM/SRAM in normal form and converts them on the fly into Winograd form when brought into the nnMAX 1-D TPU. Activations are converted back when leaving the nnMAX 1-D TPU. In Winograd mode, operations are done with 12 bits in order to maintain full precision.
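The ~1.8x figure falls out of the standard Winograd F(2x2, 3x3) weight transform, where each 3x3 kernel (9 values) becomes a 4x4 tile (16 values, so 16/9 ≈ 1.8x). A minimal NumPy sketch of that transform (illustrative only, not the nnMAX on-the-fly converter):

```python
import numpy as np

# Weight-transform matrix G for Winograd F(2x2, 3x3) (Lavin & Gray,
# "Fast Algorithms for Convolutional Neural Networks").
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def winograd_weight_transform(g3x3):
    """Map a 3x3 kernel to its 4x4 Winograd-domain form U = G g G^T."""
    return G @ g3x3 @ G.T

kernel = np.arange(9, dtype=np.float64).reshape(3, 3)   # example 3x3 kernel
U = winograd_weight_transform(kernel)
print(U.shape, round(16 / 9, 2))   # (4, 4): 16 values vs. 9, ~1.78x expansion
```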

[Diagram: nnMAX 1-D TPU; tile types shown: EFLX Logic Tile, EFLX DSP Tile, nnMAX Inference Tile]



nnMAX Configuration and Arrays
Modular Design Enables Throughput Optimization

nnMAX Arrays with Variable L2 SRAM for the Throughput Your Model Needs
Using the ArrayLinx interconnect, which has thousands of wires on each side of the nnMAX 1K tile and which is automatically connected when two tiles abut, it is easy to generate an array of nnMAX tiles of the size needed to give you the throughput for your model. Different models need different amounts of memory, so there is flexibility in how much L2 SRAM is connected between tiles. The ArrayLinx connections run over the top of the L2 SRAM.

Each nnMAX Reconfigurable Tensor Processor tile can connect to 1, 2 or 4MB of L2 SRAM. Different models will benefit from different amounts of SRAM. More SRAM typically results in less DRAM bandwidth at lower cost.

So nnMAX Reconfigurable Tensor Processor arrays of 2 to >100 TOPS with the appropriate SRAM capacity and DRAM bandwidth can be quickly and easily generated for any application.
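A simple sizing sketch using the per-tile figures quoted earlier (2.1 TOPS peak per nnMAX 1K tile, with 1/2/4MB of L2 SRAM selectable per tile); the helper below is illustrative, not a Flex Logix tool:

```python
import math

TOPS_PER_TILE = 2.1          # peak per nnMAX 1K tile, from the specs above

def size_array(target_tops, l2_mb_per_tile=2):
    """Return (tiles, peak TOPS, total L2 SRAM in MB) needed to hit a target."""
    tiles = math.ceil(target_tops / TOPS_PER_TILE)
    return tiles, round(tiles * TOPS_PER_TILE, 1), tiles * l2_mb_per_tile

print(size_array(8))     # (4, 8.4, 8): a 2x2 array of tiles
print(size_array(100))   # (48, 100.8, 96): a much larger array
```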

Data Path Reconfigured Layer by Layer
Neural models are made up of layers: YOLOv3 has >100. A single image needs over 200 billion MACs: each layer has billions of MAC operations. The XFLX interconnect is reconfigured for each layer to implement the data path for the layer, like a hardwired ASIC. The first configuration shows Layer 0 with 16 nnMAX in parallel, each fed data directly from a separate L2 SRAM bank, with results fed back into a separate bank. Each layer has a dedicated data path from SRAM to hardware to SRAM.

Neural networks are dataflow graphs and map directly onto nnMAX Reconfigurable Tensor Processor hardware, flowing activations from SRAM through MACs, to logic for activation, and back to SRAM.

L1 SRAM brings in weights for the next layer while the current layer executes.

After a layer ends, the nnMAX Reconfigurable Tensor Processor tile is reconfigured in hundreds of cycles to implement the data path for the next layer.

When the number of physical MACs is sufficiently large, it is possible to configure multiple layers in the nnMAX Reconfigurable Tensor Processor array simultaneously. For YOLOv3 the output of layer 0 is a 64MB activation: by configuring both layer 0 and layer 1 together, the output of layer 0 feeds directly into layer 1. This eliminates the need to write 64MB to memory and back, reducing DRAM bandwidth and increasing throughput. The nnMAX Compiler automatically groups layers so as to maximize throughput.
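A toy sketch of that kind of grouping decision (illustrative only, not the nnMAX Compiler's actual heuristic): fuse two consecutive layers when the activation between them would otherwise spill to DRAM, provided both layers' configurations fit in the array's physical MACs at once.

```python
def should_fuse(activation_mb, macs_layer_a, macs_layer_b,
                sram_budget_mb, array_macs):
    # Fuse if the intermediate activation would not fit on chip (so it would
    # otherwise round-trip through DRAM) and both layers fit in the array.
    spills_to_dram = activation_mb > sram_budget_mb
    fits_in_array = (macs_layer_a + macs_layer_b) <= array_macs
    return spills_to_dram and fits_in_array

# YOLOv3-style example: layer 0 emits a 64MB activation. Fusing layers 0 and 1
# streams it straight into layer 1 instead of writing 64MB out to DRAM and
# reading it back. (MAC counts here are placeholders, not real layer sizes.)
print(should_fuse(activation_mb=64, macs_layer_a=1024, macs_layer_b=2048,
                  sram_budget_mb=13, array_macs=4096))     # True: fuse
```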


www.flex-logix.com
Copyright © 2015-2020 Flex Logix Technologies, Inc. nnMAX, EFLX, Flex Logix, XFLX, ArrayLinx and RAMLinx are trademarks of Flex Logix.

All other names mentioned herein are trademarks or registered trademarks of their respective owners. 10/2020

nnMAX Compiler
Software, Development Boards

nnMAX Compiler
Our software will map your neural network model in TensorFlow or ONNX onto our nnMAX Reconfigurable Tensor Processor dataflow architecture. nnMAX's native hardware allows for straightforward mapping of model to hardware, stage by stage, reconfiguring the SRAM-to-nnMAX-to-SRAM connections and the reconfigurable state machines.

The nnMAX software automatically groups layers where it improves throughput and power.

InferX™ X1 Inference Co-Processor Chips/PCIe Board
nnMAX is being integrated into a co-processor chip which will tape out soon. The InferX X1 chip will be a production product as well as a validation vehicle and a software development platform for users of nnMAX IP. PCIe boards will be available with software drivers and the nnMAX Compiler.

Proven Management & Patented Technology
• Our CEO has managed business units with up to 500 people and taken a startup from 4 people to IPO to a $2 billion market cap
• Our executives all have extensive industry experience and industry recognition, including the Outstanding Paper Award at ISSCC
• Our technical team is a combination of silicon engineering, software development, and architecture & system engineering
• We have 25 issued US patents and many more in application: these cover our multiple revolutionary interconnect technologies (XFLX, ArrayLinx, RAMLinx) as well as our nnMAX Reconfigurable Tensor Processor architectural inventions

Well Financed
We have raised ~$27 million from Lux Capital and Eclipse Ventures. Our eFPGA business is growing and profitable and is helping fund the new inference technology. We have a strong cash balance and are growing to keep up with customer needs.

A performance modelling version of the nnMAX Compiler is available, supporting TensorFlow Lite and ONNX.

This tool takes in any neural model and outputs performance metrics for any given nnMAX floorplan (number of MACs, amount of SRAM) and DRAM bandwidth.

On the right is output data for a given model and hardware specification.
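Conceptually, such a model bounds each layer by whichever resource is the bottleneck: compute or DRAM bandwidth. A rough roofline-style sketch of that idea (illustrative only, not the nnMAX Compiler's actual model; the 10 GB/s DRAM figure is an assumed example):

```python
def layer_latency_ms(mac_ops, dram_bytes, num_macs, clock_hz, dram_gb_per_s):
    """Lower-bound latency: the slower of MAC-limited and DRAM-limited time."""
    compute_s = mac_ops / (num_macs * clock_hz)       # MAC-throughput bound
    memory_s = dram_bytes / (dram_gb_per_s * 1e9)     # DRAM-bandwidth bound
    return max(compute_s, memory_s) * 1e3

# Example: a 2-billion-MAC layer that must move 64MB through DRAM, on a
# 2x2 array (4 x 1024 MACs) at 933MHz with an assumed 10 GB/s of DRAM.
print(layer_latency_ms(2e9, 64e6, 4 * 1024, 933e6, 10))   # DRAM-bound: ~6.4 ms
```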

A software driver has been written and is being used in emulation to interface to the nnMAX Reconfigurable Tensor Processor array and subsystem over PCIe. It will be available for customer use and can be adapted to other operating systems.

AI + eFPGA ®

[InferX Workflow diagram (STEP 1): Customer Input, TFLite/ONNX Model, Compiler, Executable, Hardware, Data Stream, Runtime/Device Control, Application Layer, Inference Results]