virtual and physical cellular architectures for kilo-processor chip computers tamás roska hungarian...

Virtual and Physical Cellular Architectures for Kilo-processor Chip Computers Tams Roska Hungarian Academy of Sciences and Pzmny P. Catholic University, Budapest, Hungary IEEE CNNA 2010

A new era in computing starts NOT just parallel NOT using the present algorithms NOT massaging the nanoscale devices to repeat the functional primitives of CMOS building blocks Running time is NOT the only measure There are NO efficient tools Key: understanding spatial-temporal algorithmic dynamics Cellular Wave Computers: one way of thinking

Table of Contents The technology trend The six prototype platforms Virtual and Physical Cellular Machines and the Design Scenario Processor and memory array representations Design tools and principles Qualitative theory of operators with local connectedness New principles of Computational Complexity

The Technology Trend

~ 30 nm : over 10 billion transistors Severe limits for clock speed and power dissipation 25 k mixed mode processors and sensors on a cellular visual microprocessor at 180 nm 45 nm: 70 k logic- and 2k arithmetic- processors about 1 million 8 bit microprocessors could be placed (~5 Billion transistors)...Why not? Wire delay is bigger than gate delay, hence communication speed is limited (synchrony radius) Hence The precedence of Locality ~ i.e. Cellular, mainly locally connected architecture is a must for high computing power

International Technology Roadmap for Semiconductors, ITRS/2007 /2009

but it also doesnt scale terrible well.

The six prototype platforms - related new products

1.CELL Broadband Engine Architecture (CBEA) and high-end supercomputer Heterogeneous, multi-core CELL Multiprocessor chip 241M transistor, 235mm2 200 GFlops (SP) @3.2GHz 200 GB/s bus (internal) @ 3.2GHz dual XDR controller (25.6GB/s) two configurable interfaces (76.8GB/s) Power Processor Element (PPE) General purpose processor Synergistic Processor Element (SPE) SIMD-only engine fast access to 256KB local memories Globally coherent DMA to transfer data Racks, cabinets, PetaFlops systems

2. FPGAs three cell arrays: logic- and arithmetic- processors, and memories XILINX VIRTEX 6 FPGA Over 70k Advanced Silicon Modular Logic Blocks in a cellular array Over 2k DSP slices in a cellular array 36 kB memory blocks in a cellular array Application Specific array components Logic and arithmetic processor arrays and memory arrays are tiled on a 2D plate Local directional preference (stream)

Xilinx Virtex-5 FPGA architecture

3. Stacked arrays GPU: Graphic Programming Units Fermi ( NVIDIA -2010): - in 32 core units / 16 kB local memory - 512 cores (16x32) - IEEE 754-2008 32-bit and 64-bit - Full 32-bit integer path with 64-bit extensions - Memory access instructions - transition to 64- bit address

Fermis 16 SM are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contain an orange portion (scheduler and dispatch), a green portion(execution units), and light blue portions (register file and L1 cache).

In each core an integer/logic (64 bit) unit and an arithmetic unit (32/64 floating point) Local and Global memory in a cache architecture Possibility to use several 512 core chips Fully cellular architecture Physical Cellular Machine

SENSORY INTEGRATED CHIPS 4. WITH MIXED MODE CELL PROCESSORS Eye-RIS 5. WITH DIGITAL CELL PROCESSORS XENON

Eye-RIS v.1.2, AnaFocus Ltd.

EyeRIS camera computer - 25 k processors with optical sensors, 1000 frame per sec AnaFocus Ltd. Seville Many kilo processor chips using cellular architecture

XENON V3 (MDA SBIR Phase II Project: ROIC No2, 64x64 prototype) Topographic sensor-processor arrangement (integrated InGaAs/InP sensors are not shown in the photo) Photo of the ASIC processor layer: Processors: 64 (8x8) Sensors: 4096 (64x64) Processing speed (depends on algorithmic complexity): > 1000 fps Programming: general purpose EUTECUS, Inc. - proprietary

6. Cellular big cores: Intel Cloud 48 - 2010 48 IA processors in cellular communication scheme

Integrated Systems Bi-i systems with different sensor inputs and outputs and mobile versions Desktop high-end systems VISCUBE 3D integrated circuits

Desktop Megaprocessor Computer

Processor array (160x120) Mixedsignallayer Memory array (32x24) Digitalproc.layer Viscube 3D architecture Sensor layer (320x240) Digitalmemorylayer Processor array (32x24) AD 8 bit/pixel (gray value) Proc. parallel 1 bit/pixel (mask) Pixel Parallel BB ROI ROI parallel Eutecus-SZTAKI-Anafocus ONR Grant

Virtual and Physical Cellular Machines and the Design Scenario

The many-core Cellular Wave Computer Components The architectural element is defined as A processor array placed on a 2D or 3D grid The processors communicate with a delay proportional with the distance. Typically two speed classes; a local fo within a synchrony radius r, and a global Fo via a Manhattan type bus system. In case of continuous time dynamics f ~1/T, T being the dominant time constant The local and global clock speeds are adjusted to keep the power dissipation within prescribed limit Pd The Virtual Cellular architecture of a single array is shown, for a 2D case, in Figure 1 (hiding physical details)

New principles the role of the geometric address of a single processor in an array is introducing new algorithmic principles: the precedence of geometric locality (due to the physical and logical locality), i.e. the cellular character granular, topographic, etc. cellular wave dynamics as an instruction streaming activity pattern in logic Do NOT parallelize existing single processor algorithms invent new cellular algorithms

The physical constraints on communication speed and power dissipation The basic constraints are: wire delay and power dissipation. A side constraint is the number of communication pins (contacts). The execution of a logic, arithmetic or symbolic elementary array instruction is defined via r input (u(t)), m output (y(t)) and n state (x(t)) variables (t is the time instant). This unit is characterized by its s, surface/area, e, energy, f, operating frequency, w = e f local power dissipation, and the signals are traveling on a wire with length L, width q, and with speed v q introducing a delay of D = l v q

cores can be placed on a single Chip, typically in a rectangle grid, with R input and Q output physical connectors typically at the corners of the Chip, altogether there are K input/output connectors. The maximal value of dissipation of the Chip is W. The physics is represented by the maximal values of , K, and W (as well as the operating frequency). The operating frequency might be global for the whole Chip Fo, or could be local within the Chip, fo, f i (some parts might be switched off, f i = 0)

Virtual and Physical Cellular Machines The Virtual Cellular Machine is the modern version of the Virtual Memory (i) to hide the physical details of the huge number of processing elements (cores, cells, threads, etc.), (ii) to provide the framework for designing the algorithms with elementary array instructions, and (iii) to serve as the starting phase for the Virtual to Physical Cellular Machine mapping.

Virtual Cellular Machines Its computational building blocks are mainly arrays, composed of operator and memory elements, acting in space and time Heterogeneous and homogeneous computational blocks Building blocks have no physical constraints - size, bandwidth, speed, power dissipation, except the two classes of memory access local and global)

A typical logic elementary array instruction (LA) is a binary logic function on n variables, or a memory look-up table array A typical arithmetic/analog elementary array (AA) instruction is a multiply and accumulate (add) term (MAC core) array, A symbolic elementary array instruction (SA) might be a string manipulation core array (e.g. a P system) a complex cell based array instruction (XA), hosting cells with all the three above types of data and instructions An 8, 16, or 32 bit microprocessor could be considered as well as an elementary array instruction with iterative or multi-thread implementation.

A Virtual Cellular Machine is composed of five types of building blocks: (i) cellular processor arrays/layers, CP, with simple (L, or A, or S type) or complex (X) cells and their local memories, these are the protagonist building blocks, (ii) classical digital stored program computers, P (microprocessors), (iii) multimodal topographic (T) or non-topographic inputs and outputs, I/O (e.g. scalar, vector, or matrix signals), (iv) global memories of different data types M, organized in qualitatively different sizes and access times (e.g. cache memories), and (v) interconnection pathways B (busses).

M Pn::P1Pn::P1 P0P0 M1M1 M2M2 I OI O CP1/ 1 CP2/ h CP2/ 1 T Input 2D T Input 2D Simple Example: Global system control & memory B0B0 CP1/ g B0B0 B0B0 b0b0 b1b1 b2b2 F0F0...

tasks, the algorithms to be implemented, are defined on the Data/Memory representations of the Virtual Cellular Machine Various data representations for Topographic (e.g. picture, image, PDE, molecule dynamics) and Non-topographic (e.g. Markov processes, algebra, number theory) problems

Physical Cellular Machines the model for the physical implementation of system architectures A) An algorithm with input, state and output vectors having real/ arithmetic, binary/digital logic, and symbolic variables(typically implemened via digital circuits). B) A real valued state and output dynamic system with analog/continuous or arithmetic variables (typically implemented via mixed mode/analog-and-logic circuits and digital control processors) C) A physical dynamic entity with well defined geometric layout and I/O ports (function in layout) (typical implementations are CMOS and/or beyond CMOS nanoscale designs, or optical architectures with programmable control) We have three elementary programmable Cell processor (cell core) types used in array implementations

Physical Cellular Machines The above cell processors have their own physical parameters, hence, the arrays of them, in similar building arrangements like in the Virtual Cellular Machines, will be defined by given physical parameters as the Size Bandwidth/latency Operation speed Power dissipation It might have the same or a different architecture - compared to the Virtual Cellular Machine

The geometry of the architectures are reflecting the physical layout of the chips or systems. A building block could be implemented as a separate chip or a part of a chip. This architectural geometry defines also the communication (interacting) speed ranges. Hence physical closeness means higher speed ranges. The spatial location or topographic address of each elementary cell processor or cell core, as well as that of each building block within the architecture, and the communication bandwidth, plays a crucial role. In addition to the number of processors, these are the most dramatic differences compared to classical computer science.

The Design Scenario algorithms Processor / memory topography Data /object topography Task/Problem/ Workload Physical Implementation Virtual Cellular Machine Physical Cellular Machine COMPILER

Processor and memory/Data array representations

logic / symbolic array Array Signals, variables, memory Processors logic/ symbolic processor logic / symbolic valuearithmetic/analog processor arithmetic/analog array arithmetic/analog value arithmetic/analog processor array logic/symbolic processor array

FPGA organization

1c 1D 2D 200c. ~50c. GPU organization 1c. 10c. 1c. 100c. CELL organization

Many sensors/data/memory units on one Cell Processor

Separate sensory/memory plane Flow architecture 1 2 K

Design Tools and Principles

Homogeneous cellular arrays Decomposition in 2D Decomposition in 1D Composing complex cells for a homogeneous FPGA architecture Self-organizing distributed control The CNN Universal Machine as a compiler Nontopographic data representation Spatial distribution of arithmetic operators with large number of terms

Equivalent transformations for finite size homogeneous cellular arrays (.Zarndy) Four different methods when the size of the Physical Cellular machine is smaller than that of the Virtual Cellular Machine

Simple example of a Physical Cellular Machine implemented on a single chip (FPGA) CP2 {30x80} CP2 {50x20} CP2 {50x40} CP2 {50x20} CP2 {20x30} CP2 {20x50} Chip size:100x80 processors: conditional data path > streaming activity pattern">

A compiler for FPGA implementation (Z.Nagy and P.Szolgay) Compiler: the FALCON architecture via the CNN Universal Machine. Special cases for Navier Stokes 2D, 2D, incompressible and compressible Multilayer retina model Introducing special FIFOs to overcome the bandwidth constraint and the global control: Cellular Distributed Self-organizing Control structure ==>> streaming activity pattern

Xilinx Virtex-5 FPGA architecture

Virtex-5 Logic Cell Element Configurable Logic Block (CLB) 2 slices

Arithmetic and Memory Cell Elements DSP Slice Dual-ported memory

Programmable Interconnect

The CNN State Equation Assumed that the input is constant or changing slowly Forward Euler discretization Feedback equation Feed-forward equation

The Falcon processor core Memory unit Contains a belt of the cell array Mixer Contains cell values for the next updates Template memory Arithmetic unit

The complex arithmetic unit for implementing the CELL dynamics 9 multipliers (3x3 case) Adder tree DSP slice built-in adders Fully pipelined 1 cell update / clock cycle

The Falcon array Each core processor works on a narrow slice of the image Each line computes one CNN iteration Results are shifted down to the next row processors Linear speedup

Performance compared to a conventional microprocessor *Estimated data

Inviscid, Adiabatic, Compressible flows on the CNN Universal Machine Euler equations: Total energy is defined as: Notations: t: time : Nabla operator : density v(u, v): velocity vector field p: pressure I: identity matrix E: total energy : ratio of specific heats

1 st and 2 nd order solution

NEW algorithms for non-topographic problems Particle filters (A.Horvth and M. Rsonyi) Topographic representation of the nontopographic data Random selection within topographic locality 1000 trajectories x 100 time instants x a few particles Convergence and stability both on Virtual and on an 8 bit Physical Cellular Machine

Arithmetic operations on big number of terms Intel Threading building Blocks (~ log n) Spatial-temporal optimization and their shapes on the cellular core (thread) array Shape filling optimization (~log n/N)

Shape filling optimization N Log N

4 parallel arrays fill factor: static/dynamic N Log N N Data arriving in parallel waves

Algorithm Representations UMF DIAGRAMS Dynamic Operational Graphs Dynamic Acyclic Graphs Dynamic process graphs Etc.

UMF diagrams A single cellular array/ layer defined by a standard CNN dynamics : U: input array, Xo Initial state array, z: threshold or mask array, Y: output array, : time constant or clock time, TEMk : local instruction

Algorithmic structures in terms of arrays/layers Cascade Parallel A typical parallel structure with two parallel flows is shown below, by combining them in the final layer

Optimization of algorithms represented by UMF diagrams Directed acyclic graphs representing UMF diagrams Genetic Programming with Indexed Memory, GP-IM (G. Pazienza)

Dynamic operational graphs Nodes are memories Branches are either the operators or communication paths Extremal graph problems

Qualitative theory of Operators with local connectedness

Fundamental analytic results for 1D Binary CNN (L.O.Chua et. al.) the richness of patterns A fundamental Theorem for oscillating CNN (F. Corinto, T.Roska, and M.Gilli): Any oscillating fully connected dynamical system array can be represented by a locally connected one Synchronization A globally connected array of stochastic oscillatior cells can also be implemented by locally connected ones (G. Mt, E..Horvth, E.Kptalan, A. Tunyagi, Z.Nda, and T. Roska) - level of synchrony - size of local neighborhood

Logic operators Any 2D Boolean operator can be represented by a sequence of locally connected standard CNN Grammar cells Locally connected simple grammar cell arrays are equivalent to Turing Machines (E. Csuhaj Varju et.al.) True Randomness in space True random spatial bit patterns can be generated with standard CNN Universal Machine and spatially correlated local noise (M. Ercsey-Ravasz, Z. Nda, and T. Roska)

New principles of Computational Complexity

The algorithmic and physical complexity measures of these many core architectures will be eventually different compared to the single processor systems abandoning the asymptotic framework (what is big? 1 million?). With cores and M total memory the finite virtual algorithmic complexity C a = C a (, M ) The finite physical computational complexity of an algorithm is measured by a proper mix of speed, power, area, and accuracy.

A new era in Computer Science Computer engineering begins.

virtual and physical cellular architectures for kilo-processor chip computers tamás roska hungarian...

Documents

address slide

fpga architecture slide

technology trend slide

local memory

way of thinking slide

memory arrays

high computing power

arithmetic processor