virtual and physical cellular architectures for kilo-processor chip computers tamás roska hungarian...

Click here to load reader

Post on 19-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

  • Slide 1
  • Virtual and Physical Cellular Architectures for Kilo-processor Chip Computers Tams Roska Hungarian Academy of Sciences and Pzmny P. Catholic University, Budapest, Hungary IEEE CNNA 2010
  • Slide 2
  • A new era in computing starts NOT just parallel NOT using the present algorithms NOT massaging the nanoscale devices to repeat the functional primitives of CMOS building blocks Running time is NOT the only measure There are NO efficient tools Key: understanding spatial-temporal algorithmic dynamics Cellular Wave Computers: one way of thinking
  • Slide 3
  • Table of Contents The technology trend The six prototype platforms Virtual and Physical Cellular Machines and the Design Scenario Processor and memory array representations Design tools and principles Qualitative theory of operators with local connectedness New principles of Computational Complexity
  • Slide 4
  • The Technology Trend
  • Slide 5
  • ~ 30 nm : over 10 billion transistors Severe limits for clock speed and power dissipation 25 k mixed mode processors and sensors on a cellular visual microprocessor at 180 nm 45 nm: 70 k logic- and 2k arithmetic- processors about 1 million 8 bit microprocessors could be placed (~5 Billion transistors)...Why not? Wire delay is bigger than gate delay, hence communication speed is limited (synchrony radius) Hence The precedence of Locality ~ i.e. Cellular, mainly locally connected architecture is a must for high computing power
  • Slide 6
  • International Technology Roadmap for Semiconductors, ITRS/2007 /2009
  • Slide 7
  • but it also doesnt scale terrible well.
  • Slide 8
  • The six prototype platforms - related new products
  • Slide 9
  • 1.CELL Broadband Engine Architecture (CBEA) and high-end supercomputer Heterogeneous, multi-core CELL Multiprocessor chip 241M transistor, 235mm2 200 GFlops (SP) @3.2GHz 200 GB/s bus (internal) @ 3.2GHz dual XDR controller (25.6GB/s) two configurable interfaces (76.8GB/s) Power Processor Element (PPE) General purpose processor Synergistic Processor Element (SPE) SIMD-only engine fast access to 256KB local memories Globally coherent DMA to transfer data Racks, cabinets, PetaFlops systems
  • Slide 10
  • 2. FPGAs three cell arrays: logic- and arithmetic- processors, and memories XILINX VIRTEX 6 FPGA Over 70k Advanced Silicon Modular Logic Blocks in a cellular array Over 2k DSP slices in a cellular array 36 kB memory blocks in a cellular array Application Specific array components Logic and arithmetic processor arrays and memory arrays are tiled on a 2D plate Local directional preference (stream)
  • Slide 11
  • Xilinx Virtex-5 FPGA architecture
  • Slide 12
  • 3. Stacked arrays GPU: Graphic Programming Units Fermi ( NVIDIA -2010): - in 32 core units / 16 kB local memory - 512 cores (16x32) - IEEE 754-2008 32-bit and 64-bit - Full 32-bit integer path with 64-bit extensions - Memory access instructions - transition to 64- bit address
  • Slide 13
  • Fermis 16 SM are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contain an orange portion (scheduler and dispatch), a green portion(execution units), and light blue portions (register file and L1 cache).
  • Slide 14
  • In each core an integer/logic (64 bit) unit and an arithmetic unit (32/64 floating point) Local and Global memory in a cache architecture Possibility to use several 512 core chips Fully cellular architecture Physical Cellular Machine
  • Slide 15
  • SENSORY INTEGRATED CHIPS 4. WITH MIXED MODE CELL PROCESSORS Eye-RIS 5. WITH DIGITAL CELL PROCESSORS XENON
  • Slide 16
  • Eye-RIS v.1.2, AnaFocus Ltd.
  • Slide 17
  • EyeRIS camera computer - 25 k processors with optical sensors, 1000 frame per sec AnaFocus Ltd. Seville Many kilo processor chips using cellular architecture
  • Slide 18
  • XENON V3 (MDA SBIR Phase II Project: ROIC No2, 64x64 prototype) Topographic sensor-processor arrangement (integrated InGaAs/InP sensors are not shown in the photo) Photo of the ASIC processor layer: Processors: 64 (8x8) Sensors: 4096 (64x64) Processing speed (depends on algorithmic complexity): > 1000 fps Programming: general purpose EUTECUS, Inc. - proprietary
  • Slide 19
  • 6. Cellular big cores: Intel Cloud 48 - 2010 48 IA processors in cellular communication scheme
  • Slide 20
  • Integrated Systems Bi-i systems with different sensor inputs and outputs and mobile versions Desktop high-end systems VISCUBE 3D integrated circuits
  • Slide 21
  • Desktop Megaprocessor Computer
  • Slide 22
  • Processor array (160x120) Mixedsignallayer Memory array (32x24) Digitalproc.layer Viscube 3D architecture Sensor layer (320x240) Digitalmemorylayer Processor array (32x24) AD 8 bit/pixel (gray value) Proc. parallel 1 bit/pixel (mask) Pixel Parallel BB ROI ROI parallel Eutecus-SZTAKI-Anafocus ONR Grant
  • Slide 23
  • Virtual and Physical Cellular Machines and the Design Scenario
  • Slide 24
  • The many-core Cellular Wave Computer Components The architectural element is defined as A processor array placed on a 2D or 3D grid The processors communicate with a delay proportional with the distance. Typically two speed classes; a local fo within a synchrony radius r, and a global Fo via a Manhattan type bus system. In case of continuous time dynamics f ~1/T, T being the dominant time constant The local and global clock speeds are adjusted to keep the power dissipation within prescribed limit Pd The Virtual Cellular architecture of a single array is shown, for a 2D case, in Figure 1 (hiding physical details)
  • Slide 25
  • Slide 26
  • New principles the role of the geometric address of a single processor in an array is introducing new algorithmic principles: the precedence of geometric locality (due to the physical and logical locality), i.e. the cellular character granular, topographic, etc. cellular wave dynamics as an instruction streaming activity pattern in logic Do NOT parallelize existing single processor algorithms invent new cellular algorithms
  • Slide 27
  • The physical constraints on communication speed and power dissipation The basic constraints are: wire delay and power dissipation. A side constraint is the number of communication pins (contacts). The execution of a logic, arithmetic or symbolic elementary array instruction is defined via r input (u(t)), m output (y(t)) and n state (x(t)) variables (t is the time instant). This unit is characterized by its s, surface/area, e, energy, f, operating frequency, w = e f local power dissipation, and the signals are traveling on a wire with length L, width q, and with speed v q introducing a delay of D = l v q
  • Slide 28
  • cores can be placed on a single Chip, typically in a rectangle grid, with R input and Q output physical connectors typically at the corners of the Chip, altogether there are K input/output connectors. The maximal value of dissipation of the Chip is W. The physics is represented by the maximal values of , K, and W (as well as the operating frequency). The operating frequency might be global for the whole Chip Fo, or could be local within the Chip, fo, f i (some parts might be switched off, f i = 0)
  • Slide 29
  • Virtual and Physical Cellular Machines The Virtual Cellular Machine is the modern version of the Virtual Memory (i) to hide the physical details of the huge number of processing elements (cores, cells, threads, etc.), (ii) to provide the framework for designing the algorithms with elementary array instructions, and (iii) to serve as the starting phase for the Virtual to Physical Cellular Machine mapping.
  • Slide 30
  • Virtual Cellular Machines Its computational building blocks are mainly arrays, composed of operator and memory elements, acting in space and time Heterogeneous and homogeneous computational blocks Building blocks have no physical constraints - size, bandwidth, speed, power dissipation, except the two classes of memory access local and global)
  • Slide 31
  • A typical logic elementary array instruction (LA) is a binary logic function on n variables, or a memory look-up table array A typical arithmetic/analog elementary array (AA) instruction is a multiply and accumulate (add) term (MAC core) array, A symbolic elementary array instruction (SA) might be a string manipulation core array (e.g. a P system) a complex cell based array instruction (XA), hosting cells with all the three above types of data and instructions An 8, 16, or 32 bit microprocessor could be considered as well as an elementary array instruction with iterative or multi-thread implementation.
  • Slide 32
  • A Virtual Cellular Machine is composed of five types of building blocks: (i) cellular processor arrays/layers, CP, with simple (L, or A, or S type) or complex (X) cells and their local memories, these are the protagonist building blocks, (ii) classical digital stored program computers, P (microprocessors), (iii) multimodal topographic (T) or non-topographic inputs and outputs, I/O (e.g. scalar, vector, or matrix signals), (iv) global memories of different data types M, organized in qualitatively different sizes and access times (e.g. cache memories), and (v) interconnection pathways B (busses).
  • Slide 33
  • M Pn::P1Pn::P1 P0P0 M1M1 M2M2 I OI O CP1/ 1 CP2/ h CP2/ 1 T Input 2D T Input 2D Simple Example: Global system control & memory B0B0 CP1/ g B0B0 B0B0 b0b0 b1b1 b2b2 F0F0...
  • Slide 34
  • tasks, the algorithms to be implemented, are defined on the Data/Memory representations of the Virtual Cellular Machine Various data representations for Topographic (e.g. picture, image, PDE, molecule dynamics) and Non-topographic (e.g. Markov processes, algebra, number theory) problems
  • Slide 35
  • Physical Cellular Machines the model for the physical implementation of system architectures A) An algorithm with input, state and output vectors having real/ arithmetic, binary/digital logic, and symbolic variables(typically implemened via digital circuits). B) A real valued state and output dynamic system with analog/continuous or arithmetic variables (typically implemented via mixed mode/analog-and-logic circuits and digital control processors) C) A physical dynamic entity with well defined geometric layout and I/O ports (function in layout) (typical implementations are CMOS and/or beyond CMOS nanoscale designs, or optical architectures with programmable control) We have three elementary programmable Cell processor (cell core) types used in array implementations
  • Slide 36
  • Physical Cellular Machines The above cell processors have their own physical parameters, hence, the arrays of them, in similar building arrangements like in the Virtual Cellular Machines, will be defined by given physical parameters as the Size Bandwidth/latency Operation speed Power dissipation It might have the same or a different architecture - compared to the Virtual Cellular Machine
  • Slide 37
  • The geometry of the architectures are reflecting the physical layout of the chips or systems. A building block could be implemented as a separate chip or a part of a chip. This architectural geometry defines also the communication (interacting) speed ranges. Hence physical closeness means higher speed ranges. The spatial location or topographic address of each elementary cell processor or cell core, as well as that of each building block within the architecture, and the communication bandwidth, plays a crucial role. In addition to the number of processors, these are the most dramatic differences compared to classical computer science.
  • Slide 38
  • The Design Scenario algorithms Processor / memory topography Data /object topography Task/Problem/ Workload Physical Implementation Virtual Cellular Machine Physical Cellular Machine COMPILER
  • Slide 39
  • Processor and memory/Data array representations
  • Slide 40
  • logic / symbolic array Array Signals, variables, memory Processors logic/ symbolic processor logic / symbolic valuearithmetic/analog processor arithmetic/analog array arithmetic/analog value arithmetic/analog processor array logic/symbolic processor array
  • Slide 41
  • FPGA organization
  • Slide 42
  • 1c 1D 2D 200c. ~50c. GPU organization 1c. 10c. 1c. 100c. CELL organization
  • Slide 43
  • Many sensors/data/memory units on one Cell Processor
  • Slide 44
  • Separate sensory/memory plane Flow architecture 1 2 K
  • Slide 45
  • Design Tools and Principles
  • Slide 46
  • Homogeneous cellular arrays Decomposition in 2D Decomposition in 1D Composing complex cells for a homogeneous FPGA architecture Self-organizing distributed control The CNN Universal Machine as a compiler Nontopographic data representation Spatial distribution of arithmetic operators with large number of terms
  • Slide 47
  • Equivalent transformations for finite size homogeneous cellular arrays (.Zarndy) Four different methods when the size of the Physical Cellular machine is smaller than that of the Virtual Cellular Machine
  • Slide 48
  • Simple example of a Physical Cellular Machine implemented on a single chip (FPGA) CP2 {30x80} CP2 {50x20} CP2 {50x40} CP2 {50x20} CP2 {20x30} CP2 {20x50} Chip size:100x80 processors: conditional data path > streaming activity pattern">
  • A compiler for FPGA implementation (Z.Nagy and P.Szolgay) Compiler: the FALCON architecture via the CNN Universal Machine. Special cases for Navier Stokes 2D, 2D, incompressible and compressible Multilayer retina model Introducing special FIFOs to overcome the bandwidth constraint and the global control: Cellular Distributed Self-organizing Control structure ==>> streaming activity pattern
  • Slide 51
  • Xilinx Virtex-5 FPGA architecture
  • Slide 52
  • Virtex-5 Logic Cell Element Configurable Logic Block (CLB) 2 slices
  • Slide 53
  • Arithmetic and Memory Cell Elements DSP Slice Dual-ported memory
  • Slide 54
  • Programmable Interconnect
  • Slide 55
  • Slide 56
  • The CNN State Equation Assumed that the input is constant or changing slowly Forward Euler discretization Feedback equation Feed-forward equation
  • Slide 57
  • The Falcon processor core Memory unit Contains a belt of the cell array Mixer Contains cell values for the next updates Template memory Arithmetic unit
  • Slide 58
  • The complex arithmetic unit for implementing the CELL dynamics 9 multipliers (3x3 case) Adder tree DSP slice built-in adders Fully pipelined 1 cell update / clock cycle
  • Slide 59
  • The Falcon array Each core processor works on a narrow slice of the image Each line computes one CNN iteration Results are shifted down to the next row processors Linear speedup
  • Slide 60
  • Performance compared to a conventional microprocessor *Estimated data
  • Slide 61
  • Inviscid, Adiabatic, Compressible flows on the CNN Universal Machine Euler equations: Total energy is defined as: Notations: t: time : Nabla operator : density v(u, v): velocity vector field p: pressure I: identity matrix E: total energy : ratio of specific heats
  • Slide 62
  • 1 st and 2 nd order solution
  • Slide 63
  • NEW algorithms for non-topographic problems Particle filters (A.Horvth and M. Rsonyi) Topographic representation of the nontopographic data Random selection within topographic locality 1000 trajectories x 100 time instants x a few particles Convergence and stability both on Virtual and on an 8 bit Physical Cellular Machine
  • Slide 64
  • Arithmetic operations on big number of terms Intel Threading building Blocks (~ log n) Spatial-temporal optimization and their shapes on the cellular core (thread) array Shape filling optimization (~log n/N)
  • Slide 65
  • Shape filling optimization N Log N
  • Slide 66
  • 4 parallel arrays fill factor: static/dynamic N Log N N Data arriving in parallel waves
  • Slide 67
  • Algorithm Representations UMF DIAGRAMS Dynamic Operational Graphs Dynamic Acyclic Graphs Dynamic process graphs Etc.
  • Slide 68
  • UMF diagrams A single cellular array/ layer defined by a standard CNN dynamics : U: input array, Xo Initial state array, z: threshold or mask array, Y: output array, : time constant or clock time, TEMk : local instruction
  • Slide 69
  • Algorithmic structures in terms of arrays/layers Cascade Parallel A typical parallel structure with two parallel flows is shown below, by combining them in the final layer
  • Slide 70
  • Optimization of algorithms represented by UMF diagrams Directed acyclic graphs representing UMF diagrams Genetic Programming with Indexed Memory, GP-IM (G. Pazienza)
  • Slide 71
  • Dynamic operational graphs Nodes are memories Branches are either the operators or communication paths Extremal graph problems
  • Slide 72
  • Qualitative theory of Operators with local connectedness
  • Slide 73
  • Fundamental analytic results for 1D Binary CNN (L.O.Chua et. al.) the richness of patterns A fundamental Theorem for oscillating CNN (F. Corinto, T.Roska, and M.Gilli): Any oscillating fully connected dynamical system array can be represented by a locally connected one Synchronization A globally connected array of stochastic oscillatior cells can also be implemented by locally connected ones (G. Mt, E..Horvth, E.Kptalan, A. Tunyagi, Z.Nda, and T. Roska) - level of synchrony - size of local neighborhood
  • Slide 74
  • Logic operators Any 2D Boolean operator can be represented by a sequence of locally connected standard CNN Grammar cells Locally connected simple grammar cell arrays are equivalent to Turing Machines (E. Csuhaj Varju et.al.) True Randomness in space True random spatial bit patterns can be generated with standard CNN Universal Machine and spatially correlated local noise (M. Ercsey-Ravasz, Z. Nda, and T. Roska)
  • Slide 75
  • New principles of Computational Complexity
  • Slide 76
  • The algorithmic and physical complexity measures of these many core architectures will be eventually different compared to the single processor systems abandoning the asymptotic framework (what is big? 1 million?). With cores and M total memory the finite virtual algorithmic complexity C a = C a (, M ) The finite physical computational complexity of an algorithm is measured by a proper mix of speed, power, area, and accuracy.
  • Slide 77
  • A new era in Computer Science Computer engineering begins.