toward an advanced intelligent memory system university of illinois josep torrellas ...
TRANSCRIPT
![Page 1: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/1.jpg)
Toward an Advanced Intelligent Memory
System
University of Illinois
Josep Torrellas
http://[email protected]
FlexRAM
![Page 2: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/2.jpg)
People Involved
Students Other faculty
Michael Huang
Seung YooH. V. JagadishJoe RenauDavid Padua
Jaejin LeeDaniel Reed
![Page 3: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/3.jpg)
Technological Landscape
Merged Logic and DRAM (MLD):
•IBM, Mitsubishi, Samsung, Toshiba and others
•0.18 um (chips for 1 Gbit DRAM)
•Powerful: e.g. IBM SA-27E ASIC (Feb 99)
•Logic frequency: 400 MHz
•IBM PowerPC 603 proc + 16 KB I, D caches = 3%
•Further advances in the horizon
Opportunity: How to exploit MLD best?
![Page 4: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/4.jpg)
Terminology
Processor In Memory (PIM)
Intelligent Memory orIntelligent RAM (IRAM)
=
![Page 5: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/5.jpg)
Key Applications Benefit from HW
•Data Mining (decision trees and neural networks)
•Computational Biology (protein sequence matching)
•Molecular Dynamics (short-range forces)
•Financial Modeling (stock options, derivatives)
•Multimedia (MPEG-2)
•Decision Support Systems (TPC-D)
•Speech Recognition
All these are Data Intensive Applications
![Page 6: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/6.jpg)
Example App: DNA Matching
Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains
![Page 7: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/7.jpg)
How the Algorithm Works
Generate 50+ most-likely mutations
Pick 4 consecutive aminoacids from sample
![Page 8: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/8.jpg)
Example App: DNA Matching
If match is found: try to extend it
Compare them to every positions in the database DNAs
![Page 9: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/9.jpg)
How to Use MLD
1. Main compute engine of the machine
•Add proc to DRAM chip
•Include a vector processor or multiple processors
Incremental gains
Hard to program
UC Berkeley: IRAMNotre Dame: Execube, PetaflopsMIT: RawStanford: Smart Memories
![Page 10: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/10.jpg)
How to Use MLD (II)
2. Co-processor, special-purpose processor
•ATM switch controller
•Process data beside the disk
•Graphics accelerator
Stanford: ImagineUC Berkeley: ISTORE
![Page 11: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/11.jpg)
How to Use MLD (III)
3. Our approach: take the place of memory chipsin a workstation or server
•PIM chip processes the memory-intensive parts of the program
Illinois: FlexRAMUC Davis: Active PagesUSC-ISI: DIVA
![Page 12: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/12.jpg)
Our Solution: Principles
Extract high bandwidth from DRAM: Many simple processing units
Run legacy codes with high performance: Do not replace off-the-shelf uP in workstation Take place of memory chip. Same interface as DRAM Intelligent memory defaults to plain DRAM
Small increase in cost over DRAM: Simple processing units, still dense
General purpose: Do not hardwire any algorithm. No Special purpose
![Page 13: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/13.jpg)
Architecture Proposed
![Page 14: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/14.jpg)
The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation:
•100’s of P.Mems in memory (e.g. IBM PowerPC 603)
•1 P.Host processor (e.g. Merced, IBM GP)
•100,000’s of very simple P.Arrays in memory
![Page 15: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/15.jpg)
Chip Organization
![Page 16: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/16.jpg)
Memory in one FlexRAM Chip
•64 Mbytes of DRAM organized as 16Mx32 bits
•Organized in 64 1-Mbyte banks
•1 single port
•Associated to 1 P.Array
•2 2-Kbyte row buffers (no P.Array cache)
•P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
•On-chip memory bandwidth: 102 Gbytes/second
•Each bank:
![Page 17: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/17.jpg)
Memory in one FlexRAM Chip
Group of 4 P.Arrays share one 8-Kbyte, 4-portedSRAM instruction memory
•Holds the P.Array code
•Small because short code
•Aggressive access time: 1 cycle = 2.5 ns
![Page 18: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/18.jpg)
P.Array
•64 P.Arrays per chip. Not SIMD but SPMD
•32-bit integer arithmetic; 16 registers
•4 P.Arrays share one multiplier
•No caches, no floating point
•28 different 16-bit instructions
•Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring•Broadcast and notify primitives: Barrier
![Page 19: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/19.jpg)
P.Mem
•2-issue static superscalar like IBM PowerPC 603
•16-Kbyte I, D caches
•Communication with P.Arrays:
•Executes serial sections
•Broadcast/notify or plain write/read to memory
•Communication with other P.Mems:•Memory in all chips is visible•Access via the inter-chip network•Must flush caches to ensure data coherence
![Page 20: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/20.jpg)
Issues
Communication P.Mem-P.Host:
•P.Mem cannot be the master of bus
•P.Host polls a register in Rambus interf. of master P.Mem
•P.Host starts P.Mems by writing register in Rambus interf.
•If P.Mem not finished: memory controller retries. Retries are invisible to P.Host
Virtual memory:
•They share a range of virtual addresses with P.Host
•P.Mems and P.Arrays use virtual memory
![Page 21: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/21.jpg)
Chip Architecture
![Page 22: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/22.jpg)
Basic Block
![Page 23: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/23.jpg)
Area Estimation (mm )
PowerPC 603+caches: 12
SRAM instruction memory: 34
64 Mbytes of DRAM: 330
P.Arrays: 96
Pads + network interf. + refresh logic 20
Rambus interface: 3.4Multipliers: 10
Total = 505
Of which 28% logic, 65% DRAM, 7% SRAM
2
VERY CONSERVATIVE
![Page 24: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/24.jpg)
Evaluation
P.Host P.Host L1 & L2 Bus & Memor Freq: 800 MHz L1 Size: 32 KB Bus: Split Trans Issue Width: 6 L1 RT: 2.5 ns Bus Width: 16 B Dyn Issue: Yes L1 Assoc: 2 Bus Freq: 100 MHz I-Window Size: 96 L1 Line: 64 B Mem RT: 262.5 ns Ld/St Units: 2 L2 Size: 256 KB Int Units: 6 L2 RT: 12.5 ns FP Units: 4 L2 Assoc: 4 Pending Ld/St: 8/8 L2 Line: 64 B BR Penalty: 4 cyc P.Mem P.Mem L1 P.Array Freq: 400 MHz L1 Size: 16 KB Freq: 400 MHz Issue Width: 2 L1 RT: 2.5 ns Issue Width: 1 Dyn Issue: No L1 Assoc: 2 Dyn Issue: No Ld/St Units: 2 L1 Line: 32 B Pending St: 1 Int Units: 2 L2 Cache: No Row Buffers: 3 FP Units: 2 RB Size: 2 KB Pending Ld/St: 8/8 RB Hit: 10 ns BR Penalty: 2 cyc RB Miss: 20 ns BR Penalty: 2 cyc
![Page 25: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/25.jpg)
Utilization
High P.Array Util Low P.Mem Util
![Page 26: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/26.jpg)
Utilization
Low P.Host Utilization
![Page 27: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/27.jpg)
Speedups
Constant Problem Sz Scaled Problem Sz
![Page 28: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/28.jpg)
Speedups
Varying Logic Frequency
![Page 29: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/29.jpg)
Programming FlexRAM
•FlexRAM programmed in C + extensions: C-Flex
•Library of Intelligent Memory Operations (IMOs)
Executed by P.Arrays or P.Mem
C subroutines that can be called from main pgm
Operate on large data sets with poor locality
•Library also contains plain subroutines
•Link program with IMOs or plain subroutines
![Page 30: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/30.jpg)
C-Flex Programming Extensions
•On processor_range: where the following code is executed
•Waitfor processor_range: processors waiting for others
•Release object
•Map object to processor_range: mapping of pages
•Flush(object), Flush&Inval(object): flush from cache
•Broadcast(address), Poll(), Receive(address), Notify()
•FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
![Page 31: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/31.jpg)
Performance Evaluation
Hardware performance monitoring embedded in the chip
Software tools to extract and interpret performance info
![Page 32: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/32.jpg)
Current Status
Identified and wrote all applications Designed architecture based on apps & feasible
technology Conceived ideas behind language/compiler Need to do: chip layout and fabrication
development of the compiler Funds needed for:
processor core (P.Mem) chip fabrication hardware and software engineers
![Page 33: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/33.jpg)
Overall Goal
•Fabricate chips
•Build a workstation with an intelligent memory system
•Demonstrate significant speedups on real applications
•Build a compiler for the intelligent memory system
![Page 34: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas torrellas@cs.uiuc.edu FlexRAM](https://reader036.vdocuments.mx/reader036/viewer/2022062719/56649ed05503460f94bdf0c1/html5/thumbnails/34.jpg)
Conclusion
We have a handle on: A promising technology (MLD) Key applications of industrial interest
Real chance to transform the computing landscape