Specialized Clustering Microprocessor
Team Members: Dan Legorreta, Moshe Looks, Shobana Padmanabhan
Graduate Computer Architecture I, Fall Semester 2005
Final Presentation
Project Goals
Long Term:
  Make it possible to cluster a high-volume stream of documents in real time.
This Course:
  Develop a specialized microprocessor that runs a specific clustering algorithm very efficiently.
Clustering Problem
Clustering algorithms are currently very slow
  ~ O(n^2 d) or worse
  Spend a lot of time “scoring” the clusters
  Scoring is done using “concept vectors”
  A “concept vector” is the average/summation of the document vectors in a cluster
Solution
Develop a processor specifically designed for clustering
  Base the new processor on LEON2
  Modify the processor to improve clustering application performance
Synthesize and demonstrate the improvement on the Liquid Architecture platform
  Developed at Washington University by ARL-FPX and DOC
Clustering as APB Device
  No compiler changes needed, but C application changes are needed
Clustering as Coprocessor Device
  Lower latency with a faster data bus
[Figure: both options shown attached to the LEON2 Processor]
Circuit Design - Introduction
Step 1: Represent documents as bit vectors
Step 2: Compute the bit-wise sums
Step 3: Compute the dot-product for each document
Step 4: Analyze the quality of the cluster
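The four steps can be sketched in software as follows. This is a minimal reference sketch, not the circuit itself. It assumes one byte per bit for readability and a toy dimensionality `DIMS` of 8. The slides do not say how cluster quality is computed, so the sketch assumes the quality is simply the total of the per-document scores. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DIMS 8 /* toy dimensionality for illustration (assumption) */

/* Step 2: for each bit position, count the documents that have it set. */
void bitwise_sums(const unsigned char docs[][DIMS], size_t n_docs,
                  unsigned sums[DIMS])
{
    assert(docs != NULL && sums != NULL);
    for (size_t d = 0; d < DIMS; d++) {
        sums[d] = 0;
        for (size_t i = 0; i < n_docs; i++)
            sums[d] += docs[i][d] ? 1u : 0u;
    }
}

/* Step 3: dot product of one document (Step 1 bit vector) with the sums. */
unsigned dot_product(const unsigned char doc[DIMS], const unsigned sums[DIMS])
{
    unsigned score = 0;
    for (size_t d = 0; d < DIMS; d++)
        if (doc[d])
            score += sums[d];
    return score;
}

/* Step 4 (assumed measure): total of all per-document scores. */
unsigned cluster_quality(const unsigned char docs[][DIMS], size_t n_docs)
{
    unsigned sums[DIMS];
    unsigned total = 0;
    bitwise_sums(docs, n_docs, sums);
    for (size_t i = 0; i < n_docs; i++)
        total += dot_product(docs[i], sums);
    return total;
}
```

A document that shares many set bits with the rest of the cluster gets a high score, because each of its set bits contributes the number of documents that also have that bit set.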
Circuit Design - Limitations
Coprocessor Interface
  Inputs
    Data 1 – 64 bits
    Data 2 – 64 bits
    Opcode – 9 bits
  Outputs
    Result – 64 bits
  Latency
    4 clock cycles between each opcode
  Cannot access main memory directly
Circuit Design – 1st Approach
[Figure: N documents, each a bit vector 4000 bits wide, with a row of sums beneath them]
Coprocessor Commands: 32 + 32 + 32 + ... = 32 * N
  Dot Product
  Bitwise Sum
32 additional coprocessor commands ≈ 0.5 KB of memory needed per document
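One plausible reading of the figures on this slide is that each coprocessor command carries both 64-bit data inputs, so a 4000-bit document needs 32 commands and occupies 500 bytes. The sketch below works through that arithmetic. The 128-bits-per-command interpretation and all names are assumptions, not something the slides state.

```c
#include <assert.h>
#include <stddef.h>

#define DATA_BITS_PER_INPUT 64 /* width of Data 1 and of Data 2 */
#define INPUTS_PER_COMMAND   2 /* assumption: both inputs carry document bits */

/* Commands needed to stream one document through the coprocessor. */
size_t commands_per_document(size_t doc_bits)
{
    size_t bits_per_command = DATA_BITS_PER_INPUT * INPUTS_PER_COMMAND;
    assert(doc_bits > 0);
    return (doc_bits + bits_per_command - 1) / bits_per_command; /* round up */
}

/* Storage needed to hold one document as a packed bit vector. */
size_t bytes_per_document(size_t doc_bits)
{
    return (doc_bits + 7) / 8; /* round up to whole bytes */
}

/* Commands needed for one pass over all N documents. */
size_t commands_per_pass(size_t doc_bits, size_t n_docs)
{
    return commands_per_document(doc_bits) * n_docs;
}
```

With `doc_bits = 4000`, `commands_per_document` returns 32 and `bytes_per_document` returns 500, which matches the 32 * N and roughly 0.5 KB figures under this reading.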
Circuit Implementation
[Figure: N documents split into 64-bit-wide slices, each slice with its own sums. Labels: Local Memory, Dot-Product Circuit, Partial Dot Product, Full Dot Product]
Circuit Diagram
[Figure: block diagram. Coprocessor Interface (Data 1, Data 2, and Opcode in; Result out, to LEON), Control/Opcode Decode, FIFO (Data Input), FIFO (Dot Product), Dot Product Circuit 1, Dot Product Circuit 2, Adder]
Circuit Testing
C program uses inline assembly to issue coprocessor/dot-product opcodes
Compiled using the Liquid development platform, which uses sparc-elf-gcc for LEON
Platform creates IP-packet-based simulation files for ModelSim
Platform uses Synplicity to synthesize and Xilinx tools for build and place-and-route, and creates a bit file
To test on hardware, use the Liquid web interface to:
  Load the program into SRAM
  Start LEON (execute the program)
  Read memory to read the results
Circuit Testing – C Program
long long op1 = 0x0E000000000000001F;
long long op2 = 0x190000000000000022;

// Inline assembly to load addresses of C variables into general-purpose registers
asm(" mov %0, %%l0" : : "r" (&op1));    // Place op1 address in %l0
asm(" mov %0, %%l1" : : "r" (&op2));    // Place op2 address in %l1
asm(" mov %0, %%l2" : : "r" (&result)); // Place result address in %l2

// Load 64-bit op1 value into (%c0,%c1), 64-bit op2 value into (%c2,%c3)
asm("ldd [%l0], %c0");
asm("ldd [%l1], %c2");

// Load Internal Coprocessor Register 0 with 128-bit value (%c2 & %c1)
asm(cpop1(HCLUST_STG1, "0x00", "0x02", "0x30"));

// Store 64-bit value from coproc register file location %c3 to result[0]
asm("set 0x40040040, %l2"); // Set bit mask
asm("std %c3, [%l2]");      // Store 64-bit value from coproc reg file at (%c1e,%c1f) to result[0]
Circuit Testing
Test cluster contains 4 128-bit vectors
  Document 1 = 0x0E000000000000001F
  Document 2 = 0x190000000000000022
  Document 3 = 0x220000000000000019
  Document 4 = 0x1F000000000000000F
Expected scores
  Document 1 = 0x15
  Document 2 = 0x0B
  Document 3 = 0x0C
  Document 4 = 0x17
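The expected scores can be reproduced with a small software reference model. The sketch below assumes each 128-bit test document is a pair of 64-bit words, a high word and a low word (0x0E and 0x1F for Document 1), which is the pairing suggested by the way the input sets are listed on the Inputs slide. It computes per-bit sums for one 64-bit chunk at a time and adds the partial dot products into each document's score. Function names are hypothetical, and the model makes no claim about how the hardware orders its work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_DOC 2  /* assumption: 128-bit document = two 64-bit chunks */
#define BITS_PER_WORD 64

/* Per-bit counts for chunk position w, taken across all documents. */
void chunk_sums(const uint64_t docs[][WORDS_PER_DOC], size_t n_docs,
                size_t w, uint32_t sums[BITS_PER_WORD])
{
    assert(w < WORDS_PER_DOC);
    for (size_t b = 0; b < BITS_PER_WORD; b++) {
        sums[b] = 0;
        for (size_t i = 0; i < n_docs; i++)
            sums[b] += (uint32_t)((docs[i][w] >> b) & 1u);
    }
}

/* Partial dot product of one 64-bit chunk with that chunk's sums. */
uint32_t partial_dot(uint64_t chunk, const uint32_t sums[BITS_PER_WORD])
{
    uint32_t acc = 0;
    for (size_t b = 0; b < BITS_PER_WORD; b++)
        if ((chunk >> b) & 1u)
            acc += sums[b];
    return acc;
}

/* Full dot product per document: sum of its partial dot products. */
void score_cluster(const uint64_t docs[][WORDS_PER_DOC], size_t n_docs,
                   uint32_t scores[])
{
    uint32_t sums[BITS_PER_WORD];
    for (size_t i = 0; i < n_docs; i++)
        scores[i] = 0;
    for (size_t w = 0; w < WORDS_PER_DOC; w++) {
        chunk_sums(docs, n_docs, w, sums);
        for (size_t i = 0; i < n_docs; i++)
            scores[i] += partial_dot(docs[i][w], sums);
    }
}
```

Called with the four documents as `{0x0E, 0x1F}`, `{0x19, 0x22}`, `{0x22, 0x19}`, and `{0x1F, 0x0F}`, `score_cluster` yields 0x15, 0x0B, 0x0C, and 0x17, matching the expected scores.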
Circuit Testing - Inputs
1st Set of Inputs: 0E, 19
2nd Set of Inputs: 22, 1F
3rd Set of Inputs: 1F, 22
4th Set of Inputs: 19, 0F
Circuit Testing - Outputs
1st Set of Results: 15, 0B
2nd Set of Results: 0C, 17
Performance Gain Estimate
Assuming n data points, data dimensionality (k) is 4000
Unaccelerated (lower bound)
  Summation:    4,000*n cycles
  Dot Product:  8,000*n cycles
  Total:        12,000*n cycles
Accelerated (upper bound)
  Stage One:    (1.5 * n) * k / 64 = 93.75*n cycles
  Stage Two:    2 * k / 64 = 125 cycles
  Stage Three:  4 * n = 4*n cycles
  Total:        97.75*n + 125 cycles
Clustering is hierarchical, so cumulative speedup factor is:
  At least: 12,000 / (97.75 + 125) ≈ 54
  At most:  12,000 / 97.75 ≈ 123
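The estimate can be checked numerically. The sketch below evaluates the stage formulas exactly, without rounding, so its results can differ slightly from rounded slide figures. It assumes the unaccelerated costs generalize as k*n cycles for summation and 2*k*n cycles for the dot product, which matches the 4,000*n and 8,000*n figures when k = 4000. Function names are hypothetical.

```c
#include <assert.h>

/* Unaccelerated lower bound: k*n (summation) + 2*k*n (dot product). */
double unaccelerated_cycles(double n, double k)
{
    assert(n > 0.0 && k > 0.0);
    return k * n + 2.0 * k * n;
}

/* Accelerated upper bound: the three coprocessor stages from the slide. */
double accelerated_cycles(double n, double k)
{
    assert(n > 0.0 && k > 0.0);
    double stage_one   = 1.5 * n * k / 64.0; /* grows with n */
    double stage_two   = 2.0 * k / 64.0;     /* fixed cost per cluster */
    double stage_three = 4.0 * n;            /* grows with n */
    return stage_one + stage_two + stage_three;
}

/* Speedup factor for a cluster of n data points. */
double speedup(double n, double k)
{
    return unaccelerated_cycles(n, k) / accelerated_cycles(n, k);
}
```

For k = 4000, `speedup(1, 4000)` is about 53.9, the worst case where the fixed Stage Two cost dominates, and the value approaches 12,000 / 97.75, about 122.8, as n grows.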
Questions
Progress – part 1: Clustering App
Modified clustering application to run on LEON
Challenges
  25 MHz
  No OS, and so no system calls and no I/O
  Only 4 MB SRAM, and 32 KB icache and dcache each
  Recursion/function calls restricted to 7 levels
  No debugger, other than reading from memory
Some of these were addressed with a new cross-compiler, but the profiler was not upgraded for the new cross-compiler
Progress – part 2: APB device
Implemented APB device interface
Challenges
  Huge LEON code base
  Integrate the device, decode memory-mapped registers
Designed dot-product circuit to gain 50% speedup
But for more speedup, we switched to the Co-processor interface
Progress – part 3: Co-processor
Extended hardware implementation to do scoring besides dot-product. Prof. Young helped with design.
Implemented both circuits (3 stages). Tested integration of stage 1 with LEON.
Sample C code for APB
// Dot-product device registers (memory-mapped, so accessed through volatile pointers)
volatile int *PDevice_Init   = (volatile int *) 0x800000D0; // Initialize device
volatile int *PDevice_Status = (volatile int *) 0x800000D0; // Device status
volatile int *PDevice_Data   = (volatile int *) 0x800000D4; // Device input
volatile int *PDevice_Length = (volatile int *) 0x800000D8; // Device result

int recieveBuffer[300]; // Store the result

int main() {
    PDevice_Init[0] = 0;
    if (PDevice_Status[0] == 0) {
        recieveBuffer[0] = PDevice_Data[0];
    }
    return 0;
}
APB Device
Co-processor