CS717
Algorithm-Based Fault Tolerance: Matrix Multiplication
Greg Bronevetsky
Problem at Hand
• Have matrices A and B
• Want to compute their product: AB
• Ask a matrix-matrix-multiply (MMM) implementation to compute product
• Answer: C
• Question: Is C the correct answer? How could we know for sure?
Algorithm-Based Fault Tolerance
• Encode input matrices via error-correcting code
• Run regular MMM algorithm on encoded matrices
– Encoding invariant under MMM
• Naturally outputs encoded matrices
• Encoding guarantees:
– If up to t errors in output, will detect error
– If up to c<t errors in output, can decode correct output matrix
Outline
Linear Error Correcting Codes
Algorithm-Based Fault Tolerance
ABFT = Linear Encoding of Matrices
Error Correcting Codes
• Map $f: \Sigma^k \to \Sigma^n$
– k-long data words → n-long codewords
– We use $\Sigma = \{0, 1\}$
• Code C of length n is a “sparse” subset of $\Sigma^n$
– Very few possible words are valid codewords
• Rate of code: $r = \dfrac{\log_{|\Sigma|} |C|}{n} = \dfrac{k}{n}$
– Amount of information communicated by each codeword
Minimum Distance
• Minimum Distance: $d_{min} = \min_{x,y \in C,\ x \neq y} d(x,y)$
– d() = Hamming distance
• Hamming distance: number of spots where words differ
• Measures difficulty of decoding/correcting corrupted codewords
Detection and Correction
• Code may detect errors in fewer than dmin spots
– No such error can morph one codeword into another
• May correct errors in at most (dmin-1)/2 spots
– Can still find “closest” codeword
• More details later…
Each codeword defines circle around itself of radius dmin/2
Linear Codes
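• Codewords form linear subspace inside $\Sigma^n$
• In rowspace of generator matrix G:
$$G = \begin{bmatrix} 1&0&1&1&1&0&0 \\ 1&1&0&1&0&1&0 \\ 1&1&1&0&0&0&1 \end{bmatrix} \quad \text{a } (n=7,\ k=3) \text{ code}$$
• $\text{codewords} = \{\, mG : m \in \text{messages} \,\}$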
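The row-space definition can be tried out directly. The following is a minimal sketch, assuming the rows of G are read exactly as printed on the slide and that arithmetic is over GF(2); the helper names `encode` and `add` are illustrative and not part of the lecture.

```python
from itertools import product

# Generator matrix with the rows as printed on the slide (n=7, k=3), over GF(2).
G = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def encode(m, G):
    """Codeword m*G over GF(2): XOR of the rows of G selected by the bits of m."""
    k, n = len(G), len(G[0])
    return tuple(sum(m[i] * G[i][j] for i in range(k)) % 2 for j in range(n))

# All 2^k codewords: the row space of G.
codewords = {encode(m, G) for m in product([0, 1], repeat=len(G))}

def add(x, y):
    """Vector addition over GF(2)."""
    return tuple((a + b) % 2 for a, b in zip(x, y))

# The row space is closed under addition and contains the all-zero word.
closed = all(add(x, y) in codewords for x in codewords for y in codewords)
```

Only 8 of the 128 possible 7-bit words are codewords, which is the "sparse subset" picture from the earlier slide.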
Property 1
• Linear combination of any codewords is also a codeword: For any x,y ∈ C, (x+y) ∈ C
• Codeword*constant is codeword: For any z ∈ C, k*z ∈ C
• <0,0…0> always a codeword
• Proof: basic properties of linear spaces
Property 2
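• Minimum distance of linear code = minimum weight of a nonzero codeword:
$$d_{min} = \min_{x,y \in C,\ x \neq y} d(x,y) = \min_{z \in C,\ z \neq 0} \text{weight}(z)$$
• Where weight() = number of 1’s in codeword
• Proof:
$$d_{min} = \min_{x,y \in C,\ x \neq y} d(x,y) = \min_{x,y \in C,\ x \neq y} \text{weight}(x-y)$$
– Let z = x−y; z ∈ C since C is linear
– Any z ∈ C can be expressed as x−y (z = z − 0)
– Thus, $d_{min} = \min_{z \in C,\ z \neq 0} \text{weight}(z)$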
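Property 2 can be checked by brute force on a small code. This sketch assumes the (n=7, k=3) generator matrix from the Linear Codes slide, read row by row as printed; `hamming` and `weight` are illustrative helper names.

```python
from itertools import combinations, product

# Assumed generator matrix: rows as printed on the Linear Codes slide.
G = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def encode(m, G):
    """Codeword m*G over GF(2)."""
    return tuple(sum(m[i] * G[i][j] for i in range(len(G))) % 2
                 for j in range(len(G[0])))

codewords = sorted({encode(m, G) for m in product([0, 1], repeat=len(G))})

def hamming(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def weight(z):
    """Number of 1's in z."""
    return sum(z)

# Left-hand side: minimum over all pairs of distinct codewords.
d_pairs = min(hamming(x, y) for x, y in combinations(codewords, 2))
# Right-hand side: minimum weight over nonzero codewords.
d_weight = min(weight(z) for z in codewords if any(z))
```

The pairwise search touches every pair of codewords, while the weight search touches each codeword once, which is why the weight form is the convenient one for linear codes.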
Parity Check Matrix
• H: dual matrix to G
– Contains basis of space orthogonal to G’s row space
– n-k dimensional space
• H is (n-k)xn
• Space defined as: $x \in C \iff Hx = 0$, where $H = [h_1\ h_2\ \dots\ h_n]$ (columns $h_i$)
• Note: H also defines a linear code
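One standard way to obtain H is from a systematic generator matrix. The sketch below assumes G has the form [P | I_k], which holds for the (7,3) example when its rows are read as printed; over GF(2) the matching parity check matrix is then [I_{n-k} | P^T]. The variable names are illustrative.

```python
# Assumed generator matrix in the form [P | I_k] (rows as printed on the slide).
G = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]
k, n = len(G), len(G[0])

# Split G into P (first n-k columns) and check that the rest is the identity.
P = [row[:n - k] for row in G]
has_identity_block = all(
    G[r][n - k + c] == (1 if r == c else 0) for r in range(k) for c in range(k)
)

# H = [I_{n-k} | P^T]: an (n-k) x n matrix.
H = [
    [1 if c == r else 0 for c in range(n - k)] + [P[i][r] for i in range(k)]
    for r in range(n - k)
]

def dot2(u, v):
    """Inner product over GF(2)."""
    return sum(a * b for a, b in zip(u, v)) % 2

# Every row of H is orthogonal to every row of G, so Hx = 0 for all codewords x.
orthogonal = all(dot2(h, g) == 0 for h in H for g in G)
```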
Property 3
• dmin=min # of columns of H that can sum to 0
• Proof:
– $d_{min}$ = weight of min-weight codeword = # of 1’s in min-weight codeword
– H · (min-weight codeword) = sum of $d_{min}$ columns of H = 0
– No lighter nonzero word can be in H’s null space (otherwise $d_{min}$ wouldn’t be minimal)
– Thus, can’t get fewer columns to sum to 0
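Property 3 can be checked by searching for the smallest set of columns of H that sums to zero. This sketch assumes the parity check matrix H = [I_4 | P^T] that corresponds to the (7,3) example code with G read as [P | I_3]; the function name is illustrative.

```python
from itertools import combinations

# Assumed parity check matrix for the (7,3) example code, H = [I_4 | P^T].
H = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]

def min_columns_summing_to_zero(H):
    """Smallest number of columns of H whose GF(2) sum is the zero vector."""
    rows, n = len(H), len(H[0])
    cols = [tuple(H[r][c] for r in range(rows)) for c in range(n)]
    for t in range(1, n + 1):
        for subset in combinations(cols, t):
            # zip(*subset) walks the subset one coordinate at a time.
            if all(sum(bits) % 2 == 0 for bits in zip(*subset)):
                return t
    return None

d_min = min_columns_summing_to_zero(H)
```

For this H the answer is 4, which matches the minimum weight of the nonzero codewords of the example code.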
Property 4
• Minimum distance of linear code ≤ n-k+1
• Proof
– Total n dimensions (since codewords are n-vectors)
– G’s rowspace rank = k
– Thus, H’s columnspace rank = n-k
– Thus, any n-k+1 columns will be linearly dependent
• Some subset of them adds up to 0
– By Property 3, dmin ≤ n-k+1
Outline
Linear Error Correcting Codes
Algorithm-Based Fault Tolerance
ABFT = Linear Encoding of Matrices
Encoding a Matrix
• Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984
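• Encode each row of matrix via extra column
• Column entries = sums of matrix rows
$$A = \begin{bmatrix} row_1 \\ row_2 \\ \vdots \\ row_m \end{bmatrix}, \qquad e^T = [1\ 1\ \dots\ 1], \qquad A \to [A \ \ Ae] = \begin{bmatrix} row_1 & row_1 e \\ row_2 & row_2 e \\ \vdots & \vdots \\ row_m & row_m e \end{bmatrix}$$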
Encoding a Matrix
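• Encode each column of matrix via extra row
• Row entries = sums of matrix columns
$$A = [col_1\ col_2\ \dots\ col_n], \qquad e^T = [1\ 1\ \dots\ 1], \qquad A \to \begin{bmatrix} A \\ e^T A \end{bmatrix} = \begin{bmatrix} col_1 & col_2 & \dots & col_n \\ e^T col_1 & e^T col_2 & \dots & e^T col_n \end{bmatrix}$$
• Full Encoding:
$$A \to \begin{bmatrix} A & Ae \\ e^T A & e^T A e \end{bmatrix}$$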
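The three encodings can be written in a few lines. This is a minimal sketch using plain lists of lists; the function names `row_encode`, `column_encode` and `full_encode` are illustrative, following the slide's wording rather than any particular library.

```python
def row_encode(A):
    """[A  Ae]: append a checksum column holding the sum of each row."""
    return [list(row) + [sum(row)] for row in A]

def column_encode(A):
    """[A; e^T A]: append a checksum row holding the sum of each column."""
    return [list(row) for row in A] + [[sum(col) for col in zip(*A)]]

def full_encode(A):
    """[A  Ae; e^T A  e^T A e]: checksum column, then checksum row."""
    return column_encode(row_encode(A))

A = [[1, 2, 3],
     [4, 5, 6]]
F = full_encode(A)
```

The bottom-right entry of the full encoding is the sum of all entries of A, reachable either by summing the checksum column or the checksum row.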
Detecting Errors
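• Suppose matrix A is corrupted to matrix $\hat{A}$
– entry $\hat{a}_{i,j}$ is wrong
• Can detect error’s exact position: <i,j>
– unequal row sum: (sum of row i of $\hat{A}$) ≠ $sum_i$
– unequal col sum: (sum of column j of $\hat{A}$) ≠ $sum_j$
$$\text{Erroneous } \hat{A} = \begin{bmatrix} \ddots & \vdots & & \vdots \\ \cdots & \hat{a}_{i,j} & \cdots & sum_i \\ & \vdots & \ddots & \vdots \\ \cdots & sum_j & \cdots & \end{bmatrix}$$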
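Locating a single bad entry amounts to finding the one row and the one column whose checksums disagree. The sketch below assumes a fully encoded matrix stored as a list of lists, with the checksum column last and the checksum row at the bottom, and assumes the error sits in a data entry rather than in a checksum; `locate_error` is an illustrative name.

```python
def locate_error(F):
    """Return (i, j) of a single corrupted data entry of F, or None if F is consistent."""
    m, n = len(F) - 1, len(F[0]) - 1          # data part is m x n
    bad_rows = [i for i in range(m) if sum(F[i][:n]) != F[i][n]]
    bad_cols = [j for j in range(n)
                if sum(F[i][j] for i in range(m)) != F[m][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("pattern is not a single data-entry error")

# Full encoding of [[1, 2, 3], [4, 5, 6]].
F = [[1, 2, 3, 6],
     [4, 5, 6, 15],
     [5, 7, 9, 21]]
clean = locate_error(F)

F_bad = [row[:] for row in F]
F_bad[1][2] = 60                              # corrupt entry <1,2>
position = locate_error(F_bad)
```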
Correcting Errors
• Can correct error using row or col checksum
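$$\text{Erroneous } \hat{A} = \begin{bmatrix} \ddots & \vdots & & & \vdots \\ a_{i,1} \ \cdots & \hat{a}_{i,j} & \cdots \ a_{i,n} & & sum_i \\ & \vdots & \ddots & & \vdots \\ \cdots & sum_j & \cdots & & \end{bmatrix}$$
Row Correction:
$$\begin{aligned} sum_i &= (a_{i,1} + \dots + a_{i,j} + \dots + a_{i,n}) \\ (a_{i,1} + \dots + \hat{a}_{i,j} + \dots + a_{i,n}) - sum_i &= (a_{i,1} + \dots + \hat{a}_{i,j} + \dots + a_{i,n}) - (a_{i,1} + \dots + a_{i,j} + \dots + a_{i,n}) \\ &= \hat{a}_{i,j} - a_{i,j} \end{aligned}$$
Apply Correction:
$$\hat{a}_{i,j} - (\hat{a}_{i,j} - a_{i,j}) = a_{i,j}$$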
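The row correction is a two-line computation once the position is known. This sketch assumes the position <i,j> has already been located, that the matrix carries its checksum column last, and that the checksum itself is intact; `correct_with_row_checksum` is an illustrative name.

```python
def correct_with_row_checksum(F, i, j):
    """Repair entry (i, j) of an encoded matrix F in place using row i's checksum."""
    n = len(F[i]) - 1                  # last column holds the row checksum
    delta = sum(F[i][:n]) - F[i][n]    # equals a_hat[i][j] - a[i][j]
    F[i][j] -= delta
    return delta

# Full encoding of [[1, 2, 3], [4, 5, 6]] with entry <1,2> corrupted from 6 to 60.
F_bad = [[1, 2, 3, 6],
         [4, 5, 60, 15],
         [5, 7, 9, 21]]
delta = correct_with_row_checksum(F_bad, 1, 2)
```

The same computation works with the column checksum by summing down column j instead of across row i.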
Big Trick: Preservation of Encoding
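• Column-encoded mtx * Row-encoded mtx = Fully-encoded mtx
$$\begin{bmatrix} A \\ e^T A \end{bmatrix} \begin{bmatrix} B & Be \end{bmatrix} = \begin{bmatrix} AB & ABe \\ e^T AB & e^T ABe \end{bmatrix}$$
• Can check MMM computation by checking encoding of output
• If product matrix has an erroneous entry
– Can detect
– Can correct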
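The preservation identity can be confirmed numerically on a small example. This is a sketch with a naive triple-loop style `matmul` standing in for whatever MMM implementation is being checked; all function names are illustrative.

```python
def row_encode(A):
    """[A  Ae]: append a checksum column."""
    return [list(row) + [sum(row)] for row in A]

def column_encode(A):
    """[A; e^T A]: append a checksum row."""
    return [list(row) for row in A] + [[sum(col) for col in zip(*A)]]

def matmul(X, Y):
    """Plain matrix product; a stand-in for the MMM routine under test."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Multiply the encoded inputs with the unmodified MMM routine ...
C_encoded = matmul(column_encode(A), row_encode(B))
# ... and compare against the full encoding of the plain product.
C_expected = column_encode(row_encode(matmul(A, B)))
```

No checksum logic runs inside `matmul`; the output arrives fully encoded only because the encoding is linear.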
Applications
• Matrix Multiplication
– Given encoded A and B,
– Check whether MMM result C (?=AB) has valid encoding
• Matrix Factorization
– Given a factorization A=WZ
– Verify correctness by verifying encodings of factors
• Factors row- OR column-encoded
• Can only detect, not correct errors
Weighted ABFT
• Oftentimes need to check row- or column-encoded matrices
– Ex: factorization, data integrity check
• Can only detect errors in such matrices
• Can we also correct?
• Yes, by generalizing to weighted checking rows/columns
Weighting
• Suppose we have d n-vectors w1…wd
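• Can column-encode matrix A (one extra checksum column per weight vector): $A \to [A \ \ Aw_1 \ \dots \ Aw_d]$
• Let’s try out: $w_1 = [1\ 1\ \dots\ 1]^T$, $w_2 = [1\ 2\ \dots\ n]^T$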
Weighted Error Detection
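$$\text{Erroneous } \hat{A} = \begin{bmatrix} \vdots & & & \vdots & \vdots \\ a_{i,1} \ \cdots & \hat{a}_{i,j} & \cdots \ a_{i,n} & w_1 \cdot (a_{i,1} \dots a_{i,n}) & w_2 \cdot (a_{i,1} \dots a_{i,n}) \\ \vdots & & & \vdots & \vdots \end{bmatrix}$$
$$\begin{aligned} \text{Let } s_1 &= (a_{i,1} + \dots + \hat{a}_{i,j} + \dots + a_{i,n}) - w_1 \cdot (a_{i,1} \dots a_{i,j} \dots a_{i,n}) = (\hat{a}_{i,j} - a_{i,j}) \\ \text{Let } s_2 &= (1 a_{i,1} + \dots + j \hat{a}_{i,j} + \dots + n a_{i,n}) - w_2 \cdot (a_{i,1} \dots a_{i,j} \dots a_{i,n}) \\ &= (1 a_{i,1} + \dots + j \hat{a}_{i,j} + \dots + n a_{i,n}) - (1 a_{i,1} + \dots + j a_{i,j} + \dots + n a_{i,n}) \\ \text{Clearly, } s_2 &= j \hat{a}_{i,j} - j a_{i,j} = j(\hat{a}_{i,j} - a_{i,j}) \\ \text{Thus, } \frac{s_2}{s_1} &= j \text{: erroneous entry detected!} \end{aligned}$$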
Weighted Error Correction
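$$\text{Erroneous } \hat{A} = \begin{bmatrix} \vdots & & & \vdots & \vdots \\ a_{i,1} \ \cdots & \hat{a}_{i,j} & \cdots \ a_{i,n} & w_1 \cdot (a_{i,1} \dots a_{i,n}) & w_2 \cdot (a_{i,1} \dots a_{i,n}) \\ \vdots & & & \vdots & \vdots \end{bmatrix}$$
$$\begin{aligned} s_1 &= (\hat{a}_{i,j} - a_{i,j}) \\ \text{Correction: } \hat{a}_{i,j} - s_1 &= \hat{a}_{i,j} - (\hat{a}_{i,j} - a_{i,j}) = a_{i,j} \end{aligned}$$
• Weighted encoding detects and corrects single errors
– Even for non full-encoding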
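The two syndromes can be computed for a single row. This is a minimal sketch, assuming integer data, the weights w1 = (1, …, 1) and w2 = (1, 2, …, n), and an error confined to one data entry (not a checksum); the function names are illustrative.

```python
def weighted_encode_row(row):
    """Append the two checksums w1·row and w2·row, with w2 = (1, 2, ..., n)."""
    return list(row) + [sum(row), sum((l + 1) * a for l, a in enumerate(row))]

def check_and_correct_row(enc):
    """Return (data, j): repaired data and the 1-based error position, or (data, None)."""
    n = len(enc) - 2
    data = list(enc[:n])
    s1 = sum(data) - enc[n]                                         # a_hat - a
    s2 = sum((l + 1) * a for l, a in enumerate(data)) - enc[n + 1]  # j * (a_hat - a)
    if s1 == 0 and s2 == 0:
        return data, None
    if s1 == 0 or s2 % s1 != 0 or not 1 <= s2 // s1 <= n:
        raise ValueError("not a single data-entry error")
    j = s2 // s1                                                    # error position
    data[j - 1] -= s1                                               # undo the error
    return data, j

enc = weighted_encode_row([3, 1, 4, 1, 5])
corrupted = enc[:]
corrupted[2] = 9                       # entry j = 3 changed from 4 to 9
fixed, j = check_and_correct_row(corrupted)
```

Only the row's own checksums are consulted, which is what makes correction possible without a full encoding.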
Outline
Linear Error Correcting Codes
Algorithm-Based Fault Tolerance
ABFT = Linear Encoding of Matrices
“Surprise”
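• But this is all just a linear code!
• Generator matrix for above scheme:
$$G = \begin{bmatrix} 1 & & & & 1 & 1 \\ & 1 & & & 1 & 2 \\ & & \ddots & & \vdots & \vdots \\ & & & 1 & 1 & k \end{bmatrix}$$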
Generating Encodings
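• Given $m = \langle a_{i,1}, a_{i,2}, \dots, a_{i,k} \rangle$ as message word (or matrix row/column)
$$\begin{bmatrix} a_{i,1} & a_{i,2} & \dots & a_{i,k} \end{bmatrix} \begin{bmatrix} 1 & & & & 1 & 1 \\ & 1 & & & 1 & 2 \\ & & \ddots & & \vdots & \vdots \\ & & & 1 & 1 & k \end{bmatrix} = \begin{bmatrix} a_{i,1} & a_{i,2} & \dots & a_{i,k} & \sum_{l=1}^{k} a_{i,l} & \sum_{l=1}^{k} l \, a_{i,l} \end{bmatrix}$$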
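The product m·G can be spelled out for a small k. This sketch assumes the generator matrix has the form [I_k | 1 | (1, 2, …, k)], i.e. an identity block followed by the two weight columns; `weighted_generator` and `vec_mat` are illustrative names.

```python
def weighted_generator(k):
    """k x (k+2) matrix [I_k | ones | (1..k)]: identity plus the two weight columns."""
    return [[1 if c == r else 0 for c in range(k)] + [1, r + 1]
            for r in range(k)]

def vec_mat(m, G):
    """Row vector times matrix, m*G, over the integers."""
    return [sum(m[r] * G[r][c] for r in range(len(G)))
            for c in range(len(G[0]))]

G = weighted_generator(3)
codeword = vec_mat([3, 1, 4], G)
```

The first k entries reproduce the message and the last two are the plain and weighted checksums, so encoding a matrix row is exactly a codeword computation.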
Surprise??
• Not too surprising really
• Why else would MMM preserve encoding?
• Another possibility: $w_1 = [1\ 1\ \dots\ 1]^T$, $w_2 = [2^0\ 2^1\ \dots\ 2^{n-1}]^T$
– Efficient: can be implemented via bit shifts
• Room open for using any linear code!
Error Detection/Correction in General
• To show for linear codes:
– Can detect fewer than dmin errors
– Can correct at most (dmin-1)/2 errors
• Let $m$ be original codeword
• Let $\hat{m}$ be the corrupted codeword
• $\hat{m} = m + e$
– e: error vector
Error Detection in General
• $\hat{m} = m + e$
• Let $s = H\hat{m} = H(m + e) = Hm + He = He$
– s called the “syndrome vector”
– Independent of original codeword
• Note: weight(e) < dmin since < dmin errors
• Thus: $He \neq 0$
• Detection: if $H\hat{m} \neq 0$, then ERROR
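Syndrome-based detection can be tried on the small example code. This sketch assumes the (7,3) code from the Linear Codes slide with parity check matrix H = [I_4 | P^T] (derived from G read as [P | I_3]); `syndrome` and `add` are illustrative helper names.

```python
# Assumed parity check matrix for the (7,3) example code.
H = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]

def syndrome(H, word):
    """s = H * word over GF(2)."""
    return tuple(sum(h * x for h, x in zip(row, word)) % 2 for row in H)

def add(x, y):
    """Vector addition over GF(2)."""
    return tuple((a + b) % 2 for a, b in zip(x, y))

m1 = (1, 0, 1, 1, 1, 0, 0)      # two codewords (rows of G)
m2 = (1, 1, 0, 1, 0, 1, 0)
e = (0, 0, 0, 0, 0, 1, 0)       # single-bit error vector

s_clean = syndrome(H, m1)
s1 = syndrome(H, add(m1, e))    # syndrome of corrupted m1
s2 = syndrome(H, add(m2, e))    # syndrome of corrupted m2
error_detected = any(s1)
```

Both corrupted words give the same syndrome, equal to He, which illustrates that the syndrome does not depend on the original codeword.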
Error Correction in General
• Clearly e is correction vector
– $m = \hat{m} - e$ corrects error in $\hat{m}$
• Sufficient to prove: weight(e) ≤ (dmin-1)/2 ⇒ H is isomorphism: correction vectors ↔ syndrome vectors
– i.e. for each correction vector (want to know) there is a unique syndrome vector
• Thus, possible to correct any error
– may not be efficient
H is Onto
• $H\hat{m} = He = s$
• weight(e) ≤ (dmin-1)/2 < dmin
• rank(H) = n-k ≥ (dmin-1)/2
• Thus, rank(H) ≥ weight(e) and He ≠ 0
– Not enough 1’s in e to sum H’s columns to 0
• H maps onto its range
• Thus, $\forall s\ \exists e.\ s = He$
H is 1-1
• Let e1 and e2 be correction vectors, e1 ≠ e2
• Suppose that:
– weight(e1), weight(e2) ≤ (dmin-1)/2
– He1 = He2 = s
• He1-He2 = H(e1-e2) = s-s = 0
• And so, (e1-e2) is a nonzero codeword
• Thus, weight(e1-e2) ≥ dmin
• But weight(e1), weight(e2) ≤ (dmin-1)/2 and so weight(e1-e2) ≤ dmin-1
• Contradiction! ⇒ e1 = e2
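The 1-1 argument is what makes table-driven correction possible. The sketch below assumes the (7,3) example code with dmin = 4, so error vectors of weight at most 1 are correctable, and the parity check matrix H = [I_4 | P^T]; the table-building and decoding helpers are illustrative names, and the brute-force table is only practical for tiny codes.

```python
from itertools import combinations

# Assumed parity check matrix for the (7,3) example code (dmin = 4).
H = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]

def syndrome(H, word):
    """s = H * word over GF(2)."""
    return tuple(sum(h * x for h, x in zip(row, word)) % 2 for row in H)

def build_table(H, t):
    """Map syndrome -> error vector for every error of weight <= t."""
    n, table = len(H[0]), {}
    for w in range(t + 1):
        for pos in combinations(range(n), w):
            e = tuple(1 if i in pos else 0 for i in range(n))
            s = syndrome(H, e)
            if s in table:                 # would contradict the 1-1 property
                raise ValueError("two light error vectors share a syndrome")
            table[s] = e
    return table

def decode(H, table, received):
    """Subtract (over GF(2): add) the error vector matching the syndrome."""
    e = table[syndrome(H, received)]
    return tuple((r + b) % 2 for r, b in zip(received, e))

table = build_table(H, t=(4 - 1) // 2)
received = (1, 0, 0, 1, 1, 0, 0)           # codeword 1011100 with bit 2 flipped
recovered = decode(H, table, received)
```

A syndrome outside the table raises `KeyError`, signalling more errors than the code can correct.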
Other Encoding Schemes
• Linear codes preserved by matrix multiplication
• Presumably, fancier codes might be preserved by fancier computations
• Limit:
– S. Winograd showed in 1962 that any code s.t. f(xy) = f(x) f(y) has rate (k/n) or minimum weight → 0 as k → ∞
• How general can we get?
• Do good solutions exist for small k?
– k=64 bits should be good enough
Summary
• For Matrix Multiplication can encode input via linear codes
• Solutions exist for more complex codes
– Ex: Fourier Transforms
• On parallel systems must ensure:
– No processor touches >1 element per row/column
– Else, if one processor fails, encoding overwhelmed with errors
– To ensure this must modify algorithm
• Separate check placement theory