DOE Conference on High-Speed Computing, April 2004
Vector Computing: Past, Present and Future
Or Everything You Always Wanted to Know About Vectors
(But Were Afraid to Ask)
Steve Scott, Cray Inc.

Slide 2
Salishan, 2004. Copyright Cray Inc.
Outline
• So what exactly is a “vector computer?”
• Vector advantages
• Vector disadvantages
• Dispelling myths
• The new face of vector processing
• Future directions

Slide 3
Vector Processors
In addition to the regular registers and instructions, a vector processor:
(1) Implements vector registers
(2) Executes vector instructions:
V5 ← V8 * V9 ; performs VL elemental multiplies
V10 ← [A5, A8] ; loads V10 with VL words (base = A5, stride = A8)
Vector instructions expose SIMD data-level parallelism.
[Figure: vector registers V0 to V31, each holding elements 0 to MAXVL-1, alongside the vector length register (VL), mask registers m0 to m7, the carry/borrow register (C/B) and the bit matrix register]
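The two instructions above can be modeled in plain Python to make their semantics concrete. This is a minimal sketch, assuming a flat word-addressed memory stored in a list and Python lists standing in for vector registers; the helper names `vload_strided` and `vmul` are hypothetical, not Cray mnemonics, and MAXVL = 64 follows the 32x64 vector register set of the Cray X1.

```python
from typing import List

MAXVL = 64  # maximum vector length (Cray X1 has 32 vector registers of 64 elements)


def vload_strided(memory: List[float], base: int, stride: int, vl: int) -> List[float]:
    """Model of V10 <- [A5, A8]: load vl words starting at base, stepping by stride."""
    if not 0 <= vl <= MAXVL:
        raise ValueError("VL must be between 0 and MAXVL")
    return [memory[base + i * stride] for i in range(vl)]


def vmul(a: List[float], b: List[float], vl: int) -> List[float]:
    """Model of V5 <- V8 * V9: vl independent elementwise multiplies."""
    return [a[i] * b[i] for i in range(vl)]


memory = [float(x) for x in range(100)]
v8 = vload_strided(memory, base=0, stride=2, vl=4)  # [0.0, 2.0, 4.0, 6.0]
v9 = vload_strided(memory, base=1, stride=3, vl=4)  # [1.0, 4.0, 7.0, 10.0]
v5 = vmul(v8, v9, vl=4)
print(v5)  # [0.0, 8.0, 28.0, 60.0]
```

Because every element is computed independently, the hardware is free to process the elements in parallel, which is the data-level parallelism the slide refers to.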

Slide 4
Exploiting DLP
• Can exploit DLP with multiple “pipes” (or “lanes”)
– parallel functional units applied to a single instruction
– a given instruction executes in one “chime” (VL / # pipes)
e.g.: a VL 64 instruction on a 4-pipe machine takes 16 cycles to execute
[Figure: operations per cycle over time for a typical processor, a 4-way scalar processor (4 operations per cycle; each instruction performs one operation) and an 8-pipe vector processor (16 operations per cycle; each instruction performs 64 operations)]
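The chime arithmetic can be written out directly. A minimal sketch: the rounding-up rule for a VL that is not a multiple of the pipe count, and the interleaved element-to-lane mapping in `lane_elements`, are illustrative assumptions rather than documented Cray behavior.

```python
import math
from typing import List


def chime_cycles(vl: int, pipes: int) -> int:
    """Cycles for one vector instruction of length vl on a machine with `pipes` lanes."""
    if vl < 0 or pipes < 1:
        raise ValueError("need vl >= 0 and pipes >= 1")
    # Assumption: a final partial group of elements still occupies a full cycle.
    return math.ceil(vl / pipes)


def lane_elements(vl: int, pipes: int) -> List[List[int]]:
    """Element indices per lane under a hypothetical interleaved mapping (element i -> lane i % pipes)."""
    return [list(range(lane, vl, pipes)) for lane in range(pipes)]


for pipes in (1, 2, 4, 8):
    print(f"VL=64 on {pipes} pipe(s): {chime_cycles(64, pipes)} cycles")
# VL=64 on 1 pipe(s): 64 cycles
# VL=64 on 2 pipe(s): 32 cycles
# VL=64 on 4 pipe(s): 16 cycles
# VL=64 on 8 pipe(s): 8 cycles
```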

Slide 5
Some Other Vector Features
• Gather/Scatter
V20 ← [A3, V7] ; A3 is base of array, V7 is vector of indices
• Masked operations
M2 ← V5 > S0
V27 ← V27 + V5, M2 ; performs ops only where predicate is true
• Iota instructions
– create an index vector from a mask: V2 ← CIDX(A4, M6)
• Compress instruction
– compact selected elements of a VR: V2 ← CIDX(A4, M6)
• Bit matrix multiply
– GF(2) matrix*matrix or vector*matrix multiplication on 64x64 arrays of bits
• Large integer support
– carry/borrow register holds temporary over/underflows
• Transparent acceleration of packed 32-bit computation
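Gather/scatter, masked operations, iota and compress can be emulated element by element to show what each one computes. This is a sketch with hypothetical helper names; the scalar operand A4 of CIDX is not modeled because the slide does not say what it does, and lists stand in for vector and mask registers.

```python
from typing import List


def gather(memory: List[float], base: int, indices: List[int]) -> List[float]:
    """V20 <- [A3, V7]: element i is loaded from memory[base + indices[i]]."""
    return [memory[base + idx] for idx in indices]


def scatter(memory: List[float], base: int, indices: List[int], values: List[float]) -> None:
    """Store counterpart of gather: memory[base + indices[i]] = values[i]."""
    for idx, val in zip(indices, values):
        memory[base + idx] = val


def vcompare_gt(v: List[float], s: float) -> List[bool]:
    """M2 <- V5 > S0: one mask bit per element."""
    return [x > s for x in v]


def vadd_masked(dest: List[float], src: List[float], mask: List[bool]) -> List[float]:
    """V27 <- V27 + V5, M2: add only where the mask bit is set; other elements keep dest."""
    return [d + s if m else d for d, s, m in zip(dest, src, mask)]


def mask_to_indices(mask: List[bool]) -> List[int]:
    """Index vector of the set mask bits (the iota idea)."""
    return [i for i, m in enumerate(mask) if m]


def compress(v: List[float], mask: List[bool]) -> List[float]:
    """Compact the selected elements of a vector register into consecutive positions."""
    return [x for x, m in zip(v, mask) if m]


v5 = [1.0, -2.0, 3.0, -4.0]
m2 = vcompare_gt(v5, 0.0)                     # [True, False, True, False]
v27 = vadd_masked([10.0] * 4, v5, m2)         # [11.0, 10.0, 13.0, 10.0]
print(mask_to_indices(m2), compress(v5, m2))  # [0, 2] [1.0, 3.0]

memory = [float(x) for x in range(10)]
print(gather(memory, 2, [0, 3, 1]))           # [2.0, 5.0, 3.0]
```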

Slide 6
Outline
• So what exactly is a “vector computer?”
• Vector advantages
• Vector disadvantages
• Dispelling myths
• The new face of vector processing
• Future directions

Slide 7
Conveying Parallelism to HW
Source Code → Compiler front end → Program Dependency Graph → Compiler back end → Scalar ISA → Processor pipeline → Hardware with many parallel functional units (FU)
– Compiler analyzes program and discovers parallelism
– Compiler back end… and then throws this information away
– Processor pipeline… making HW re-discover it

Slide 8
Conveying Parallelism to HW
Source Code → Compiler front end → Program Dependency Graph → Compiler back end → Scalar ISA / Vector ISA → Processor pipeline → Hardware with many parallel functional units (FU)
– Compiler analyzes program and discovers parallelism
– Compiler back end… and then throws this information away
– Processor pipeline… making HW re-discover it
– Vector ISA encodes parallelism and control dependences explicitly… so hardware doesn’t have to re-discover it

Slide 9
Low Control Complexity
• Some simple vector instructions:
V1 ← [A1, A5] ; loads V1 from [A1], stride A5
V2 ← [A2, 1] ; loads V2 from mem[A2], stride 1
V3 ← V1 + V2 ; adds two vector registers
[A3, 1] ← V3 ; stores the result to mem[A3], stride 1
• If implemented in a 4-way-unrolled scalar loop (VL = 64), would require:
– 128 loads
– 64 stores
– 64 fpadds
– 48 integer adds
– 16 decrements
– 16 compares
– 16 branches
– 352 total instructions
– 272 register renames and dependence checks
Vectors enable lots of parallelism with low complexity:
ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
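The scalar instruction counts follow from VL = 64 and 4-way unrolling. In this small sketch, the assumptions that three pointer increments per trip account for the 48 integer adds, and that only register-writing instructions need a rename and dependence check, are ours; they are chosen because they reproduce the slide's totals.

```python
VL = 64               # elements processed by the four vector instructions
UNROLL = 4            # scalar loop unrolled 4 ways
trips = VL // UNROLL  # 16 trips through the unrolled loop body

scalar_counts = {
    "loads": 2 * VL,            # one element from each of the two source arrays
    "stores": VL,               # one result element per iteration
    "fpadds": VL,
    "integer adds": 3 * trips,  # assumption: three array pointers bumped once per trip
    "decrements": trips,        # loop counter
    "compares": trips,
    "branches": trips,
}
total = sum(scalar_counts.values())

# Assumption: every instruction that writes a register needs a rename and a
# dependence check; stores and branches write no register.
renames = total - scalar_counts["stores"] - scalar_counts["branches"]


def ops_per_sec(cycles_per_sec: float, instrs_per_cycle: float, ops_per_instr: float) -> float:
    """ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)"""
    return cycles_per_sec * instrs_per_cycle * ops_per_instr


print(total, renames)  # 352 272
# Hypothetical comparison at a 1 GHz clock and one instruction per cycle:
print(ops_per_sec(1e9, 1, 1), ops_per_sec(1e9, 1, VL))
```

The last factor is where vectors gain: the same clock rate and issue width deliver VL operations per instruction instead of one.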

Slide 10
Good Fit with IC Technology
• Single cycle reach on chip rapidly shrinking
– need to provide locality on chip
• Multi-pipe vector processors group registers and functional units into local clusters
[Figure: eight pipes, each pairing a vector register file slice (VRF) with a functional unit (FU), joined by cross-pipe communication]
• Can easily implement very large register file (1000’s of registers)
– register file broken into pipes
– accesses within each pipe are structured (sequential elements)
– can’t come close to this with unified scalar register file

Slide 11
Concurrence and Latency Tolerance
• The “Memory Wall”
– ratio of memory latency to clock cycle continues to grow
– processors need more loads in flight to cover this latency
– scalability makes this worse
• Vector processors provide lots of concurrency
– easy to generate 1000’s of outstanding loads
– can handle non-unit strides and irregular addressing
– modern implementations can dynamically tolerate variable latencies
– excellent fit with scalable DSM systems
• Latency tolerance vs. latency avoidance
– vector caches used for bandwidth filtering, not latency avoidance
– they don’t have to be as large to be effective
– processor is happy with cache-unfriendly codes
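The need for more loads in flight follows from Little's law: to sustain a given request rate, the number of outstanding requests must equal latency times rate. A minimal sketch; the latencies and the 4 words/cycle rate below are hypothetical, not Cray X1 figures, and VL = 64 elements per vector load is assumed.

```python
import math


def loads_in_flight(latency_cycles: float, words_per_cycle: float) -> int:
    """Little's law: requests outstanding = latency x request rate (rounded up)."""
    return math.ceil(latency_cycles * words_per_cycle)


def vector_loads_needed(outstanding: int, vl: int = 64) -> int:
    """Vector load instructions needed if each one puts vl element requests in flight."""
    return math.ceil(outstanding / vl)


for latency in (100, 400, 1000):  # hypothetical memory latencies in clock cycles
    n = loads_in_flight(latency, words_per_cycle=4)
    print(latency, n, vector_loads_needed(n))
# 100 400 7
# 400 1600 25
# 1000 4000 63
```

A scalar core would need thousands of independent load instructions in its window to match this; a vector core needs only a few dozen vector loads.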

Slide 12
Summary of Vector Advantages
• Compiler provides dependence information to hardware
→ high single processor performance with low complexity
• Good fit with IC technology trends:
– independent compute engines with local registers
– very large, simple register file
• High processor concurrency and latency tolerance
– works well with cache-unfriendly codes
– works well in scalable systems
• Bottom line: fewer, more powerful processors
– reduces “surface-to-volume” ratios → reduces communication
– reduces need to scale to large numbers of processors

Slide 13
Outline
• So what exactly is a “vector computer?”
• Vector advantages
• Vector disadvantages
• Dispelling myths
• The new face of vector processing
• Future directions

Slide 14
Vector Disadvantages
• Requires array-style DLP
– not useful on control-intensive code
– need to expose parallelism in leaf routines
    • note: this helps scalar micros too, but isn’t required
– “structs of arrays” vs. “arrays of structs”
    • there is hope…
• More expensive, if designed with balanced bandwidth
– won’t be cost effective when bandwidth is not needed
– won’t be cost effective on scalar code, even if good scalar perf.
• Economies of scale
– vectors are targeted at scientific computing, not broad market
– volumes will always be small

Slide 15
Outline
• So what exactly is a “vector computer?”
• Vector advantages
• Vector disadvantages
• Dispelling myths
• The new face of vector processing
• Future directions

Slide 16
Common Vector Myths
• Vector computers aren’t scalable (how many times have you heard “vector vs. MPPs”?)
– Confusing processor architecture with system architecture
– In a DSM, vectors actually facilitate scalability
• Vector computers are power hungry
– Confusing proc. architecture with implementation technology
– Actually vectors are very power efficient
Example: Vector IRAM
• Vector processors have been overrun by Moore’s Law
– Confusing processor architecture with system architecture and implementation technology
– CMOS vector systems benefit from Moore’s Law too
– In fact, may benefit more in the future…

Slide 17
“Classic” Vector Systems of CRI
[Figure: Moore’s Law vs. Traditional Vector Supercomputers. Peak MFLOPS (log scale, 1 to 1,000,000) against year of introduction (1975 to 2000) for the Cray-1 (1976), X-MP (1983), Y-MP (1987), C90 (1991), T90 (1995) and T90P (1999), compared with a Moore’s Law trend line]

Slide 18
• Q: Why didn’t classic vector system performance improve at Moore’s Law rate?
• A: Because they relied upon flat connectivity to global shared memory, and IC connectivity doesn’t improve as fast as density
• Example: DRAM memory chips over same time period

Slide 19
DRAM Memory Chip Performance
1979: Standard DRAM
– 16 Kbit
– 1-bit wide interface
– 5 Mb/s uniform access BW
– 2 Mb/s random access BW
1999: 200 MHz SDRAM
– 256 Mbit
– 16-bit wide interface
– 3200 Mb/s uniform access BW
– 1000 Mb/s random access BW
1979 → 1999:
• 16000X density increase
• 640X uniform access BW increase
• 500X random access BW increase
• 25X less per-bit memory bandwidth
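The four 1979 → 1999 ratios follow from the chip specifications listed on the slide. A short check; the only inputs are the slide's own numbers, with 16 Kbit and 256 Mbit taken as powers of two.

```python
dram_1979 = {"bits": 16 * 1024, "uniform_mbps": 5, "random_mbps": 2}               # 16 Kbit standard DRAM
dram_1999 = {"bits": 256 * 1024 * 1024, "uniform_mbps": 3200, "random_mbps": 1000}  # 256 Mbit, 200 MHz SDRAM

# 3200 Mb/s uniform bandwidth is a 16-bit interface transferring at 200 MHz
assert 200 * 16 == dram_1999["uniform_mbps"]

density_gain = dram_1999["bits"] / dram_1979["bits"]                  # 16384.0 (slide: 16000X)
uniform_gain = dram_1999["uniform_mbps"] / dram_1979["uniform_mbps"]  # 640.0
random_gain = dram_1999["random_mbps"] / dram_1979["random_mbps"]     # 500.0
per_bit_bw_loss = density_gain / uniform_gain                         # 25.6 (slide: 25X)

print(density_gain, uniform_gain, random_gain, per_bit_bw_loss)
```

Bandwidth per bit stored falls because density grows far faster than pin bandwidth, which is the connectivity gap the previous slide describes.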

Slide 20
Vector Performance Trend
[Figure: Moore’s Law vs. Traditional Vector Supercomputers, the chart from Slide 17 (peak MFLOPS of the Cray-1, X-MP, Y-MP, C90, T90 and T90P against year of introduction, 1975 to 2000) with DRAM BW and DRAM Density trend lines added]

Slide 21
Outline
• So what exactly is a “vector computer?”
• Vector advantages
• Vector disadvantages
• Dispelling myths
• The new face of vector processing
• Future directions

Slide 22
Cray PVP
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
→ Modernized the ISA
T3E
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features
→ Improved via vectors
Cray X1: Extreme scalability with high bandwidth vector processors

Slide 23
New Cray X1 ISA
• Much larger register set (32x64 vector, 64+64 scalar)
• Regular 32-bit instructions
• Byte addressability and sub-word access support
• All operations performed under mask
• VL independence
• Designed to support vector renaming
• 64- and 32-bit memory and arithmetic
• Relaxed memory consistency model plus synch primitives
• Atomic memory operations
• Cache allocation hints

Slide 24
A Similar Code Target
• Classic vector machines:
– SMP parallelism
– Optimize for loop length with little regard for locality
• Scalable micro-based machines:
– Distributed memory parallelism
– Optimize for locality with little regard for loop length
• Can no longer afford an entirely different programming model!
• The Cray X1 is a hierarchical DSM machine
– Rewards locality: register, cache, local memory, remote memory
– Decoupled microarchitecture performs well on short loop nests
• Compatible programming models (distributed memory)
• X1 optimizations will also improve performance on scalar machines
– However, does require vectorizable code
– More compiler and tool help needed

Slide 25
Cray X1 Node
[Figure: node diagram showing processors (P) with caches ($), M chips (M), memory (mem) and I/O]
• Four multistream processors (MSPs), each 12.8 Gflops
• High bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
51 Gflops, 200 GB/s

Slide 26
• 16 parallel networks for bandwidth
• Global shared memory across machine
[Figure: nodes attached to an interconnection network (NUMA)]
Scalable up to 1024 Nodes

Slide 27
Some Design Challenges
• How do we tolerate ever growing memory latencies?
• How do we support address translations and cache coherence with high bandwidth vector processors?

Slide 28
Decoupled Vector Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
– Scalar and vector loads issued early
– Store addresses computed early, saved for later use
– Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
– Scalar starts on next loop before current loop has completed
– Memory pipeline stays full of requests
– Special sync operations keep pipeline full, even across barriers
This is key to making the system perform well on short loop nests.
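A toy timing model shows why letting the access side run ahead helps. It is a sketch under strong simplifying assumptions (one load issued per cycle, a fixed memory latency, an unbounded queue between the access and execute sides, a fixed compute time per datum) and is not a model of the actual X1 pipeline; the function names are hypothetical.

```python
def coupled_time(n: int, latency: int, compute: int) -> int:
    """Each load is issued only after the previous datum has been used."""
    return n * (latency + compute)


def decoupled_time(n: int, latency: int, compute: int, issue_interval: int = 1) -> int:
    """Access side issues all loads early; execute side consumes data in order
    as it arrives (unbounded queue between the two sides)."""
    exec_free = 0  # cycle at which the execute side is next idle
    for i in range(n):
        arrival = i * issue_interval + latency  # load i issued at cycle i * issue_interval
        start = max(arrival, exec_free)         # wait for the data and for the execute unit
        exec_free = start + compute
    return exec_free


print(coupled_time(16, latency=100, compute=1))    # 1616
print(decoupled_time(16, latency=100, compute=1))  # 116
```

In the decoupled case the memory latency is paid roughly once per stream rather than once per element, which is why keeping the memory pipeline full matters even for short loops.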

Slide 29
Maintaining Decoupling Past Synchronization Points
Vector store addresses computed early (before data is available):
– sent out to the shared L2 cache, where they modify cache state
– later loads from other P chips can now be performed
– loads only wait if there is a true conflict
Msync control and data barrier:
P0: … St Vx … Msync … Ld
P1: … St Vx … Msync … Ld
P2: … St Vx … Msync … Ld
P3: … St Vx … Msync … Ld
Want to protect against hazards, but not drain memory pipeline.

Slide 30
Address Translation
• High translation bandwidth:
– scalar + four vector translations per cycle per P chip
• Remote (hierarchical) translation:
– allows each node to manage its own memory (eases memory mgmt.)
– TLB only needs to hold translations for one node → scales
VA:
  bits 63–62  Memory region: useg, kseg, kphys
  bits 61–48  MBZ (must be zero)
  bits 47–0   Virtual Page # | Page Offset (possible page boundaries: 64 KB to 4 GB, i.e. a page offset of bits 15–0 up to bits 31–0)
PA:
  bits 47–46  Physical address space: Main memory, MMR, I/O
  bits 45–36  Node
  bits 35–0   Offset
Max 64 TB physical memory
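The address layouts can be decoded with plain bit-field extraction. A sketch: the field boundaries come from the slide, but the numeric encodings of the memory regions and physical address spaces are not given there, so they are returned as raw field values; treating the offset as a byte offset is an assumption based on the X1's byte addressability.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_va(va: int, page_shift: int) -> dict:
    """Split a virtual address; page_shift is log2(page size)."""
    if not 16 <= page_shift <= 32:  # 64 KB .. 4 GB pages
        raise ValueError("page size must be between 64 KB and 4 GB")
    if field(va, 61, 48) != 0:
        raise ValueError("bits 61..48 must be zero (MBZ)")
    return {
        "region": field(va, 63, 62),  # useg / kseg / kphys; encoding not given on the slide
        "vpn": field(va, 47, page_shift),
        "page_offset": field(va, page_shift - 1, 0),
    }


def decode_pa(pa: int) -> dict:
    """Split a physical address: space (47..46), node (45..36), offset (35..0)."""
    return {"space": field(pa, 47, 46), "node": field(pa, 45, 36), "offset": field(pa, 35, 0)}


NODES = 1 << 10           # 10 node bits -> 1024 nodes
BYTES_PER_NODE = 1 << 36  # 36 offset bits -> 64 GB per node (assuming byte offsets)
print(NODES, (NODES * BYTES_PER_NODE) >> 40)  # 1024 64  (total TB)
```

The 10-bit node field matches the 1024-node maximum quoted for the machine, and 1024 nodes of 64 GB give the 64 TB physical limit.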

Slide 31
Cache Coherence
• Global coherence, but only cache memory from local node
– Supports SMP-style codes up to 4 MSPs
– References outside this domain converted to non-allocate
• Scalable codes use explicit communication anyway
– Keeps directory entry and protocol simple
– Significant reliability benefits for large scale systems
• Explicit cache allocation control
– Per instruction hints
– Use non-allocating refs to avoid cache pollution
• Coherence directory stored on the M chips (rather than in DRAM)
– Low latency and really high bandwidth to support vectors
    • Factor of several hundred over typical DRAM-based directory

Slide 32
Outline
• So what exactly is a “vector computer?”
• Dispelling myths
• Vector advantages
• Vector disadvantages
• The new face of vector processing
• Future directions

Slide 33
Moore’s Law
“The number of transistors per chip doubles every 18 months”
Gordon Moore, “Cramming More Components Onto Integrated Circuits,” Electronics, April 19, 1965.
From Intel web site.

Slide 34
Density Has Driven Performance
• Moore’s Law relates to density
• Transistor switching time ~ proportional to gate length
– feature size × 0.7 → density × 2, speed × 1.4
• However…
– clock rate has been increasing faster than transistor speed, due to deeper pipelining (can speed up clock by using fewer logic levels each clock period)
– with the extra transistors, can perform more complicated logic (can execute more instructions per clock)
→ performance has been scaling with density
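The 0.7 shrink rule is simple arithmetic: density scales with the inverse square of feature size, and speed, with switching time proportional to gate length, scales with the inverse. A short check using only the slide's 0.7 factor; the multi-generation loop is illustrative.

```python
shrink = 0.7  # linear feature-size scale factor per process generation (from the slide)

density_gain = 1 / shrink ** 2  # transistors per unit area scale with 1/length^2
speed_gain = 1 / shrink         # switching time ~ gate length, so speed ~ 1/length

print(f"density x{density_gain:.2f}, speed x{speed_gain:.2f}")  # density x2.04, speed x1.43

# Compounded over a few illustrative process generations
for gens in (1, 2, 3):
    print(gens, round(density_gain ** gens, 1), round(speed_gain ** gens, 1))
```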

Slide 35
Clock Scaling Over Time
[Figure: clock speed (MHz, log scale) of Intel processors, 1970 to 2014, annotated with each generation’s clock period in FO4 (fanout-of-four inverter) delays, falling from 375 FO4 toward 8 to 16 FO4, and with process feature size from 10 μm down to 0.035 μm]
Source: ISAT Last Classical Computer study, 2001

Slide 36
Harder and Harder to Extract ILP
• IPC over the years:
– In 1991: all mainstream micros were single way issue
– By 1993: x86, 68K, Sparc, Alpha, PowerPC and PA-RISC were all 2 (or 3) way issue
– By 1994: PowerPC, Alpha and MIPS were 4 way, and so were Sparc and PA-RISC by ’95 and ’96
– In 2004: 4 to 6 way
• It’s becoming much harder to exploit additional IPC
Not effective to keep building fatter sequential processors

Slide 37
Pentium III vs. Pentium IV
                     Pentium IV            Pentium III
Technology           180 nm                180 nm
Pipeline Stages      20                    10
Clock Rate           1.5 GHz (10.4 FO4)    1 GHz (15 FO4)
SpecInt2000          524                   454
SpecInt/MHz          0.35                  0.45
Die Size             217 mm²               106 mm²
Transistor Count     42 million            24 million
L1 D$ Capacity       12 KBytes             16 KBytes
# Grids              2x10^8                10^8
Source: ISAT Last Classical Computer study, 2001

Slide 38
The Party’s Over
[Figure: performance (ps/instruction, log scale 1e+0 to 1e+7) vs. year, 1980 to 2020: historical improvement of 52%/year flattening to 19%/year; contributing trends ps/gate 19%, gates/clock 9%, clocks/inst 18%]
Source: ISAT Last Classical Computer study, 2001
Conventional processors no longer scale performance by 50% each year
→ future performance increases will require highly parallel on-chip architectures
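The three contributing trends compound multiplicatively. Their product comes to about 53% per year, close to the chart's 52%/year (the small gap is presumably rounding in the chart). The sketch also converts both rates into doubling times; the rates are read from the slide and the function name is hypothetical.

```python
import math

# Annual improvement rates of the three components, as labeled on the chart
ps_per_gate = 0.19
gates_per_clock = 0.09
clocks_per_inst = 0.18

combined = (1 + ps_per_gate) * (1 + gates_per_clock) * (1 + clocks_per_inst) - 1
print(f"combined: {combined:.1%} per year")  # combined: 53.1% per year


def years_to_double(rate: float) -> float:
    """Years for performance to double at a constant annual improvement rate."""
    return math.log(2) / math.log(1 + rate)


print(f"{years_to_double(0.52):.2f} years at 52%/yr, {years_to_double(0.19):.2f} years at 19%/yr")
```

Falling from 52% to 19% per year stretches the doubling time from under two years to about four.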

Slide 39
Trends in the Top500 list (top 20)
[Figure: average number of processors in the top 20 Top500 systems (axis 0 to 3500), 1993 to 2003 (June)]
• Supercomputing is becoming increasingly about scalability
• Expect rate of increase to increase → systems with >100,000 procs in the next decade

Slide 40
Flops are Cheap, Communication is Expensive
• Computation is cheap compared to data movement
– In 0.13 μm CMOS, a 64-bit FPU is < 1 mm² and ≅ 50 pJ
Can fit over 200 on a $200 14 mm x 14 mm 1 GHz chip
– If fed from small, local register files (3200 GB/s, 10 pJ/op): < $1/Gflop (60 mW/Gflop)
– If fed from global on-chip memory (100 GB/s, 1 nJ/op): ~ $30/Gflop (1 W/Gflop)
– If fed from off-chip memory (16 GB/s): ~ $200/Gflop (many W/Gflop)
→ Must increase fraction of operands from on-chip memory
→ Must increase fraction of operands from local registers
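The power figures are consistent with adding the FPU's energy per operation to the energy of delivering its operands; that reading is our assumption. Since 1 pJ per operation at 10^9 operations per second is exactly 1 mW, the power per Gflop/s in mW equals the total energy per operation in pJ. The off-chip case is omitted because the slide gives no energy figure for it.

```python
FPU_PJ = 50.0  # energy per 64-bit floating-point operation in 0.13 um CMOS (from the slide)

# Operand-delivery energy per operation for each source (from the slide)
OPERAND_PJ = {
    "local register file": 10.0,
    "global on-chip memory": 1000.0,  # 1 nJ
}


def mw_per_gflops(operand_pj: float, fpu_pj: float = FPU_PJ) -> float:
    """Power in mW to sustain 1 Gflop/s: 1 pJ/op at 1e9 op/s is exactly 1 mW."""
    return fpu_pj + operand_pj


for source, pj in OPERAND_PJ.items():
    print(f"{source}: {mw_per_gflops(pj):.0f} mW per Gflop/s")
# local register file: 60 mW per Gflop/s
# global on-chip memory: 1050 mW per Gflop/s
```

With on-chip global memory the operand traffic costs twenty times the arithmetic itself, which is why the slide stresses feeding operands from local registers.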

Slide 41
Technology Conclusions
• Single thread scalar performance improvement will slow
– vectors will remain a good way to obtain greater parallelism within a single thread
– but will need to do better than just wider pipes…
• Rate of scalability increase will increase
– need to think about 10’s or 100’s of thousands of processors
• Interconnect will become increasingly important
– need to provide global bandwidth cost effectively
• Using bandwidth wisely will be key
– improve concurrency to utilize bandwidth in the face of ↑latency
– reduce bandwidth demand architecturally

Slide 42
Future Vector Directions
• Hybrid scalar/vector systems
– for better tailoring processor to application
• Vector processor microarchitecture enhancements
– support for arrays of structs
– more advanced decoupling
• Enhancing temporal locality to reduce bandwidth demand
– applying techniques from streaming architectures
– expanded register hierarchy via clustered vector architectures
• Saving bandwidth with smarter memory
– enhanced atomic memory operations
– lightweight processors in memory system
• Going after ILP, DLP and TLP
– multithreading vector processors (need much less)
– see MIT SCALE project for another approach

Thank You.
Questions?