© Gordon Bell 1
NRC Review Panel on High Performance Computing
11 March 1994
Gordon Bell
© Gordon Bell 2
Position
Dual use: exploit parallelism with in situ nodes & networks. Leverage WS & mP industrial HW/SW/app infrastructure!
No Teraflop before its time -- it's Moore's Law
It is possible to help fund computing: heuristics from federal funding & use (50 computer systems and 30 years)
Stop Duel Use, genetic engineering of State Computers:
• 10+ years: nil payback, mono use, poor, & still to come
• plan for apps porting to monos will also be ineffective -- apps must leverage, be cross-platform & self-sustaining
• let "Challenges" choose apps, not mono-use computers
• "industry" offers better computers & these are jeopardized
• users must be free to choose their computers, not funders
• next-generation State Computers "approach" industry
• 10 Tflop ... why?
Summary recommendations
© Gordon Bell 3
[Diagram] Principal computing environments circa 1994: >4 networks to support mainframes, minis, UNIX servers, workstations & PCs.
• IBM & proprietary mainframe world ('50s): mainframes; 3270 (& PC) terminals; token-ring LANs (gateways, bridges, routers, hubs, etc.)
• '70s proprietary mini world & '90s UNIX mini world: minicomputers; ASCII & PC terminals; POTS net for switching terminals
• '80s UNIX distributed workstations & servers world: UNIX workstations; X terminals; NFS servers; compute & dbase uni- & mP servers; UNIX multiprocessor servers operated as traditional minicomputers; Ethernet LANs (gateways, bridges, routers, hubs, etc.)
• Late-'80s LAN-PC world: PCs (DOS, Windows, NT); Novell & NT servers
• Datacomm worlds: clusters; wide-area inter-site network
>4 interconnect & comm. stds: POTS & 3270 terminals; WAN (comm. stds.); LAN (2 stds.); clusters (proprietary)
© Gordon Bell 4
[Diagram] Computing environments circa 2000.
• Local & global datacomm world: ATM† & local area networks for terminals, PCs, workstations, & servers; universal high-speed data service using ATM or ??
• Wide-area global ATM network
• Legacy mainframes & minicomputers: servers & terminals
• Centralized & departmental scalable uni- & mP servers* (UNIX & NT)
• NT, Windows, & UNIX person servers; platforms: x86, PowerPC, SPARC, etc.
• TC = TV + PC home ... (CATV or ATM) ???
* multicomputers built from multiple simple servers: NFS, database, compute, print, & communication servers
† also 10-100 Mb/s pt-to-pt Ethernet
© Gordon Bell 5
Beyond Dual & Duel Use Technology: Parallelism can & must be free!
HPCS, corporate R&D, and technical users must have the goal to design, install, and support parallel environments using and leveraging:
• every in situ workstation & multiprocessor server
• as part of the local ... national network.
Parallelism is a capability that all computing environments can & must possess -- not a feature to segment "mono use" computers
Parallel applications become a way of computing utilizing existing, zero-cost resources -- not a subsidy for specialized ad hoc computers
Apps follow pervasive computing environments
© Gordon Bell 6
Computer genetic engineering & species selection has been ineffective
Although Problem x Machine Scalability using SIMD for simulating some physical systems has been demonstrated, given extraordinary resources, the cost-effectiveness of ever-larger problems has not. Hamming: "The purpose of computing is insight, not numbers."
The "demand side" Challenge users have the problems and should be the drivers. ARPA's contractors should re-evaluate their research in light of driving needs.
Federally funded "Challenge" apps porting should be to multiple platforms, including workstations & compatible multis that support // environments, to ensure portability and understand mainline cost-effectiveness
Continued "supply side" programs aimed at designing, purchasing, supporting, sponsoring, & porting of apps to specialized State Computers, including programs aimed at 10 Tflops, should be re-directed to networked computing.
Users must be free to choose and buy any computer, including PCs & WSs, WS clusters, multiprocessor servers, supercomputers, mainframes, and even highly distributed, coarse-grain, data-parallel MPP State Computers.
© Gordon Bell 7
[Chart] Performance (t): performance on a log scale (1 to 10,000) vs. time, 1988-2000, approaching "the teraflops."
Recoverable data points: Intel $55M; Intel $300M; NEC; Cray Super $30M; Cray DARPA; CM5 $30M; CM5 $120M; CM5 $240M; Bell Prize.
© Gordon Bell 8
We get no Teraflop before its time: it's Moore's Law!
Flops = f(t, $), not f(t); technology plans, e.g. BAA 94-08, ignore $s!
All Flops are not equal (peak announced performance -- PAP, or real app performance -- RAP)
Flops_CMOS-PAP* < C x 1.6^(t-1992) x $; C = 128 x 10^6 flops / $30,000
Flops_RAP = Flops_PAP x 0.5; for real apps, 1/2 PAP is a great goal
Flops_supers = Flops_CMOS x 0.1; improvement of supers is 15-40%/year; higher cost is f(need for profitability, lack of subsidies, volume, SRAM)
'92-'94: Flops_PAP/$ = 4K; Flops_supers/$ = 500; Flops_vsp/$ = 50M (1.6G @ $25)
*Assumes primary & secondary memory size & costs scale with time: memory at $50/MB in 1992-1994 violates Moore's Law; disks at $1/MB in 1993, with size continuing to increase at 60%/year
When does a Teraflop arrive if only $30 million** is spent on a super?
1 Tflop_CMOS PAP in 1996 (x7.8) with 1 Gflop nodes!!!; or 1997 if RAP
10 Tflop_CMOS PAP will be reached in 2001 (x78), or 2002 if RAP
How do you get a teraflop earlier?
**A $60 - $240 million Ultracomputer reduces the time by 1.5 - 4.5 years.
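The slide's arithmetic can be checked directly. A minimal Python sketch: the constant C, the 1.6x/year CMOS rate, and the RAP = 0.5 x PAP derating come from this slide; the solver and its names are illustrative only.

```python
import math

# Slide's model: Flops_CMOS-PAP(t, $) < C * 1.6**(t - 1992) * dollars
C = 128e6 / 30_000      # flops per $ in 1992: 128 Mflops per $30,000
RATE = 1.6              # CMOS improvement factor per year

def year_to_reach(target_flops: float, dollars: float, rap: bool = False) -> float:
    """Year a machine of the given price reaches target_flops.
    RAP (real app performance) is taken as 0.5 * PAP, per the slide."""
    if rap:
        target_flops *= 2           # need 2x the PAP to deliver the RAP target
    return 1992 + math.log(target_flops / (C * dollars), RATE)

budget = 30e6                       # "only $30 million is spent on a super"
print(year_to_reach(1e12, budget))             # ~1996.4: 1 Tflop PAP in 1996 (x7.8)
print(year_to_reach(1e12, budget, rap=True))   # ~1997.8: or 1997 if RAP
print(year_to_reach(10e12, budget))            # ~2001.3: 10 Tflop PAP in 2001 (x78)
```

The x7.8 and x78 multipliers on the slide fall out of the model: a $30M machine delivers 128 Gflops PAP in 1992, so a teraflop needs a factor of 7.8, which 1.6x/year supplies by 1996.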
© Gordon Bell 9
Funding Heuristics (50 computers & 30 years of hindsight)
1. Demand side works, i.e., "we need this product/technology for x"; supply side doesn't work: "Field of Dreams" -- build it and they will come.
2. Direct funding of university research resulting in technology and product prototypes that is carried over to start up a company is the most effective -- provided the right person & team are backed and have a transfer avenue.
a. Forest Baskett > Stanford to fund various projects (SGI, SUN, MIPS)
b. Transfer to large companies has not been effective
c. Government labs ... rare; an accident if something emerges
3. A demanding & tolerant customer or user who "buys" products works best to influence and evolve products (e.g., CDC, Cray, DEC, IBM, SGI, SUN)
a. DOE labs have been effective buyers and influencers ("Fernbach policy"); unclear if labs are effective product, apps, or process developers
b. Universities were effective at influencing computing in timesharing, graphics, workstations, AI workstations, etc.
c. ARPA, per se, and its contractors have not demonstrated a need for flops.
d. Universities have failed ARPA in defining work that demands HPCS -- hence are unlikely to be very helpful as users in the trek to the teraflop.
4. Direct funding of "large scale projects" is risky in outcome, long-term, training, and other effects. ARPAnet established an industry after it escaped BBN!
© Gordon Bell 10
Funding Heuristics - 2
5. Funding product development, targeted purchases, and other subsidies to establish "State Companies" in a vibrant and overcrowded market is wasteful, likely to be wrong, and likely to impede computer development (e.g., by having to feed an overpopulated industry). Furthermore, it is likely to have a deleterious effect on a healthy industry (e.g., supercomputers).
A significantly smaller universe of computing environments is needed. Cray & IBM are givens; SGI is probably the most profitable technical vendor; HP/Convex are likely to be a contender, & others (e.g., DEC) are trying. No state co. (Intel, TMC, Tera) is likely to be profitable & hence self-sustaining.
6. University-company collaboration is a new area of government R&D. So far it hasn't worked, nor is it likely to unless the company invests. It appears to be a way to help companies fund marginal people and projects.
7. CRADAs, or cooperative research and development agreements, are very closely allied to direct product development and are equally likely to be ineffective.
8. Direct subsidy of software apps, or the porting of apps to one platform (e.g., EMI analysis), is a way to keep marginal computers afloat. If government funds apps, they must be ported cross-platform!
9. Encourage the use of computers across the board, but discourage designs from those who have not used or built a successful computer.
© Gordon Bell 11
Scalability: The Platform of HPCS & why continued funding is unnecessary
Mono use, aka MPPs, have been, are, and will be doomed
The law of scalability
Four scalabilities: machine, problem x machine, generation (t), & now spatial
How do flops, memory size, efficiency, & time vary with problem size? Does insight increase with problem size?
What's the nature of problems & work for monos?
What about the mapping of problems onto monos?
What about the economics of software to support monos?
What about all the competitive machines? e.g. workstations, workstation clusters, supers, scalable multis, attached processors?
© Gordon Bell 12
Special, mono-use MPPs are doomed ... no matter how much the feds spend!
Special because they have non-standard nodes & networks -- with no apps. Having not evolved to become mainline, events have overtaken them.
It's special purpose if it's only in Dongarra's Table 3. Flop rate, execution time, and memory size vs. problem size show limited applicability: problems must be scaled very large to cover the inherent, high overhead.
Conjecture: a properly used supercomputer will provide greater insight and utility because of the apps and generality -- running more, smaller-sized problems with a plan produces more insight
The problem domain is limited & now they have to compete with:
• supers -- do scalars, fine grain, and work, and have apps
• workstations -- do very long grain, are in situ, and have apps
• workstation clusters -- have identical characteristics and have apps
• low-priced ($2 million) multis -- are superior, i.e., shorter grain, and have apps
• scalable multiprocessors -- formed from multis, are in the design stage
Mono useful (>>//) -- hence illegal, because they are not dual use. Duel use -- only useful to keep a high budget intact, e.g., 10 TF
© Gordon Bell 13
The Law of Massive Parallelism is based on application scale
There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be related to no other problem.
Corollary: any parallel problem can be scaled to run on an arbitrary network of computers, given enough memory and time (a toy model after the challenges below illustrates this).
Challenge to theoreticians: How well will an algorithm run?
Challenge for software: Can a package be scalable & portable?
Challenge to users: Do larger scale, faster, longer run times increase problem insight, and not just flops?
Challenge to HPCC: Is the cost justified? If so, let users do it!
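The toy model referenced above (my construction, not Bell's): with a fixed per-exchange overhead, any network reaches high efficiency once grains are scaled up far enough -- which says nothing about whether the larger run yields more insight.

```python
def efficiency(grain_ops: float, overhead_ops: float) -> float:
    """Toy model: each node computes grain_ops useful operations, then
    pays a fixed overhead_ops cost to communicate/synchronize."""
    return grain_ops / (grain_ops + overhead_ops)

# A high-overhead network (~1e5 ops lost per exchange, e.g. LAN-connected
# nodes) reaches 99% efficiency once the problem is scaled so that each
# grain is ~1e7 ops -- whether or not the bigger run is worth anything.
for grain in (1e4, 1e5, 1e6, 1e7):
    print(f"grain = {grain:.0e} ops -> efficiency {efficiency(grain, 1e5):.2f}")
```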
© Gordon Bell 14
Scalabilities
Size-scalable computers are designed from a few components, with no bottleneck component.
Generation-scalable computers can be implemented with the next generation of technology with no rewrite/recompile.
Problem x machine scalability -- the ability of a problem, algorithm, or program to exist at a range of sizes so that it can be run efficiently on a given, scalable computer.
Although large scale problems allow high flops, large problems running longer may not produce more insight.
Spatial scalability -- the ability of a computer to be scaled over a large physical space to use in situ resources.
© Gordon Bell 15
[Chart] Linpack rate in Gflops vs. matrix order: log-log plot, matrix order 1 to 100,000, rate 10 to 100 Gflops. Curves: SX-3 (4 processors) and CM5 (1K nodes).
© Gordon Bell 16
[Chart] Linpack solution time vs. matrix order: log-log plot, matrix order 1 to 100,000, solution time 10 to 1000. Curves: SX-3 (4 processors) and CM5 (1K nodes).
© Gordon Bell 17
[Chart] GB's estimate of parallelism in engineering & scientific applications. Axes: log(# of apps) vs. granularity & degree of coupling (comp./comm.), spanning dusty decks for supers to new or scaled-up apps.
Estimated shares: scalar 60%; vector 15%; mP (<8) vector 5%; >>// 5%; embarrassingly or perfectly parallel 15%.
Machine coverage along the axis: supers; WSs; massive mCs & WSs; ----scalable multiprocessors-----.
© Gordon Bell 18
[Chart] MPPs are only for unique, very large scale, data-parallel apps.
Price ($M, y-axis, log scale .01 to 100) vs. application characterization (x-axis): scalar | vector | vector mP | data // | emb. // | gp work | viz | apps.
Recoverable placements: supers (s), multiprocessors (mP), and workstations (WS) appear across the application categories; mono-use >>// machines appear only under the data-parallel and embarrassingly parallel categories.
© Gordon Bell 19
Applicability of various technical computer alternatives

Domain        PC|WS   Multi servr   SC & Mfrm   >>//   WS Clusters
scalar        1       1             2           na     1*
vector        2*      2             1           3      2
vect. mP      na      2             1           3      na
data //       na      1             2           1      1
ep & inf. //  1       2             3           2      1
gp wrkld      3       1             1           na     2
vizualiz'n    1       na            na          na     1
apps          1       1             1           na     from WS

*Current micros are weak but improving rapidly, such that subsequent >>//s that use them will have no advantage for node vectorization
© Gordon Bell 20
Performance using distributed computers depends on problem & machine granularity
Berkeley's LogP model characterizes granularity & needs to be understood, measured, and used
Its parameters are given in terms of processing ops:
L = latency -- delay time to communicate between nodes
o = overhead -- processor time lost transmitting messages
g = gap -- 1 / message-passing rate (bandwidth); minimum time between messages
P = number of processors
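A sketch of how these parameters bound granularity; the per-step cost formula is a simplification of the published LogP analysis, and the machine numbers are assumptions for illustration, not measurements.

```python
from dataclasses import dataclass

@dataclass
class LogP:
    """LogP parameters, all expressed in processor operations."""
    L: float  # latency: network transit time per message
    o: float  # overhead: processor time consumed sending or receiving
    g: float  # gap: minimum interval between messages (1 / msg rate)
    P: int    # number of processors

def step_efficiency(m: LogP, grain_ops: float, msgs: int) -> float:
    """Fraction of time spent on useful work in one compute/communicate
    step: each message costs the larger of 2*o (send + receive) or g,
    and one L is paid waiting for the last reply. A simplification of
    the formal LogP analysis, for illustration only."""
    comm = msgs * max(2 * m.o, m.g) + m.L
    return grain_ops / (grain_ops + comm)

# Hypothetical mid-'90s numbers (assumptions, not measurements):
mpp = LogP(L=500, o=200, g=300, P=1024)            # tightly coupled MPP
lan = LogP(L=100_000, o=30_000, g=50_000, P=32)    # workstations on a LAN

for name, m in (("MPP", mpp), ("LAN cluster", lan)):
    print(f"{name}: efficiency {step_efficiency(m, grain_ops=100_000, msgs=10):.2f}")
# MPP ~0.96; LAN cluster ~0.12 -- the same grain is "medium" on one
# machine and hopelessly fine on the other.
```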
© Gordon Bell 21
[Nomograph] Granularity nomograph. Three aligned scales: processor speed (10M, 100M, 1G ops/sec); grain length (100, 1000, 10K, 100K, 1M, 10M ops -- fine, medium, coarse, very coarse); grain communication latency & synchronization overhead (100 ns, 1 µs, 10 µs, 100 µs, 1 ms (LAN), 10 ms, 100 ms (WAN), 1 sec).
Machines placed: C90, MPPs, WANs & LANs, 1993 & 1995 micros, 1993 supers, Ultra.
© Gordon Bell 22
[Nomograph] Granularity nomograph with machines placed: Cray T3D, C90, VPP 500, VP, 1993 & 1995 micros, 1993 supers, Ultra. Scales as above: processor speed (10M to 1G ops/sec); grain length (100 to 10M ops -- fine, medium, coarse, very coarse); grain communication latency & synchronization overhead (100 ns (supers' memory) through 1 ms (LAN) and 100 ms (WAN) to 1 sec).
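The nomograph's alignment rule reduces to one line of arithmetic: break-even grain length ~ processor speed x communication latency, the grain at which half the time goes to communication. A sketch with assumed, round numbers for each machine class:

```python
# Break-even grain length ~ processor speed x communication latency:
# at that grain, half the time is communication (50% efficiency);
# efficiency rises as grains get longer. Speeds/latencies are assumed,
# round numbers for the machine classes on the nomograph.
cases = [
    ("super, local memory",  1e9, 100e-9),   # ~100 ns
    ("MPP interconnect",     1e8, 10e-6),    # ~10 us
    ("LAN of workstations",  1e8, 1e-3),     # ~1 ms
    ("WAN",                  1e8, 100e-3),   # ~100 ms
]
for name, ops_per_sec, latency_sec in cases:
    grain = ops_per_sec * latency_sec
    print(f"{name}: break-even grain ~ {grain:,.0f} ops")
# ~100 ops (fine) ... ~1,000 ... ~100,000 (coarse) ... ~10,000,000 ops
# (very coarse) -- the span of the nomograph's grain-length axis.
```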
© Gordon Bell 23
Economics of Packaged Software

Platform            Cost      Leverage   # copies
MPP                 >100K     1          1-10 copies
Minis, mainframes   10-100K   10-100     1000s of copies
Workstation         1-100K    1-10K      1-100K copies
PC                  25-500    50K-1M     1-10M copies

(Minis/mainframes also covers evolving high-performance multiprocessor servers.)
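The table's point in one loop: amortizing an assumed (illustrative, not from the slide) $2M application-development cost over each platform's installed base.

```python
# Per-copy burden of one ported application across each platform's
# installed base, using the "# copies" column above. The $2M
# development cost is an assumed, illustrative figure.
app_dev_cost = 2_000_000
installed_base = {"MPP": 10, "Mini/mainframe": 1_000,
                  "Workstation": 100_000, "PC": 10_000_000}
for platform, copies in installed_base.items():
    print(f"{platform}: ${app_dev_cost / copies:,.2f} per copy")
# MPP: $200,000.00 per copy; PC: $0.20 per copy -- why packaged
# software follows pervasive platforms.
```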
© Gordon Bell 24
Chuck Seitz's comments on multicomputers
"I believe that the commercial, medium-grained multicomputers aimed at ultra-supercomputer performance have adopted a relatively unprofitable scaling track, and are doomed to extinction. ... they may, as Gordon Bell believes, be displaced over the next several years by shared memory multiprocessors. ... For loosely coupled computations at which they excel, ultra-super multicomputers will, in any case, be more economically implemented as networks of high-performance workstations connected by high-bandwidth, local area networks ..."
© Gordon Bell 25
Convergence to a single architecture with a single address space that uses a distributed, shared memory
limited (<20) scalability multiprocessors >> scalable multiprocessors
workstations with 1-4 processors >> workstation clusters & scalable multiprocessors
workstation clusters >> scalable multiprocessors
State Computers built as message-passing multicomputers >> scalable multiprocessors
© Gordon Bell 26
[Diagram] Convergence to one architecture: evolution of scalable multiprocessors (smP), scalable multicomputers (smC), & workstations to shared memory computers.
• Limited scalability mP (uniform memory access): bus-based multis (mini, W/S) -- DEC, Encore, Sequent, Stratus, SGI, SUN, etc.; ring-based multis; mP mainframes & supers -- Convex, Cray, Fujitsu, IBM, Hitachi, NEC.
• Experimental, scalable multicomputers (smC, non-uniform memory access): 1st smC hypercubes & Transputer grids -- Cosmic Cube, iPSC 1, NCUBE, Transputer-based; fine-grain smC (DSM??) -- Mosaic-C, J-machine; medium-coarse grain smC -- Fujitsu, Intel, Meiko, NCUBE, TMC (1985-1994); coarse-grain smC clusters; very coarse grain smC: networked workstations (Apollo, SUN, HP, etc.) built from WS micros & a fast, high-bandwidth switch with comm. protocols, e.g. ATM; WS clusters via special switches (1994) & ATM (1995); next-gen smC: DSM => smP (1995?).
• Scalable mP (smP, non-uniform memory access): 1st smP, no cache -- Cm* ('75), Butterfly ('85), Cedar ('88); smP DSM, some cache -- DASH, Convex, Cray T3D, SCI; smP all-cache architecture -- KSR Allcache; next-gen smP research, e.g., DDM, DASH+ (1995?). Natural evolution: cache for locality.
Note: only two structures remain -- 1. shared memory mP with uniform & non-uniform memory access; and 2. networked, shared-nothing workstations. mPs continue to be the main line.
© Gordon Bell 27
Re-engineering HPCS
Genetic engineering of computers has not produced a healthy strain that lives more than one 3-year computer generation. Hence no app base can form.
• No inter-generational MPPs exist with compatible networks & nodes.
• All parts of an architecture must scale from generation to generation!
• An architecture must be designed for at least three 3-year generations!
High price to support a DARPA U. to learn computer design -- the market is only $200 million and the R&D is billions -- competition works far better
The inevitable movement to standard networks and nodes need not be accelerated; these best evolve by a normal market mechanism, driven by users
Dual use of networks & nodes is the path to widescale parallelism, not weird computers
Networking is free via ATM. Nodes are free via in situ workstations. Apps follow pervasive computing environments.
Applicability was small and is getting smaller very fast, with many experienced computer companies entering the market with fine products (e.g., Convex/HP, Cray, DEC, IBM, SGI & SUN) that are leveraging their R&D, apps, apps, & apps
Japan has a strong supercomputer industry. The more we jeopardize ours by mandating use of weird machines that take away from use, the weaker it becomes.
MPP won; mainstream vendors have adopted multiple CMOS. Stop funding! Environments & apps are needed, but are unlikely because the market is small
© Gordon Bell 28
Recommendations to HPCS
Goal: By 2000, massive parallelism must exist as a by-product that leverages a widescale national network & workstation/multi HW/SW nodes
Dual use, not duel use, of products and technology -- the principle of "elegance": one part serves more than one function; network companies supply networks; node suppliers use ordinary workstations/servers with existing apps; this will leverage $30 billion x 10^6 R&D
Fund high speed, low latency networks for a ubiquitous service as the base of all forms of interconnection, from WANs to supercomputers (in addition, some special networks will exist for small-grain problems)
Observe heuristics in future federal program funding scenarios ... eliminate direct or indirect product development and mono-use computers. Fund Challenges, who in turn fund purchases, not product development
Funding or purchase of apps porting must be driven by Challenges, but build on binary-compatible workstation/server apps to leverage nodes; be cross-platform based to benefit multiple vendors & have cross-platform use
Review effectiveness of State Computers, e.g., need, economics, efficacy. Each committee member might visit 2-5 sites using a >>// computer
Review // program environments & their efficacy to produce & support apps
Eliminate all forms of State Computers & recommend a balanced HPCS program: nodes & networks, based on industrial infrastructure; stop funding the development of mono computers, including the 10 Tflop; it must be acceptable & encouraged to buy any computer for any contract
© Gordon Bell 29
Gratis advice for HPCC* & BS*
D. Bailey warns that scientists have almost lost credibility ...
Focus on a Gigabit NREN with low-overhead connections that will enable multicomputers as a by-product
Provide many small, scalable computers vs. large, centralized ones
Encourage (revert to) & support not-so-grand challenges
Grand Challenges (GCs) need explicit goals & plans -- disciplines fund & manage (demand side) ... HPCC will not
Fund balanced machines/efforts; stop starting Viet Nams
Drop the funding & directed purchase of state computers
Revert to university research -> company & product development
Review the HPCC & GCs programs' output ...
*High Performance Cash Conscriptor; Big Spenders
© Gordon Bell 30
Disclaimer
This talk may appear inflammatory... i.e. the speaker may have appeared "to flame".
It is not the speaker's intent to make ad hominem attacks on people, organizations, countries, or computers ... it just may appear that way.
© Gordon Bell 31
Scalability: The Platform of HPCS
The law of scalability
Three kinds: machine, problem x machine, & generation (t)
How do flops, memory size, efficiency & time vary with problem size?
What's the nature of problems & work for the computers?
What about the mapping of problems onto the machines?