Scaleable WindowsNT?

• Jim Gray, Microsoft Research, Gray@Microsoft.com, http://research.Microsoft.com/~Gray

Outline

• What is Scalability?

• Why does Microsoft care about ScaleUp?

• Current ScaleUp Status?

• NT5 & SQL7 & Exchange

Scale Up and Scale Out

(Diagram: Personal System → Departmental Server → SMP Super Server, plus a Cluster of PCs.)

• Grow up with SMP: 4xP6 is now standard
• Grow out with cluster: cluster has inexpensive parts

Billions Of Clients

• Every device will be “intelligent”

• Doors, rooms, cars…

• Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers

(Diagram: mobile clients and fixed clients connected to servers and superservers.)

• All clients networked to servers
» May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
» Shared data
» Control
» Coordination
» Communication

Thesis: Many Little Beat Few Big

• Mainframe ($1 million), Mini ($100 K), Micro ($10 K), Nano
• Disk form factors: 14", 9", 5.25", 3.5", 2.5", 1.8"
• Pico processor (1 mm³)
» 1 M SPECmarks, 1 TFLOP
» 10⁶ clocks to bulk RAM
» Event horizon on chip
» VM reincarnated
» Multiprogram cache, on-chip SMP
• Memory hierarchy: 10 pico-second RAM, 10 nano-second RAM, 10 microsecond RAM, 10 millisecond disc, 10 second tape archive
• Capacities: 1 MB, 100 MB, 10 GB, 1 TB, 100 TB
• Smoking, hairy golf ball
» How to connect the many little parts?
» How to program the many little parts?
» Fault tolerance?

Outline

• What is Scalability?

• Why does Microsoft care about ScaleUp?

• Current ScaleUp Status?

• NT5 & SQL7 & Exchange

Scalability

• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes

“Commercial” NT Clusters

• 16-node Tandem Cluster
» 64 cpus
» 2 TB of disk
» Decision support
• 45-node Compaq Cluster
» 140 cpus
» 14 GB DRAM
» 4 TB RAID disk
» OLTP (DebitCredit)
» 1 B tpd (14 k tps)

Tandem Oracle/NT

• 27,383 tpmC

• 71.50 $/tpmC

• 4 x 6 cpus

• 384 disks=2.7 TB
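TPC-C price/performance ($/tpmC) is the price of the priced configuration divided by its throughput. Multiplying the two headline numbers back together gives the implied system price. The sketch below also shows the per-cpu and per-disk figures those numbers imply, assuming decimal terabytes.

```python
tpmC = 27_383              # throughput from the slide
dollars_per_tpmC = 71.50   # price/performance from the slide
cpus = 4 * 6               # 4 nodes x 6 cpus
disks, total_tb = 384, 2.7

system_price = tpmC * dollars_per_tpmC   # implied total price of the priced configuration
print(f"implied price: ${system_price:,.2f}")
print(f"tpmC per cpu: {tpmC / cpus:,.0f}")
print(f"GB per disk: {total_tb * 1000 / disks:.1f}")   # decimal GB/TB assumed
```

The result is roughly a $2M system, about 1,100 tpmC per cpu, and about 7 GB per disk.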

24 cpus, 384 disks (= 2.7 TB)

Billion Transactions per Day Project

• Built a 45-node Windows NT cluster (with help from Intel & Compaq) with nearly 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DebitCredit transaction
• Each of the 20 SQL Server nodes has 1/20th of the DB
• Each of them does 1/20th of the work
• 15% of the transactions are “distributed”
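The per-node load these bullets imply can be worked out directly. This is a minimal sketch: the `home_node` helper is hypothetical and assumes accounts are range-partitioned into 20 contiguous slices. The slides do not say how the database was actually split.

```python
SECONDS_PER_DAY = 86_400
TPD = 1_000_000_000            # target: one billion transactions per day
SQL_NODES = 20                 # each SQL Server node holds 1/20th of the DB
DISTRIBUTED_FRACTION = 0.15    # 15% of transactions span two nodes (coordinated by DTC)


def home_node(account_id: int, num_accounts: int, nodes: int = SQL_NODES) -> int:
    """Hypothetical range partitioning: node i owns a contiguous 1/nodes slice of accounts."""
    return account_id * nodes // num_accounts


per_node_tpd = TPD / SQL_NODES
per_node_tps = per_node_tpd / SECONDS_PER_DAY
distributed_per_node = per_node_tpd * DISTRIBUTED_FRACTION

print(f"{per_node_tpd:,.0f} tpd per node, about {per_node_tps:,.0f} tps")
print(f"{distributed_per_node:,.0f} distributed tpd per node")
print(home_node(0, 1_000_000), home_node(999_999, 1_000_000))   # 0 19
```

Each node then handles about 50 million transactions a day (roughly 580 tps on average), of which about 7.5 million need a second node.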

Billion Transactions Per Day Hardware

Type | Nodes | CPUs | DRAM | Ctlrs | Disks | RAID space
Workflow MTS (Compaq Proliant 2500) | 20 | 20x2 | 20x128 MB | 20x1 | 20x1 | 20x2 GB
SQL Server (Compaq Proliant 5000) | 20 | 20x4 | 20x512 MB | 20x4 | 20x(36x4.2 GB + 7x9.1 GB) | 20x130 GB
Distributed Transaction Coordinator (Compaq Proliant 5000) | 5 | 5x4 | 5x256 MB | 5x1 | 5x3 | 5x8 GB
TOTAL | 45 | 140 | 13 GB | 105 | 895 | 3 TB

• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps switched Ethernet
• 140 cpus, 13 GB, 3 TB
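The TOTAL row can be checked by multiplying each per-node figure by its node count. This sketch assumes the DRAM column is MB per node and the RAID column is GB per node. With those units, the DRAM total is 13.75 GB, which the slide rounds to 13 GB (the cluster summary earlier says 14 GB).

```python
from typing import NamedTuple


class NodeType(NamedTuple):
    name: str
    nodes: int
    cpus: int       # per node
    dram_mb: int    # per node (MB assumed)
    ctlrs: int      # per node
    disks: int      # per node
    raid_gb: int    # per node (GB assumed)


CLUSTER = [
    NodeType("Workflow MTS", 20, 2, 128, 1, 1, 2),
    NodeType("SQL Server", 20, 4, 512, 4, 36 + 7, 130),   # 36x4.2 GB + 7x9.1 GB drives
    NodeType("DTC", 5, 4, 256, 1, 3, 8),
]


def total(field: str) -> int:
    """Cluster-wide total of a per-node field."""
    return sum(t.nodes * getattr(t, field) for t in CLUSTER)


print(sum(t.nodes for t in CLUSTER), total("cpus"), total("dram_mb") / 1024,
      total("ctlrs"), total("disks"), total("raid_gb"))
```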

How Much Is 1 Billion Tpd?

(Charts: millions of transactions per day (Mtpd), on log and linear scales, for 1 Btpd, Visa, ATT, BofA, and NYSE.)

• 1 billion tpd = 11,574 tps ≈ 700,000 tpm (transactions/minute)
• ATT
» 185 million calls per peak day (worldwide)
• Visa ~20 million tpd
» 400 million customers
» 250K ATMs worldwide
» 7 billion transactions (card + cheque) in 1994
• New York Stock Exchange
» 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide airline reservations: 250 Mtpd

Infinite, Ubiquitous Scaling: Redefining the Rules

All shipping products!

Rate | Per sec | Per min | Per day
10K TPC | 166 | 10,000 | 14,400,000
1 BTPD | 11,574 | 694,444 | 1,000,000,000
1.4 BTPD | 16,204 | 972,222 | 1,400,000,000

(Diagram: a row of SQL Server nodes under COM / ActiveX, MTS, and IIS.)
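Every row of the rate table comes from the same two constants, 86,400 seconds and 1,440 minutes per day. The sketch below regenerates the rows. The slide drops the fractions loosely, for example 166.7 is shown as 166.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440


def from_per_day(tpd: float) -> tuple:
    """(per second, per minute, per day) for a daily transaction count."""
    return tpd / SECONDS_PER_DAY, tpd / MINUTES_PER_DAY, tpd


def from_per_minute(tpm: float) -> tuple:
    """(per second, per minute, per day) for a per-minute rate such as tpmC."""
    return tpm / 60, tpm, tpm * MINUTES_PER_DAY


rows = {
    "10K TPC": from_per_minute(10_000),
    "1 BTPD": from_per_day(1_000_000_000),
    "1.4 BTPD": from_per_day(1_400_000_000),
}
for name, (sec, minute, day) in rows.items():
    print(f"{name:9s} {sec:10,.1f} {minute:12,.1f} {day:16,.0f}")
```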

Microsoft.com: ~150x4 nodes

(Diagram of the Microsoft.com server farm. Recoverable labels:
• Locations: Building 11 / MOSWest, European Data Center, Japan Data Center
• Server groups: www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, activex.microsoft.com, cdm.microsoft.com, FTP.microsoft.com, msid.msn.com, register.msn.com, FTP and HTTP download servers, live SQL servers, SQL consolidators, SQL reporting, internal WWW, IDC and DMZ staging servers
• Interconnect: FDDI rings MIS1 to MIS4, primary and secondary Gigaswitch, switched Ethernet, routers, MOSWest admin LAN, SQLNet feeder LAN, download replication
• Internet connections: 13 DS3 (45 Mb/s each), 2 OC3 (100 Mb/s each), 2 Ethernet (100 Mb/s each)
• Typical node configurations: 4xP6 (some 4xP5), 256 MB to 1 GB RAM, 12 to 160 GB disk; average cost $25K to $83K)

NCSA Super Cluster

• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A supercomputer for $3M
• Classic Fortran/MPI programming
• DCOM programming model

http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

TPC-C Improved Fast (250%/year!)

(Charts: $/tpmC vs. time ($10 to $1,000, log scale) and tpmC vs. time (100 to 100,000, log scale), Jan-93 to Jul-98, each annotated "250 %/year improvement!")

• 40% hardware, 100% software, 100% PC technology

Windows NT Versus UNIX: tpmC vs Time

(Chart: tpmC, 0 to 35,000, from Jan-95 to Jan-97 for Unix and NT systems.)

Economy Of Scale

(Chart: transactions per k$ (tpmC/k$, 0 to 25) vs. tpmC (0 to 40,000) by vendor: DB2/Unix, Sybase/Unix, Informix/Unix, Microsoft/NT, Oracle/Unix.)

Microsoft TerraServer: Scaleup to Big Databases

• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M sq km from USGS (1 meter resolution)
» 1 M sq km from Russian Space Agency (2 m)
• On the web (world’s largest atlas)
• Sell images with Commerce Server

Microsoft TerraServer Background

• Earth’s surface is about 500 tera square meters (Tm², i.e. 10¹² m²)
» USA is 10 Tm²
• 100 Tm² of land between 70ºN and 70ºS
• We have pictures of 6% of it
» 3 Tm² from USGS
» 2 Tm² from Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas: globe, gazetteer
» StreetsPlus™ in the USA
• Image sizes: 40x60 km² jump image, 20x30 km² browse image, 10x15 km² thumbnail, 1.8x1.2 km² tile
• Someday
» multi-spectral image
» of everywhere
» once a day / hour
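The "slice into 10 KB chunks, store chunks in DB" step can be sketched as follows. Python's built-in `sqlite3` stands in for SQL Server here. The `tiles` table, its (level, image_id, seq) key, and the fixed-size byte slicing are hypothetical and are not TerraServer's actual schema. The sketch also checks the arithmetic: 1.5 TB in 10 KB chunks is about 150 million rows.

```python
import sqlite3

CHUNK_BYTES = 10_000        # "10 KB chunks" (decimal kilobytes assumed)
COMPRESSED_BYTES = 1.5e12   # "Compress 5:1 (JPEG) to 1.5 TB"


def chunk_count(total_bytes: float, chunk_bytes: int) -> int:
    """Fixed-size chunks needed to hold total_bytes (ceiling division)."""
    return int(-(-total_bytes // chunk_bytes))


def store_image(conn: sqlite3.Connection, level: str, image_id: int,
                data: bytes, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Slice one compressed image into chunks and insert one row per chunk."""
    rows = [(level, image_id, seq, data[off:off + chunk_bytes])
            for seq, off in enumerate(range(0, len(data), chunk_bytes))]
    conn.executemany("INSERT INTO tiles VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def load_image(conn: sqlite3.Connection, level: str, image_id: int) -> bytes:
    """Reassemble an image by reading its chunks back in sequence order."""
    cur = conn.execute(
        "SELECT data FROM tiles WHERE level = ? AND image_id = ? ORDER BY seq",
        (level, image_id))
    return b"".join(row[0] for row in cur)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tiles (level TEXT, image_id INTEGER, seq INTEGER, "
             "data BLOB, PRIMARY KEY (level, image_id, seq))")
image = bytes(range(256)) * 100            # 25,600 bytes of stand-in image data
n = store_image(conn, "tile", 1, image)
print(n, load_image(conn, "tile", 1) == image)   # 3 True
print(f"{chunk_count(COMPRESSED_BYTES, CHUNK_BYTES):,} chunks for 1.5 TB")
```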

Demo

• Navigate by coverage map to White House
• Download image
• Buy imagery from USGS
• Navigate by name to Venice
• Buy SPIN2 image & Kodak photo
• Pop out to Expedia street map of Venice
• Mention that DB will double in next 18 months (2x USGS, 2x SPIN2)

The Microsoft TerraServer Hardware

• Compaq AlphaServer 8400

• 8 x 400 MHz Alpha cpus

• 10 GB DRAM

• 324 9.1 GB StorageWorks disks
» 3 TB raw, 2.4 TB of RAID5

• STK 9710 tape robot (4 TB)

• WindowsNT 4 EE, SQL Server 7.0

TerraServer Web Site Software

(Diagram: a web client (browser with HTML and a Java viewer) reaches the TerraServer web site over the Internet. Site components: Internet Information Server 4.0, Active Server Pages, MTS, an image server, Microsoft Automap ActiveX Server, Microsoft Site Server EE, and the TerraServer DB with TerraServer stored procedures on SQL Server 7. Image provider site(s) run Internet Information Server 4.0 with an image delivery application.)

Image Delivery and Load

Incremental load of 4 more TB in next 18 months

(Diagram labels: DLT tape "tar" input; \Drop’N’ and \Images directories; "Do job" and "Wait 4 Load" states; LoadMgr DB; LoadMgr; ImgCutter; NTBackup; 100 Mbit Ethernet switch; AlphaServer 8400 with an Enterprise Storage Array of 3 x 108 9.1 GB drives; two AlphaServer 4100s, one with an ESA of 60 4.3 GB drives; STK DLT tape library.)

LoadMgr job steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place
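The numbered job steps suggest a load manager that walks each job through the steps in order. A minimal sketch follows, assuming steps run strictly in numeric order and that progress is recorded per job so a failed load can resume where it stopped. `LoadJob`, `run_job`, and the `fail_at` crash simulation are hypothetical. Only the step numbers and names come from the slide.

```python
from dataclasses import dataclass, field

# Step numbers and names as listed on the slide; steps run in numeric order.
LOAD_STEPS = {10: "ImgCutter", 20: "Partition", 30: "ThumbImg", 40: "BrowseImg",
              45: "JumpImg", 50: "TileImg", 55: "Meta Data", 60: "Tile Meta",
              70: "Img Meta", 80: "Update Place"}


@dataclass
class LoadJob:
    """Hypothetical job record a load manager might keep for one tape or image batch."""
    name: str
    completed: list = field(default_factory=list)   # step numbers already finished

    def next_step(self):
        """Lowest-numbered step not yet completed, or None when the job is done."""
        pending = [s for s in sorted(LOAD_STEPS) if s not in self.completed]
        return pending[0] if pending else None


def run_job(job, run_step, fail_at=None):
    """Run the remaining steps in order, recording each as it finishes.

    fail_at simulates a crash before that step; rerunning the job resumes there.
    """
    while (step := job.next_step()) is not None:
        if step == fail_at:
            return False
        run_step(job.name, LOAD_STEPS[step])
        job.completed.append(step)
    return True


done = []
job = LoadJob("tape-0001")
print(run_job(job, lambda name, step: done.append(step), fail_at=45), job.next_step())
print(run_job(job, lambda name, step: done.append(step)), len(done))
```

The first run stops before JumpImg (step 45). The second run picks up at step 45 and finishes the remaining six steps.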

TerraServer: A Real “World” Example

Traffic over 71 days (m = millions):

Metric | Total | Average per day | Peak day
Hits | 728.45m | 10.26m | 29.27m
Queries | 565.09m | 7.96m | 17.76m
Images | 212.02m | 2.99m | 9.23m
Page views | 376.29m | 5.30m | 9.20m

• Largest DB on the Web
• 1.3 TB
• 99.95% uptime since July 1
• No downtime, period, in August
• 70% of downtime was for SQL software upgrades
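The table's averages match the totals divided by 71 days. A 99.95% uptime figure also implies a concrete downtime budget. Both checks are sketched below; the 365-day year and 30-day month used for the downtime budget are assumptions.

```python
DAYS = 71

# metric -> (total over the period, slide's daily average), in millions
traffic = {
    "Hits": (728.45, 10.26),
    "Queries": (565.09, 7.96),
    "Images": (212.02, 2.99),
    "Page views": (376.29, 5.30),
}

for name, (total, slide_avg) in traffic.items():
    print(f"{name:10s} {total / DAYS:6.2f}m/day (slide: {slide_avg:.2f}m)")


def downtime_hours(availability: float, period_hours: float) -> float:
    """Hours of downtime allowed by an availability fraction over a period."""
    return (1 - availability) * period_hours


print(f"99.95% allows {downtime_hours(0.9995, 24 * 365):.2f} h/year, "
      f"{downtime_hours(0.9995, 24 * 30) * 60:.0f} min per 30-day month")
```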

NT Clusters (Wolfpack)

• Scale DOWN to PDA: WindowsCE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
» Naming
» Protection/security
» Management/load balance
• Fault tolerance
» “Wolfpack”
• Hot-pluggable hardware & software

Symmetric Virtual Server Failover Example

(Diagram: a browser reaches two virtual servers, a Web site and a Database, backed by shared web site files and database files. Server 1 and Server 2 each host one virtual server; when one server fails, the other hosts both.)
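The point of the symmetric setup is that clients name the virtual server rather than a physical machine, so failover only changes which node hosts it. Below is a toy model of that idea. The `Cluster` class is hypothetical and is not the Wolfpack (Microsoft Cluster Server) API.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model: each virtual server (e.g. "Web site", "Database") has one owner node."""
    nodes: set      # physical servers that are up
    owner: dict     # virtual server name -> physical server currently hosting it

    def fail(self, node: str) -> None:
        """Take node down and move every virtual server it hosted to a survivor."""
        self.nodes.discard(node)
        if not self.nodes:
            raise RuntimeError("no surviving node to fail over to")
        survivor = sorted(self.nodes)[0]
        for vs, host in list(self.owner.items()):
            if host == node:
                self.owner[vs] = survivor

    def route(self, virtual_server: str) -> str:
        """Clients name the virtual server; the cluster maps it to a physical node."""
        return self.owner[virtual_server]


cluster = Cluster({"Server 1", "Server 2"},
                  {"Web site": "Server 1", "Database": "Server 2"})
print(cluster.route("Database"))   # Server 2
cluster.fail("Server 2")
print(cluster.route("Database"))   # Server 1: same name, new host
```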

Windows NT 5 (scalability features)

• Better SMP support
• Clusters:
» 16x packs (fault-tolerant clusters)
» 100x mobs: arrays for manageability
» SAN/VIA support
• 64-bit addressing for data
» Apps like SQL, Oracle will use it for data
» 64-bit API to NT comes later (in lab now)
• Remote management (scripting and DCOM)
• Active Directory
• Veritas volume manager
• Many 3rd-party HSMs
• Batch support

Microsoft SQL Server 7.0

• Fixes the famous performance bugs
» Dynamic record locking
» Online backup, quick recovery…
• 64-bit addressing for the buffer pool
• SMP parallelism and better SMP support
• Built-in OLAP (cubes and MOLAP)

• Scale down to Win9x

• Improved management interfaces

• Data transform services (for warehouses)

Outline

• What is Scalability?

• Why does Microsoft care about ScaleUp?

• Current ScaleUp Status?

• NT5 & SQL7

end

Other slides would be interesting, but...

Interesting “Other Slides”: no time for them, but...

• How much information is there?

• IO bandwidth in the Intel world

• Intelligent disks

• SAN/VIA

• NT Cluster Sort

Some Terabyte Databases

(Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta)

• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
» 15 PB by 2007
• Federal clearing house: images of checks
» 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
» 10 Exabytes (???!!)

Info Capture

• You can record everything you see or hear or read.
• What would you do with it?
• How would you organize & analyze it?

(Chart: a Kilo-to-Yotta scale placing a letter, a novel, a movie, the Library of Congress (text), LoC (image), all disks, and all tapes.)

• Video: 8 PB per lifetime (10 GB per hour)
• Audio: 30 TB per lifetime (10 KB per second)
• Read or write: 8 GB (words)

See: http://www.lesk.com/mlesk/ksg97/ksg.html
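The "per lifetime" figures can be checked by asking how long each recording rate takes to fill the stated total. This sketch assumes decimal units and non-stop recording. Both figures come out near a 90-year lifetime.

```python
HOURS_PER_YEAR = 365.25 * 24


def implied_years(total_bytes: float, bytes_per_hour: float) -> float:
    """Years of non-stop recording needed to accumulate total_bytes at a given rate."""
    return total_bytes / bytes_per_hour / HOURS_PER_YEAR


video_years = implied_years(8e15, 10e9)            # 8 PB at 10 GB per hour
audio_years = implied_years(30e12, 10e3 * 3600)    # 30 TB at 10 KB per second
print(f"video: {video_years:.0f} years, audio: {audio_years:.0f} years")   # 91, 95
```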

Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html

• Soon everything can be recorded and kept

• Most data will never be seen by humans

• Precious resource: human attention
» Auto-summarization and auto-search will be a key enabling technology.

PAP (Peak Advertised Performance) vs RAP (Real Application Performance)

• Goal: RAP = PAP / 2 (the half-power point)

(Diagram: data flows from disk over SCSI and PCI across the system bus into file system buffers and application data. Advertised: system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk 10-15 MBps; measured at every stage: 7.2 MB/s.)

PAP vs RAP

• Reads are easy, writes are hard
• Async write can match WCE (write cache enabled).

(Diagram, advertised vs. measured: system bus 422 vs 142 MBps, PCI 133 vs 72 MBps, SCSI 40 vs 31 MBps, disks 10-15 vs 9 MBps, feeding the file system and application data.)

Bottleneck Analysis

• NTFS read/write with 12 disks, 4 SCSI adapters, 2 PCI buses
(not measured; we had only one PCI bus available, the 2nd one was “internal”)
» ~120 MBps unbuffered read
» ~80 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write

(Diagram: memory read/write ~150 MBps; two PCI buses at ~70 MBps each; four SCSI adapters at ~30 MBps each; 120 MBps aggregate.)
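The ~120 MBps unbuffered-read estimate follows from a simple bottleneck rule: a pipeline runs no faster than its slowest stage's aggregate capacity. The sketch below applies that rule to the component rates on this slide. The 12 disks are left out because the slide gives no per-disk rate.

```python
def bottleneck(stages: dict) -> tuple:
    """A pipeline runs no faster than its slowest stage; return (stage, MBps)."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Aggregate capacity of each stage: per-component rate from the slide x component count.
stages = {
    "memory": 150,             # ~150 MBps memory read/write
    "PCI buses": 2 * 70,       # 2 PCI buses at ~70 MBps each
    "SCSI adapters": 4 * 30,   # 4 SCSI adapters at ~30 MBps each
}
print(bottleneck(stages))      # ('SCSI adapters', 120)
```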

Year 2002 Disks

• Big disk (10 $/GB)
» 3”
» 100 GB
» 150 kaps (kilobyte accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3”
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)
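The two projected disks trade capacity against access rate. The sketch below takes the slide's figures, treats kaps as kilobyte-sized random accesses per second, and assumes decimal GB/MB. The big disk is half the price per GB, while the small disk is roughly 8x cheaper per access and scans itself much faster. The `Disk` class is only an illustration.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    dollars_per_gb: float
    capacity_gb: float
    kaps: float        # kilobyte accesses per second
    seq_mbps: float    # sequential transfer rate, MB per second

    def price(self) -> float:
        return self.dollars_per_gb * self.capacity_gb

    def dollars_per_kaps(self) -> float:
        return self.price() / self.kaps

    def scan_seconds(self) -> float:
        """Time to read the whole disk sequentially (decimal GB/MB assumed)."""
        return self.capacity_gb * 1000 / self.seq_mbps


big = Disk("big", 10, 100, 150, 20)
small = Disk("small", 20, 4, 100, 10)
for d in (big, small):
    print(f"{d.name}: ${d.price():.0f}, ${d.dollars_per_kaps():.2f} per kaps, "
          f"{d.scan_seconds():.0f} s to scan")
```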

How Do They Talk to Each Other?

• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
» CORBA? DCOM? IIOP? RMI?
» One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.

(Diagram: two protocol stacks, each with Applications over RPC? over streams and datagrams, one over VIAL/VIPL, connected by the wire(s).)

SAN: Standard Interconnect

• Gbps Ethernet: 110 MBps
• PCI 32: 70 MBps
• UW SCSI: 40 MBps
• FW SCSI: 20 MBps
• SCSI: 5 MBps
• LAN faster than memory bus?
• 1 GBps links in lab.
• $300 port cost soon
• Port is computer
• RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?

PennySort

• Hardware
» 266 MHz Intel PPro
» 64 MB SDRAM (10 ns)
» Dual Fujitsu DMA 3.2 GB EIDE
• Software
» NT Workstation 4.3
» NT 5 sort
• Performance
» Sort 15 M 100-byte records (~1.5 GB)
» Disk to disk
» Elapsed time 820 sec
» CPU time = 404 sec

PennySort machine ($1,107) cost breakdown: board 13%, memory 8%, cabinet + assembly 7%, network, video, floppy 9%, software 6%, other 22%, cpu 32%, disk 25%
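PennySort asks how much can be sorted in the machine time one penny buys. This sketch assumes the Sort Benchmark rule that the system price is depreciated over three years. Under that rule, the $1,107 machine gets about 855 seconds per cent, and the 820-second run fits inside that budget.

```python
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 3600   # depreciation period assumed by the benchmark


def penny_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time that one cent buys when the price is spread over 3 years."""
    return SECONDS_IN_3_YEARS / (system_price_dollars * 100)


budget = penny_seconds(1107)                 # the $1,107 PennySort machine
records, record_bytes, elapsed_s = 15_000_000, 100, 820

print(f"penny budget: {budget:.0f} s; elapsed: {elapsed_s} s; "
      f"within budget: {elapsed_s <= budget}")
print(f"{records * record_bytes / elapsed_s / 1e6:.2f} MB/s, "
      f"{records / elapsed_s:,.0f} records/s")
```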

Cluster Sort Conceptual Model

• Multiple data sources
• Multiple data destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk

(Diagram: three nodes each start with mixed records AAABBBCCC; records are redistributed by key so that one node ends with all the A's, one with all the B's, and one with all the C's.)
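The redistribution in the diagram is a range-partitioned sort. Each source sends every record to the node that owns its key range, and each destination then sorts what it received. The sketch below uses in-memory lists in place of sockets. The splitter keys are hypothetical; a real cluster sort would typically sample the data to choose them.

```python
import bisect
from collections import defaultdict


def cluster_sort(sources, splitters):
    """Range-partitioned parallel sort.

    sources:   one list of keys per source node
    splitters: sorted keys; destination i receives keys <= splitters[i]
               (the last destination receives everything larger)
    """
    inbox = defaultdict(list)               # stands in for the socket to each destination
    for records in sources:
        for key in records:
            dest = bisect.bisect_left(splitters, key)
            inbox[dest].append(key)
    # Each destination sorts its own partition (in parallel on a real cluster).
    return [sorted(inbox[d]) for d in range(len(splitters) + 1)]


nodes = [list("CABBCA"), list("BACCAB"), list("ABCABC")]
result = cluster_sort(nodes, splitters=["A", "B"])
print(["".join(part) for part in result])   # ['AAAAAA', 'BBBBBB', 'CCCCCC']
```

Concatenating the destinations' outputs in order yields a globally sorted result, so no final merge step is needed.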

Cluster Install & Execute

• If this is to be used by others, it must be:
» Easy to install
» Easy to execute
• Installations of distributed systems take time and can be tedious. (AM2, GluGuard)
• Parallel remote execution is non-trivial. (GLUnix, LSF)

How do we keep this “simple” and “built-in” to NTClusterSort?

Remote Install

• Add a registry entry to each remote node:
» RegConnectRegistry()
» RegCreateKeyEx()

Cluster Execution

• Setup:
» MULTI_QI struct
» COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve remote object handle from MULTI_QI struct
• Invoke methods as usual

(Diagram: the client holds a HANDLE to each node's object and calls Sort() on all three.)
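The execution pattern is to get one handle per node, then invoke the same method on every handle and wait for all of them. A thread pool stands in for DCOM in the sketch below. `RemoteSorter` is a hypothetical local class playing the role of the object `CoCreateInstanceEx()` would return, and its `Sort` method is capitalized to match the slide.

```python
from concurrent.futures import ThreadPoolExecutor


class RemoteSorter:
    """Hypothetical stand-in for a per-node remote object; here it sorts a local list."""

    def __init__(self, node: str, data: list):
        self.node, self.data = node, data

    def Sort(self):
        self.data.sort()
        return self.node, len(self.data)


def run_everywhere(handles):
    """Invoke Sort() on every node's handle concurrently and wait for all to finish."""
    with ThreadPoolExecutor(max_workers=len(handles)) as pool:
        futures = [pool.submit(h.Sort) for h in handles]
        return [f.result() for f in futures]


handles = [RemoteSorter(f"node{i}", [3, 1, 2]) for i in range(3)]
print(run_everywhere(handles))   # [('node0', 3), ('node1', 3), ('node2', 3)]
```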