1
Scaleable WindowsNT?
• Jim Gray, Microsoft Research
  [email protected]
  http://research.Microsoft.com/~Gray
2
Outline
• What is Scalability?
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7 & Exchange
Scale Up and Scale Out
• SMP Super Server
• Departmental Server
• Personal System
• Grow Up with SMP
» 4xP6 is now standard
• Grow Out with Cluster
» Cluster has inexpensive parts
» Cluster of PCs
Billions Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers
• Clients
» Mobile clients
» Fixed clients
• Servers
» Server
» Super server
• All clients networked to servers
» May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
» Shared Data
» Control
» Coordination
» Communication
Thesis: Many little beat few big
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
[Figure: machine classes by price: Mainframe, Mini, Micro, Nano ($1 million, $100 K, $10 K); disk form factors: 14", 9", 5.25", 3.5", 2.5", 1.8"]
• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk ram
• Event-horizon on chip
• VM reincarnated
• Multiprogram cache, On-Chip SMP
[Figure: Pico Processor (1 mm³) storage hierarchy. Latency labels: 10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive. Capacity labels: 1 MB, 100 MB, 10 GB, 1 TB, 100 TB]
7
Outline
• What is Scalability
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7 & Exchange
8
Scalability
• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits
• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes
9
“Commercial” NT Clusters
• 16-node Tandem Cluster
» 64 cpus
» 2 TB of disk
» Decision support
• 45-node Compaq Cluster
» 140 cpus
» 14 GB DRAM
» 4 TB RAID disk
» OLTP (Debit Credit)
• 1 B tpd (14 k tps)
10
Tandem Oracle/NT
• 27,383 tpmC
• 71.50 $/tpmC
• 4 x 6 cpus
• 384 disks=2.7 TB
11
24 cpu, 384 disks (=2.7TB)
Billion Transactions per Day Project
• Built a 45-node Windows NT Cluster (with help from Intel & Compaq) > 900 disks
• All off-the-shelf parts
• Using SQL Server & DTC distributed transactions
• DebitCredit Transaction
• Each node has 1/20th of the DB
• Each node does 1/20th of the work
• 15% of the transactions are “distributed”
13
Type                                 Nodes                     CPUs   DRAM     Ctlrs   Disks                      RAID space
Workflow MTS                         20 Compaq Proliant 2500   20x2   20x128   20x1    20x1                       20x2 GB
SQL Server                           20 Compaq Proliant 5000   20x4   20x512   20x4    20x(36x4.2GB + 7x9.1GB)    20x130 GB
Distributed Transaction Coordinator  5 Compaq Proliant 5000    5x4    5x256    5x1     5x3                        5x8 GB
TOTAL                                45                        140    13 GB    105     895                        3 TB
Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 cpu, 13 GB, 3 TB.
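The TOTAL figures can be cross-checked from the per-tier counts. The following is only a sketch: it assumes the DRAM values are megabytes per node and the RAID space values are gigabytes per node (the slide gives no DRAM unit), and the dictionary layout and names are illustrative, not from the talk.

```python
# Per-node configuration of each tier, as read from the hardware table.
# Assumption: DRAM is in MB per node and RAID space is in GB per node.
TIERS = {
    "Workflow MTS": {"nodes": 20, "cpus": 2, "dram_mb": 128, "ctlrs": 1, "disks": 1, "raid_gb": 2},
    "SQL Server": {"nodes": 20, "cpus": 4, "dram_mb": 512, "ctlrs": 4, "disks": 36 + 7, "raid_gb": 130},
    "DTC": {"nodes": 5, "cpus": 4, "dram_mb": 256, "ctlrs": 1, "disks": 3, "raid_gb": 8},
}

def cluster_total(field):
    """Sum a per-node quantity over all nodes of all tiers."""
    return sum(tier["nodes"] * tier[field] for tier in TIERS.values())

total_nodes = sum(tier["nodes"] for tier in TIERS.values())
totals = {field: cluster_total(field)
          for field in ("cpus", "dram_mb", "ctlrs", "disks", "raid_gb")}

print(total_nodes, totals)
```

Under these assumptions the node, CPU, controller, and disk counts come out to 45, 140, 105, and 895. DRAM sums to 14,080 MB and RAID space to 2,680 GB, which the slides quote as 13 GB and 3 TB.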
14
[Figure: Millions of Transactions Per Day (Mtpd), shown on a log scale (0.1 to 1,000) and on a linear scale (0 to 1,000), comparing 1 Btpd, Visa, ATT, BofA, NYSE]
How Much Is 1 Billion Tpd?
• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• ATT
» 185 million calls per peak day (worldwide)
• Visa ~20 million tpd
» 400 million customers
» 250K ATMs worldwide
» 7 billion transactions (card+cheque) in 1994
• New York Stock Exchange
» 600,000 tpd
• Bank of America
» 20 million tpd checks cleared (more than any other bank)
» 1.4 million tpd ATM transactions
• Worldwide Airlines Reservations: 250 Mtpd
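The rate conversions on this slide are plain unit arithmetic. A minimal sketch follows; the function and variable names are illustrative, and the daily volumes are the ones quoted above, spread evenly over 24 hours (real workloads peak, so these are averages only).

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

def per_second(tpd):
    """Average transactions per second for a given daily volume."""
    return tpd / SECONDS_PER_DAY

def per_minute(tpd):
    """Average transactions per minute for a given daily volume."""
    return tpd / MINUTES_PER_DAY

# Daily volumes as quoted on the slide.
workloads = {
    "1 Btpd cluster": 1_000_000_000,
    "Airline reservations": 250_000_000,
    "ATT calls (peak day)": 185_000_000,
    "Visa": 20_000_000,
    "BofA checks": 20_000_000,
    "NYSE": 600_000,
}

for name, tpd in workloads.items():
    print(f"{name:22s} {per_second(tpd):10.1f} tps {per_minute(tpd):12.0f} tpm")
```

The first row gives about 11,574 tps and 694,444 tpm, the figures rounded on the slide to "~ 700,000 tpm".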
15
All Shipping Products!
            Per Sec   Per Min   Per Day
10K TPC     166       10,000    14,400,000
1 BTPD      11,574    694,444   1,000,000,000
1.4 BTPD    16,204    972,222   1,400,000,000
[Figure: a row of SQL servers behind COM / ActiveX, MTS, IIS]
Infinite, Ubiquitous Scaling: Redefining the rules
16
Microsoft.com: ~150x4 nodes
[Figure: microsoft.com site network diagram. Recoverable labels:
• Sites: Building 11, MOSWest, European Data Center, Japan Data Center, Internet
• Network: Primary Gigaswitch, Secondary Gigaswitch, FDDI Rings (MIS1, MIS2, MIS3, MIS4), Switched Ethernet, Routers, MOSWest Admin LAN, SQLNet Feeder LAN
• Internet links: 13 DS3 (45 Mb/Sec each), 2 OC3 (100 Mb/Sec each), 2 Ethernet (100 Mb/Sec each)
• Server groups (node counts vary per ring): www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, activex.microsoft.com, cdm.microsoft.com, FTP.microsoft.com, msid.msn.com, register.msn.com, FTP Download Servers, HTTP Download Servers, FTP Servers, Download Replication, Staging Servers, DMZ Staging Servers, IDC Staging Servers, Live SQL Servers, SQL Servers, SQL Consolidators, SQL Reporting, Internal WWW
• Ave CFG per group: mostly 4xP6 (some 4xP5), 256 RAM to 1 GB RAM, 12 GB to 160 GB HD
• Ave Cost where given: $25K to $83K; FY98 Fcst where given: 2 to 12]
17
NCSA Super Cluster
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A Super Computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
18
TPC C Improved Fast (250%/year!)
[Figure: $/tpmC vs time, log scale $10 to $1,000, Jan-93 to Jul-98. 250 %/year improvement!]
[Figure: tpmC vs time, log scale 100 to 100,000, Jan-93 to Jul-98. 250 %/year improvement!]
• 40% hardware, 100% software, 100% PC Technology
19
Windows NT Versus UNIX
[Figure: tpmC vs Time, Jan-95 to Jan-97, 0 to 35,000 tpmC; series: Unix, NT]
20
Economy Of Scale
[Figure: Transactions/k$ By Vendor; x-axis tpmC (0 to 40,000), y-axis tpmC/k$ (0 to 25); series: DB2/Unix, Sybase/Unix, Informix/Unix, Microsoft/NT, Oracle/Unix]
21
Microsoft TerraServer: Scaleup to Big Databases
• Build a 1 TB SQL Server database
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.5 M place names from Encarta World Atlas
» 3 M Sq Km from USGS (1 meter resolution)
» 1 M Sq Km from Russian Space agency (2 m)
• On the web (world’s largest atlas)
• Sell images with commerce server.
22
Microsoft TerraServer Background
• Earth is 500 Tera-meters square
» USA is 10 tm2
• 100 TM2 land in 70ºN to 70ºS
• We have pictures of 6% of it
» 3 tsm from USGS
» 2 tsm from Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB.
• Slice into 10 KB chunks
• Store chunks in DB
• Navigate with
» Encarta™ Atlas
  • globe
  • gazetteer
» StreetsPlus™ in the USA
[Figure: image pyramid: 40x60 km2 jump image, 20x30 km2 browse image, 10x15 km2 thumbnail, 1.8x1.2 km2 tile]
• Someday
» multi-spectral image
» of everywhere
» once a day / hour
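The compression and chunking figures on this slide imply a rough row count for the image table. A back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) and uniform 10 KB chunks; the variable names are illustrative.

```python
TB = 10**12   # assumption: decimal terabyte
KB = 10**3    # assumption: decimal kilobyte

compressed_bytes = 15 * TB // 10      # 1.5 TB after JPEG compression
compression_ratio = 5                 # "Compress 5:1"
chunk_bytes = 10 * KB                 # "Slice into 10 KB chunks"

# Uncompressed imagery implied by the 5:1 ratio.
raw_bytes = compressed_bytes * compression_ratio

# Number of chunks (one database row each) if every chunk is exactly 10 KB.
chunk_count = compressed_bytes // chunk_bytes

print(f"raw imagery: {raw_bytes / TB:.1f} TB")
print(f"chunks to store: {chunk_count:,}")
```

With these assumptions the database holds on the order of 150 million chunks, cut from about 7.5 TB of uncompressed imagery.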
23
Demo
• navigate by coverage map to White House
• Download image
• buy imagery from USGS
• navigate by name to Venice
• buy SPIN2 image & Kodak photo
• Pop out to Expedia street map of Venice
• Mention that DB will double in next 18 months (2x USGS, 2X SPIN2)
24
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8x400Mhz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
» 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0
25
TerraServer Web Site Software
[Figure: software architecture diagram. Recoverable labels:
• Web Client: browser (HTML, Java Viewer), reached over The Internet
• TerraServer Web Site: Internet Information Server 4.0, Active Server Pages, MTS, Image Server, Image Delivery Application
• TerraServer DB: SQL Server 7, Terra-Server Stored Procedures
• Automap Server: Microsoft Automap ActiveX Server, Internet Information Server 4.0
• Image Provider Site(s): Microsoft Site Server EE, Internet Information Server 4.0, SQL Server 7]
26
Image Delivery and Load
• Incremental load of 4 more TB in next 18 months
[Figure: load pipeline diagram. Recoverable labels: DLT Tape "tar", \Drop’N’, DoJob, Wait 4 Load, LoadMgr DB, LoadMgr, ImgCutter, \Images, NTBackup, DLT Tape, 100mbit EtherSwitch, Enterprise Storage Array (ESA), 3 x 108 9.1 GB Drives, 60 4.3 GB Drives, AlphaServer 8400, 2 x AlphaServer 4100, STK DLT Tape Library]
• LoadMgr steps
» 10: ImgCutter
» 20: Partition
» 30: ThumbImg
» 40: BrowseImg
» 45: JumpImg
» 50: TileImg
» 55: Meta Data
» 60: Tile Meta
» 70: Img Meta
» 80: Update Place
» ...
27
71          Total     Average   Peak
Hits        728.45m   10.26m    29.27m
Queries     565.09m   7.96m     17.76m
Images      212.02m   2.99m     9.23m
PageViews   376.29m   5.30m     9.20m
TerraServer: A Real “World” Example
• Largest DB on the Web
• 1.3TB
• 99.95% uptime since July 1
• No downtime, period, in August
• 70% of downtime for SQL software upgrades
28
NT Clusters (Wolfpack)
• Scale DOWN to PDA: WindowsCE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
» Naming
» Protection/security
» Management/load balance
• Fault tolerance
» “Wolfpack”
• Hot pluggable hardware & software
29
Symmetric Virtual Server Failover Example
[Figure: failover diagram. Recoverable labels: Browser, Server 1, Server 2, Web site, Database, Web site files, Database files]
30
Windows NT 5 (scalability features)
• Better SMP support
• Clusters:
» 16x packs (fault tolerant clusters)
» 100x mobs: arrays for manageability
» SAN/VIA support
• 64 bit addressing for data
» Apps like SQL, Oracle, will use it for data
» 64 bit API to NT comes later (in lab now).
• Remote management (scripting and DCOM)
• Active Directory
• Veritas volume manager
• Many 3rd party HSMs
• Batch support
31
Microsoft SQL Server 7.0
• Fixes the famous performance bugs
» dynamic record locking
» online backup, quick recovery….
• 64 bit addressing buffer pool
• SMP parallelism and better SMP support
• Built in OLAP (cubes and MOLAP)
• Scale down to Win9x
• Improved management interfaces
• Data transform services (for warehouses)
32
Outline
• What is Scalability
• Why does Microsoft care about ScaleUp
• Current ScaleUp Status?
• NT5 & SQL7
33
end
Other slides would be interesting, but...
34
Interesting “other slides”
No time for them but...
• How much information is there?
• IO bandwidth in the Intel world
• Intelligent disks
• SAN/VIA
• NT Cluster Sort
35
Some Tera-Byte Databases
[Figure: scale bar: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
• The Web: 1 TB of HTML
• TerraServer 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
» 15 PB by 2007
• Federal Clearing house: images of checks
» 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
» 10 Exabytes (???!!)
36
Info Capture
• You can record everything you see or hear or read.
• What would you do with it?
• How would you organize & analyze it?
» Video: 8 PB per lifetime (10 GBph)
» Audio: 30 TB (10 KBps)
» Read or write: 8 GB (words)
[Figure: scale bar Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers: A letter, A novel, A Movie, Library of Congress (text), LoC (image), All Disks, All Tapes]
See: http://www.lesk.com/mlesk/ksg97/ksg.html
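The lifetime totals follow from the quoted capture rates. A quick sanity check, assuming decimal units, a 365-day year, and recording around the clock (the slide does not say how many hours per day are recorded); names are illustrative.

```python
PB, TB, GB, KB = 10**15, 10**12, 10**9, 10**3   # assumption: decimal units
HOURS_PER_YEAR = 24 * 365
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600

# Video: 8 PB total at 10 GB per hour.
video_hours = (8 * PB) / (10 * GB)
video_years = video_hours / HOURS_PER_YEAR

# Audio: 30 TB total at 10 KB per second.
audio_seconds = (30 * TB) / (10 * KB)
audio_years = audio_seconds / SECONDS_PER_YEAR

print(f"video: {video_years:.1f} years of continuous recording")
print(f"audio: {audio_years:.1f} years of continuous recording")
```

Both totals correspond to roughly nine decades of continuous capture, so the slide's numbers are consistent with recording a whole lifetime.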
37
Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
» Auto-Summarization, Auto-Search will be a key enabling technology.
38
PAP (peak advertised performance) vs RAP (real application performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Figure: data path Disk, SCSI, PCI, System Bus, File System Buffers, Application Data, with PAP / RAP per stage: System Bus 422 MBps / 7.2 MB/s; PCI 133 MBps / 7.2 MB/s; SCSI 40 MBps / 7.2 MB/s; Disk 10-15 MBps / 7.2 MB/s]
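The half-power goal can be checked stage by stage. This is a sketch only: the pairing of each peak rate with a stage name is an assumption read off the figure labels, and the helper name is hypothetical.

```python
def meets_half_power(rap_mbps, pap_mbps):
    """True when real application performance is at least half of peak."""
    return rap_mbps >= pap_mbps / 2

RAP = 7.2  # MB/s measured end to end on this slide

# Assumed stage-to-peak pairing; the disk is quoted as a 10-15 MBps range.
peaks = {"System Bus": 422, "PCI": 133, "SCSI": 40, "Disk (low)": 10, "Disk (high)": 15}

for stage, pap in peaks.items():
    ratio = RAP / pap
    print(f"{stage:12s} PAP {pap:4d} MBps  RAP/PAP {ratio:5.1%}  "
          f"half-power: {meets_half_power(RAP, pap)}")
```

With a single disk feeding the path, only the disk stage is anywhere near its half-power point; the buses are running at a few percent of their advertised rates.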
39
PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE.
[Figure: data path Disks, SCSI, PCI, File System, Application Data, with PAP / RAP pairs: 422 MBps / 142 MBps; 133 MBps / 72 MBps (PCI); 40 MBps / 31 MBps (SCSI); 10-15 MBps / 9 MBps (Disks)]
40
Bottleneck Analysis
• NTFS Read/Write 12 disk, 4 SCSI, 2 PCI (not measured, we had only one PCI bus available, 2nd one was “internal”)
» ~ 120 MBps Unbuffered read
» ~ 80 MBps Unbuffered write
» ~ 40 MBps Buffered read
» ~ 35 MBps Buffered write
[Figure: Memory Read/Write ~150 MBps; 2 x PCI at ~70 MBps each; 4 x Adapter at ~30 MBps each; 120 MBps total]
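The 120 MBps estimate is a minimum over the stages of the I/O path, each stage contributing its unit count times its per-unit rate. A minimal sketch of that reasoning, using the rates from the figure; the function name and the dictionary layout are illustrative.

```python
def bottleneck(stages):
    """Return (stage name, MBps) of the slowest stage in the path.

    Each stage is (unit count, MBps per unit); parallel units add up.
    """
    capacity = {name: count * mbps for name, (count, mbps) in stages.items()}
    slowest = min(capacity, key=capacity.get)
    return slowest, capacity[slowest]

# Configuration on the slide: 4 SCSI adapters on 2 PCI buses.
two_pci = {"memory": (1, 150), "pci": (2, 70), "scsi adapter": (4, 30)}

# The machine actually available: only one usable PCI bus.
one_pci = {"memory": (1, 150), "pci": (1, 70), "scsi adapter": (4, 30)}

print(bottleneck(two_pci))
print(bottleneck(one_pci))
```

With two PCI buses the four adapters are the limit at 120 MBps; with a single PCI bus the limit drops to that bus at about 70 MBps, which is why the two-bus figure is marked as not measured.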
41
Year 2002 Disks
• Big disk (10 $/GB)
» 3”
» 100 GB
» 150 kaps (k accesses per second)
» 20 MBps sequential
• Small disk (20 $/GB)
» 3”
» 4 GB
» 100 kaps
» 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)
42
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
» CORBA? DCOM? IIOP? RMI?
» One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Figure: two nodes joined by Wire(s); on each node, Applications sit on RPC?, streams, and datagrams, which sit on VIAL/VIPL]
43
SAN: Standard Interconnect
• Gbps Ethernet: 110 MBps
• PCI 32: 70 MBps
• UW Scsi: 40 MBps
• FW scsi: 20 MBps
• scsi: 5 MBps
• LAN faster than memory bus?
• 1 GBps links in lab.
• 300$ port cost soon
• Port is computer
[Figure: tombstones: RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?]
44
PennySort
• Hardware
» 266 Mhz Intel PPro
» 64 MB SDRAM (10ns)
» Dual Fujitsu DMA 3.2GB EIDE
• Software
» NT workstation 4.3
» NT 5 sort
• Performance
» sort 15 M 100-byte records (~1.5 GB)
» Disk to disk
» elapsed time 820 sec
  • cpu time = 404 sec
[Figure: pie chart, PennySort Machine (1107$): cpu 32%, Disk 25%, Other 22%, board 13%, Network, Video, floppy 9%, Memory 8%, Cabinet + Assembly 7%, Software 6%]
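The 820-second run can be related to the machine price. The sketch below assumes the usual PennySort convention that the system is depreciated over three years, so a penny buys a fixed slice of machine time; that convention is not stated on the slide, and the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def penny_budget_seconds(price_dollars, depreciation_years=3):
    """Seconds of machine time one penny buys (assumed 3-year depreciation)."""
    price_pennies = price_dollars * 100
    return depreciation_years * SECONDS_PER_YEAR / price_pennies

budget = penny_budget_seconds(1107)    # machine price from the slide
elapsed = 820                          # seconds, from the slide
sorted_bytes = 15_000_000 * 100        # 15 M records of 100 bytes

throughput_mbps = sorted_bytes / elapsed / 1e6
cpu_utilization = 404 / elapsed        # cpu time over elapsed time

print(f"penny buys {budget:.0f} s; run took {elapsed} s")
print(f"throughput {throughput_mbps:.2f} MB/s, cpu busy {cpu_utilization:.0%}")
```

Under that assumption a penny buys about 855 seconds on a $1107 machine, so the 820-second sort of 1.5 GB fits inside the budget, at roughly 1.8 MB/s with the CPU busy about half the time.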
45
Cluster Sort Conceptual Model
• Multiple Data Sources
• Multiple Data Destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk
[Figure: three source nodes each hold mixed records AAABBBCCC; after the exchange, destination nodes A, B, and C hold AAAAAAAAA, BBBBBBBBB, and CCCCCCCCC]
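The conceptual model is a range-partitioned exchange followed by a local sort on each destination. The sketch below simulates it in one process with lists standing in for disks and sockets; it is not the NTClusterSort code, and the function names and splitter values are assumptions for illustration.

```python
from bisect import bisect_right

def route(records, splitters):
    """Split one source's records into per-destination buckets by key range."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        buckets[bisect_right(splitters, rec)].append(rec)
    return buckets

def cluster_sort(sources, splitters):
    """Every source sends each bucket to its destination node,
    then every destination sorts what it received."""
    inboxes = [[] for _ in range(len(splitters) + 1)]
    for source in sources:
        for dest, bucket in enumerate(route(source, splitters)):
            inboxes[dest].extend(bucket)   # stands in for a socket send
    return [sorted(inbox) for inbox in inboxes]

# Three source nodes, each holding the mixed input from the figure.
sources = [list("AAABBBCCC") for _ in range(3)]
# Keys below "B" go to node 0, keys from "B" up to "C" to node 1, the rest to node 2.
splitters = ["B", "C"]

outputs = cluster_sort(sources, splitters)
print(["".join(out) for out in outputs])
```

Concatenating the destination outputs in node order yields the fully sorted data, because the splitters make every key on node i sort before every key on node i+1.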
46
Cluster Install & Execute
• If this is to be used by others, it must be:
» Easy to install
» Easy to execute
• Installations of distributed systems take time and can be tedious. (AM2, GluGuard)
• Parallel Remote execution is non-trivial. (GLUnix, LSF)
• How do we keep this “simple” and “built-in” to NTClusterSort?
47
Remote Install
• Add Registry entry to each remote node.
» RegConnectRegistry()
» RegCreateKeyEx()
48
Cluster Execution
• Setup:
» MULTI_QI struct
» COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve remote object handle from MULTI_QI struct
• Invoke methods as usual
[Figure: three remote object HANDLEs, each invoking Sort() on its node]