TRANSCRIPT
www.panasas.com
Go Faster. Go Parallel.
Panasas Parallel File System
Brent Welch
Director of Software Architecture, Panasas Inc.
October 16, 2008
HPC User Forum
Slide 2 | October 2008 IDC, Panasas, Inc.
Outline
Panasas background
Storage cluster background, hardware and software
Technical Topics
High Availability
Scalable RAID
pNFS
Brent’s garden, California
Slide 3 | October 2008 IDC, Panasas, Inc.
Panasas Company Overview
Founded: 1999 by Prof. Garth Gibson, co-inventor of RAID
Technology: parallel file system and parallel storage appliance
Locations: US: HQ in Fremont, CA; R&D centers in Pittsburgh and Minneapolis. EMEA: UK, DE, FR, IT, ES, BE, Russia. APAC: China, Japan, Korea, India, Australia
Customers: FCS October 2003; deployed at 200+ customers
Market focus: Energy, Academia, Government, Life Sciences, Manufacturing, Finance
Alliances: ISVs, Resellers
Primary investors
Slide 4 | October 2008 IDC, Panasas, Inc.
The New ActiveStor Product Line
Positioned across design, modeling and visualization applications; simulation and analysis applications; and tiered QoS storage for backup/secondary use.
Software: ActiveScale 3.2
ActiveStor 6000: 20, 15 or 10 TB shelves; 20 GB cache/shelf; integrated 10GigE; 600 MB/sec/shelf; Tiered Parity; ActiveImage; ActiveGuard; ActiveMirror
ActiveStor 4000: 20 or 15 TB shelves; 5 GB cache/shelf; integrated 10GigE; 600 MB/sec/shelf; Tiered Parity. Options: ActiveImage, ActiveGuard, ActiveMirror
ActiveStor 200: 104 TB (3 DBs x 52 SBs); 5 shelves (20U, 35"); single 1GigE port/shelf; 350 MB/sec aggregate; 20 GB aggregate cache; Tiered Parity
Slide 5 | October 2008 IDC, Panasas, Inc.
Leaders in HPC choose Panasas
Slide 6 | October 2008 IDC, Panasas, Inc.
US DOE: Panasas selected for Roadrunner, top of the Top 500. LANL's $130M system will deliver 2x the performance of the current top BG/L.
SciDAC: Panasas CTO selected to lead the Petascale Data Storage Institute. CTO Gibson leads PDSI, launched Sep '06, leveraging experience from PDSI members LBNL/NERSC, LANL, ORNL, PNNL, Sandia NL, and UCSC.
Aerospace: Airframes and engines, both commercial and defense. Boeing HPC file system; a major engine manufacturer; top 3 U.S. defense contractors.
Formula 1: HPC file system for the top 2 clusters; 3 teams in total. Top clusters at Renault F1 and BMW Sauber; Ferrari also on Panasas.
Intel: Certifies Panasas storage for a broad range of HPC applications, now Intel Cluster Ready (ICR). Intel uses Panasas storage for EDA design and in its HPC benchmark center.
SC07: Six Panasas customers won awards at the SC07 (Reno) conference.
Validation: Extensive recognition and awards for HPC breakthroughs.
Panasas Leadership Role in HPC
Slide 7 | October 2008 IDC, Panasas, Inc.
Panasas Joint Alliance Investments with ISVs
Panasas ISV alliances, by application and vertical focus:
Seismic processing; interpretation; reservoir modeling: Energy
Computational fluid dynamics (CFD); computational structural mechanics (CSM): Manufacturing; Government; Higher Ed and Research; Energy
Electronic design automation (EDA): Semiconductor
Trading; derivatives pricing; risk analysis: Financial
Slide 8 | October 2008 IDC, Panasas, Inc.
Intel Use of Panasas in the IC Design Flow
Panasas critical to tape-out stage
Slide 9 | October 2008 IDC, Panasas, Inc.
University of Cambridge HPC Service, Darwin Supercomputer
Darwin Supercomputer computational units: nine repeating units, each consisting of 64 nodes (2 racks) providing 256 cores, 2340 cores total
All nodes within a CU connected to a full bisectional bandwidth InfiniBand fabric: 900 MB/s, MPI latency of 2 µs
Description of Darwin at U of Cambridge
Source: http://www.hpc.cam.ac.uk
Slide 10 | October 2008 IDC, Panasas, Inc.
Details of the FLUENT 111M cell model
Number of cells: 111,091,452
Solver: PBNS, DES, unsteady
Iterations: 5 time steps, 100 total iterations; data save after the last iteration
Output size:
FLUENT v6.3 (serial I/O; size of .dat file): 14,808 MB
FLUENT v12 (serial I/O; size of .dat file): 16,145 MB
FLUENT v12 (parallel I/O; size of .pdat file): 19,683 MB
Case: unsteady external aero for a 111M cell truck; 5 time steps with 100 iterations, and a single .dat file write

Univ of Cambridge DARWIN cluster
Location: University of Cambridge, http://www.hpc.cam.ac.uk
Vendor: Dell; 585 nodes; 2340 cores; 8 GB per node; 4.6 TB total memory
CPU: Intel Woodcrest DC, 3.0 GHz / 4 MB L2 cache
Interconnect: InfiniPath QLE7140 SDR HCAs; SilverStorm 9080 and 9240 switches
File system: Panasas PanFS, 4 shelves, 20 TB capacity
Operating system: Scientific Linux CERN SLC release 4.6
DARWIN: 585 nodes, 2340 cores; Panasas: 4 shelves, 20 TB; truck model: 111M cells
Slide 11 | October 2008 IDC, Panasas, Inc.
FLUENT comparison of PanFS vs. NFS on the University of Cambridge cluster (truck aero, 111M cells).
Time of solver + data file write, in seconds (lower is better), by number of cores:
64 cores: PanFS / FLUENT 12: 2541; NFS / FLUENT 6.3: 4568
128 cores: PanFS / FLUENT 12: 1318; NFS / FLUENT 6.3: 2680
256 cores: PanFS / FLUENT 12: 779; NFS / FLUENT 6.3: 1790
(Chart callouts: 1.7x, 1.5x, 1.9x, 1.7x.)
NOTE: Read times are not included in these results.
Slide 12 | October 2008 IDC, Panasas, Inc.
FLUENT comparison of PanFS vs. NFS on the University of Cambridge cluster (truck aero, 111M cells).
Effective rates of I/O for the data file write, in MB/s (higher is better):
PanFS / FLUENT 12: 273 at 64 cores, 345 at 128 cores, 334 at 256 cores
(Chart callouts mark PanFS at 39x, 31x, and 20x the NFS / FLUENT 6.3 rates.)
NOTE: Data file write only.
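The "effective rate" here is simply the amount of data written divided by the wall-clock write time. A minimal sketch of that arithmetic, assuming this definition (the 57-second write time below is back-calculated for illustration, not a measured value from the talk):

```python
def effective_rate_mb_s(output_mb: float, write_seconds: float) -> float:
    """Effective write bandwidth = data written / wall-clock write time."""
    return output_mb / write_seconds

# FLUENT v12 parallel I/O writes a 19,683 MB .pdat file (slide 10).
# A hypothetical 57-second write corresponds to ~345 MB/s, the PanFS
# rate the chart reports at 128 cores.
print(f"{effective_rate_mb_s(19_683, 57):.0f} MB/s")  # -> 345 MB/s
```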
Slide 13 | October 2008 IDC, Panasas, Inc.
Panasas Architecture
Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth
Scalable performance with commodity parts provides excellent price/performance
Object-based storage provides additional scalability and security advantages over block-based SAN file systems
Automatic management of storage resources to balance load across the cluster
Shared file system (POSIX) with the advantages of NAS, with direct-to-storage performance advantages of DAS and SAN
(Diagram: each blade contributes disk, CPU, memory, and network.)
Slide 14 | October 2008 IDC, Panasas, Inc.
Panasas bladeserver building block
The shelf: StorageBlades and a DirectorBlade on a midplane, with battery, embedded switch, power supplies, and rails; 11 slots for blades; 4U high.
Garth Gibson, July 2008
Slide 15 | October 2008 IDC, Panasas, Inc.
Panasas Blade Hardware
DirectorBlade, StorageBlade, integrated GE switch.
Shelf front: 1 DB, 10 SB. Shelf rear: midplane routes GE and power; battery module (2 power units).
Slide 16 | October 2008 IDC, Panasas, Inc.
Panasas Product Advantages
Proven implementation with appliance-like ease of use/deployment
Running mission-critical workloads at global F500 companies
Scalable performance with Object-based RAID
No degradation as the storage system scales in size
Unmatched RAID rebuild rates – parallel reconstruction
Unique data integrity features
Vertical parity on drives to mitigate media errors and silent corruptions
Per-file RAID provides scalable rebuild and per-file fault isolation
Network verified parity for end-to-end data verification at the client
Scalable system size with integrated cluster management
Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers
Simultaneous access from over 12000 servers
Slide 17 | October 2008 IDC, Panasas, Inc.
Linear Performance Scaling
Breakthrough data throughput AND random I/O
Performance and scalability for all workloads
Slide 18 | October 2008 IDC, Panasas, Inc.
Scaling Clients
IOzone multi-shelf performance test (4 GB sequential write/read).
(Chart: aggregate MB/sec, 0 to 3,000, vs. number of clients, 10 to 200, with write and read curves for 1-, 2-, 4-, and 8-shelf systems.)
Slide 19 | October 2008 IDC, Panasas, Inc.
Proven Panasas Scalability
Storage Cluster Sizes Today (e.g.)
Boeing, 50 DirectorBlades, 500 StorageBlades in one system. (plus 25 DirectorBlades and 250 StorageBlades each in two other smaller systems.)
LANL RoadRunner. 100 DirectorBlades, 1000 StorageBlades in one system today, planning to increase to 144 shelves next year.
Intel has 5,000 active DirectFLOW clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS. They have qualified a 12,000-client configuration of 2.3, and will deploy "lots" of compute nodes against 3.2 later this year.
BP uses 200 StorageBlade storage pools as their building block
LLNL, two realms, each 60 DirectorBlades (NFS) and 160 StorageBlades
Most customers run systems in the 100 to 200 blade size range
Slide 20 | October 2008 IDC, Panasas, Inc.
Emphasis on Data Integrity
Horizontal Parity
Per-file, object-based RAID across OSDs
Scalable on-line performance
Scalable parallel RAID rebuild
Vertical Parity
Detect and eliminate unreadable sectors and silent data corruption
RAID at the sector level within a drive / OSD
Network Parity
Client verifies per-file parity equation during reads
Provides the only truly end-to-end data integrity solution available today
Many other reliability features…
Media scanning, metadata fail over, network multi-pathing, active hardware monitors, robust cluster management
Slide 21 | October 2008 IDC, Panasas, Inc.
High Availability
Quorum based cluster management
3 or 5 cluster managers to avoid split brain
Replicated system state
Cluster manager controls the blades and all other services
High performance file system metadata fail over
Primary-backup relationship controlled by cluster manager
Low latency log replication to protect journals
Client-aware fail over for application-transparency
NFS level fail over via IP takeover
Virtual NFS servers migrate among DirectorBlade modules
Lock services (lockd/statd) fully integrated with fail over system
Slide 22 | October 2008 IDC, Panasas, Inc.
Technology Review
Turn-key deployment and automatic resource configuration
Scalable Object RAID
Very fast RAID rebuild
Vertical Parity to trap silent corruptions
Network parity for end-to-end data verification
Distributed system platform with quorum-based fault tolerance
Coarse grain metadata clustering
Metadata fail over
Automatic capacity load leveling
Storage Clusters scaling to ~1000 nodes today
Compute clusters scaling to 12,000 nodes today
Blade-based hardware with 1Gb/sec building block
Bigger building block going forward
Slide 23 | October 2008 IDC, Panasas, Inc.
The pNFS Standard
The pNFS standard defines the NFSv4.1 protocol extensions between the server and client
The I/O protocol between the client and storage is specified elsewhere, for example:
SCSI Block Commands (SBC) over Fibre Channel (FC)
SCSI Object-based Storage Device (OSD) over iSCSI
Network File System (NFS)
The control protocol between the server and storage devices is also specified elsewhere, for example:
SCSI Object-based Storage Device (OSD) over iSCSI
(Diagram: client, storage, and NFSv4.1 server.)
Slide 24 | October 2008 IDC, Panasas, Inc.
Key pNFS Participants
Panasas (Objects)
Network Appliance (Files over NFSv4)
IBM (Files, based on GPFS)
EMC (Blocks, HighRoad MPFSi)
Sun (Files over NFSv4)
U of Michigan/CITI (Files over PVFS2)
Slide 25 | October 2008 IDC, Panasas, Inc.
pNFS Status
pNFS is part of the IETF NFSv4 minor version 1 standard draft
Working group is passing draft up to IETF area directors, expect RFC later in ’08
Prototype interoperability continues: San Jose Connect-a-thon March '06, February '07, May '08
Ann Arbor NFS Bake-a-thon September ’06, October ’07
Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08)
Availability: TBD, gated behind NFSv4 adoption and working implementations of pNFS
Patch sets to be submitted to Linux NFS maintainer starting “soon”
Vendor announcements in 2008
Early adopters in 2009
Production ready in 2010
www.panasas.com
Go Faster. Go Parallel.
Questions?
Thank you for your time!
Slide 27 | October 2008 IDC, Panasas, Inc.
Deep Dive: Reliability
High Availability
Cluster Management
Data Integrity
Slide 28 | October 2008 IDC, Panasas, Inc.
Vertical Parity
“RAID” within an individual drive
Seamless recovery from media errors by applying RAID schemes across disk sectors
Repairs media defects by writing through to spare sectors
Detects silent corruptions and prevents reading wrong data
Independent of horizontal array-based parity schemes
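As a concrete illustration of the scheme just described, here is a minimal sketch of sector-group parity inside a single drive: XOR parity detects silent corruption, and a sector the drive has flagged bad can be rebuilt and written through to a spare. The 8+1 grouping and all names are illustrative, not the PanFS on-disk format.

```python
SECTOR = 512  # bytes

def xor_sectors(sectors):
    """XOR a list of equal-sized sectors together."""
    out = bytearray(SECTOR)
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def group_is_clean(data_sectors, parity):
    """Detect silent corruption: the group parity equation must hold."""
    return xor_sectors(data_sectors) == parity

def recover_sector(data_sectors, parity, bad_idx):
    """Rebuild a known-bad sector (e.g., one the drive flagged with a
    media error) from the survivors plus the group parity; the result
    can then be written through to a spare sector."""
    survivors = [s for i, s in enumerate(data_sectors) if i != bad_idx]
    return xor_sectors(survivors + [parity])

# 8 data sectors + 1 parity sector per group (grouping is illustrative).
data = [bytes([i]) * SECTOR for i in range(8)]
parity = xor_sectors(data)
assert group_is_clean(data, parity)
assert recover_sector(data, parity, bad_idx=3) == data[3]
```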
Slide 29 | October 2008 IDC, Panasas, Inc.
Network Parity
(Diagram: horizontal parity, vertical parity, and network parity layered together.)
Extends parity capability across the data path to the client or server node
Enables End-to-End data integrity validation
Protects from errors introduced by disks, firmware, server hardware, server software, network components and transmission
Client either receives valid data or an error notification
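A minimal sketch of the read-side check described above: the client re-evaluates the per-file RAID-5 parity equation over the bytes it actually received, so corruption introduced anywhere along the path is caught before data reaches the application. The helper names are hypothetical.

```python
def xor_units(units):
    """XOR equal-sized stripe units together."""
    out = bytearray(len(units[0]))
    for u in units:
        for i, b in enumerate(u):
            out[i] ^= b
    return bytes(out)

def client_read(stripe_units, parity_unit):
    """Hand data to the application only if the parity equation still
    holds over what actually arrived at the client."""
    if xor_units(stripe_units) != parity_unit:
        raise IOError("end-to-end parity mismatch: corrupt data or parity")
    return b"".join(stripe_units)

# Example: a clean 3+1 stripe verifies; a flipped bit anywhere would not.
units = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_units(units)
assert client_read(units, parity) == b"aaaabbbbcccc"
```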
Slide 30 | October 2008 IDC, Panasas, Inc.
Panasas Scalability
Two Layer Architecture
Division between these two layers is an important separation of concerns
Platform maintains a robust system model and provides overall control
File system is an application layered over the distributed system platform
Automation in the distributed system platform helps the system adapt to failures without a lot of hand-holding by administrators
The file system uses protocols optimized for performance, and relies on the platform to provide robust protocols for failure handling
(Layer diagram: applications on the parallel file system, on the distributed system platform, on the hardware.)
Slide 31 | October 2008 IDC, Panasas, Inc.
Model-based Platform
The distributed system platform maintains a model of the system
what are the basic configuration settings (networking, etc.)
which storage and manager nodes are in the system
where services are running
what errors and faults are present in the system
what recovery actions are in progress
Model-based approach mandatory to reduce administrator overhead
Automatic discovery of resources
Automatic reaction to faults
Automatic capacity balancing
Proactive hardware monitoring
Slide 32 | October 2008 IDC, Panasas, Inc.
Quorum-based Cluster Management
The system model is replicated on 3 or 5 (or more) nodes
Maintained via a Lamport’s PTP (Paxos) quorum voting protocol
PTP handles quorum membership change, brings partitioned members back up to date, provides basic transaction model
Each member keeps the model in a local database that is updated within a PTP transaction
7 or 14 msec update cost, which is dominated by 1 or 2 synchronous disk IOs
This robust mechanism is not on the critical path of any file system operation
Yes: change admin password (involves cluster manager)
Yes: configure quota tree
Yes: initiate service fail over
No: open close read write etc. (does not involve cluster manager)
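A toy majority-commit sketch in the spirit of the PTP transaction model above; it shows only the quorum rule (a real Paxos implementation also handles leader election, membership change, and catch-up), and all names are hypothetical.

```python
class ManagerNode:
    """One of the 3 or 5 cluster-manager nodes."""
    def __init__(self):
        self.db = []                      # local replica of the system model

    def accept(self, txn_id, update):
        self.db.append((txn_id, update))  # the synchronous disk I/O lives here
        return True

def commit(nodes, txn_id, update):
    """An update is durable only once a majority has logged it."""
    acks = sum(1 for n in nodes if n.accept(txn_id, update))
    if acks <= len(nodes) // 2:
        raise RuntimeError("no quorum: update aborted")

managers = [ManagerNode() for _ in range(3)]
commit(managers, 1, {"service": "metadata", "failover_to": "backup"})
# open/close/read/write never enter commit(), which is why the 7-14 ms
# transaction cost stays off the file system's critical path.
```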
Slide 33 | October 2008 IDC, Panasas, Inc.
File Metadata
PanFS metadata manager stores metadata in object attributes
All component objects have simple attributes like their capacity, length, and security tag
Two component objects store a replica of the file-level attributes, e.g. file length, owner, ACL, parent pointer
Directories contain hints about where components of a file are stored
There is no database on the metadata manager
Just transaction logs that are replicated to a backup via a low-latency network protocol
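A small sketch of this layout, with illustrative field names: every component object carries its own simple attributes, and exactly two components also carry a replica of the file-level attributes, so there is no central database to consult.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ComponentObject:
    # Simple attributes every component object carries for itself.
    length: int = 0
    capacity: int = 0
    security_tag: bytes = b""
    # File-level attributes (owner, ACL, file length, parent pointer),
    # replicated on exactly two of the file's components.
    file_attrs: Optional[dict] = None

def create_file_components(n_components, owner, acl, parent):
    comps = [ComponentObject() for _ in range(n_components)]
    for c in comps[:2]:   # the two attribute-bearing replicas
        c.file_attrs = {"owner": owner, "acl": acl,
                        "file_length": 0, "parent": parent}
    return comps

def read_file_attrs(comps):
    """Either replica can serve the file-level attributes."""
    for c in comps:
        if c.file_attrs is not None:
            return c.file_attrs
    raise LookupError("no attribute-bearing component reachable")
```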
Slide 34 | October 2008 IDC, Panasas, Inc.
File Metadata
File ownership divided along file subtree boundaries (“volumes”)
Multiple metadata managers, each own one or more volumes
Match with the quota-tree abstraction used by the administrator
Creating volumes creates more units of meta-data work
Primary-backup failover model for MDS
90 µs remote log update over 1GE, vs. 2 µs local in-memory log update
Some customers introduce lots of volumes
One volume per DirectorBlade module is ideal
Some customers are stubborn and stick with just one
E.g., 75 TB and 144 StorageBlade modules and a single MDS
Slide 35 | October 2008 IDC, Panasas, Inc.
Scalability over time
Software baseline
Single software product, but with branches corresponding to major releases that come out every year (or so)
Today most customers are on 3.0.x, which had a two-year lifetime
We are just introducing 3.2, which has been in QA and beta for 7 months
There is forward development on newer features
Our goal is a major release each year, with a small number of maintenance releases in between major releases
Compatibility is key
New versions upgrade cleanly over old versions
Old clients communicate with new servers, and vice versa
Old hardware is compatible with new software
Integrated data migration to newer hardware platforms
Slide 36 | October 2008 IDC, Panasas, Inc.
Deep Dive: Networking
Scalable networking infrastructure
Integration with compute cluster fabrics
Slide 37 | October 2008 IDC, Panasas, Inc.
LANL Systems
Turquoise (unclassified)
Pink 10TF - TLC 1TF - Coyote 20TF (compute)
1 Lane PaScalBB ~ 100 GB/sec (network)
68 Panasas Shelves, 412 TB, 24 GBytes/sec (storage)
Unclassified Yellow
Flash/Gordon 11TF - Yellow Roadrunner Base 5TF – (4 RRp3 CU’s 300 TF)
2 Lane PaScalBB ~ 200 GB/sec
24 Panasas Shelves - 144 TB - 10 GBytes/sec
Classified Red
Lightning/Bolt 35TF - Roadrunner Base 70TF (Accelerated 1.37PF)
6 Lane PaScalBB ~ 600 GB/sec
144 Panasas Shelves - 936 TB - 50 GBytes/sec
Slide 38 | October 2008 IDC, Panasas, Inc.
IB and other network fabrics
Panasas is a TCP/IP, GE-based storage product
Universal deployment, Universal routability
Commodity price curve
Panasas customers use IB, Myrinet, Quadrics, …
Cluster interconnect du jour for performance, not necessarily cost
IO routers connect cluster fabric to GE backbone
Analogous to an “IO node”, but just does TCP/IP routing (no storage)
Robust connectivity through IP multipath routing
Scalable throughput at approx 650 MB/sec per IO router (PCIe class)
Working on a 1GB/sec IO router
IB-GE switching platforms
QLogic or Voltaire switch provides wire-speed bridging
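Back-of-the-envelope sizing for the IO-router layer follows directly from the per-router figure above; the +1 spare for multipath failover is an illustrative choice, not a Panasas rule.

```python
import math

def io_routers_needed(target_gb_per_s: float,
                      per_router_mb_per_s: float = 650.0,
                      spares: int = 1) -> int:
    """Routers required to carry the target bandwidth, plus spares so
    IP multipath routing can survive the loss of a router."""
    return math.ceil(target_gb_per_s * 1000 / per_router_mb_per_s) + spares

print(io_routers_needed(10))   # 17: sixteen to carry 10 GB/s, one spare
```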
Slide 39 | October 2008 IDC, Panasas, Inc.
(Diagram: Petascale Red infrastructure with Roadrunner Accelerated, FY08; scalable to 600 GB/sec before adding lanes. NFS and other network services, WAN, and archive hang off secure core switches. An IB 4X fat tree and Myrinet fabrics connect Roadrunner Phase 3 (1.026 PF), Roadrunner Phase 1 (70 TF), and Lightning/Bolt (35 TF) compute units and IO units, with 1GE and 10GE IO nodes and failover paths, to the site-wide shared global parallel file system. FTAs attach at Nx10GE; storage attaches at 4 GE per 5-8 TB.)
Slide 40 | October 2008 IDC, Panasas, Inc.
(Diagram: LANL Petascale (Red) FY08. The NFS complex and other network services, WAN, and archive attach to the lane switches: 6 x 105 = 630 GB/s, with lane passthru switches to connect legacy Lightning/Bolt. The site-wide shared global parallel file system spans the lanes at 650-1500 TB and 50-200 GB/s.
Roadrunner Base: 70 TF; 144-node units with 12 I/O nodes per unit; 4-socket dual-core AMD nodes with 32 GB memory per node; full fat tree; 14 units; acceleration to 1 PF sustained.
I/O node options: 64 I/O nodes with two 1 Gbit links each, 16 GB/sec; 96 I/O nodes with two 1 Gbit links each, 24 GB/sec; 156 I/O nodes with one 10 Gbit link each, 195 GB/sec, planned for Accelerated Roadrunner at 1 PF sustained.
Bolt: 20 TF; 2-socket single/dual-core AMD; 256-node units; reduced fat tree; 1920 nodes.
Lightning: 14 TF; 2-socket single-core AMD; 256-node units; full fat tree; 1608 nodes.
FTAs at N Gb/s; 20 Gb/s per link; 4 Gb/s per 5-8 TB; compute units and IO units with failover.)
If more bandwidth is needed, we just add more lanes and more storage; this scales nearly linearly, not N^2 like a fat-tree storage area network.
Slide 41 | October 2008 IDC, Panasas, Inc.
Multi-cluster sharing: scalable bandwidth with failover.
(Diagram: clusters A, B, and C, each with compute nodes and I/O nodes, connect through layer 2 switches to shared Panasas storage and the site network, alongside NFS, DNS, Kerberos, and archive services. Colors depict subnets.)
Slide 42 | October 2008 IDC, Panasas, Inc.
Deep Dive: Scalable RAID
Per-file RAID
Scalable RAID rebuild
Slide 43 | October 2008 IDC, Panasas, Inc.
Automatic per-file RAID
System assigns RAID level based on file size
<= 64 KB RAID 1 for efficient space allocation
> 64 KB RAID 5 for optimum system performance
> 1 GB two-level RAID-5 for scalable performance
RAID-1 and RAID-10 for optimized small writes
Automatic transition from RAID 1 to 5 without re-striping
Programmatic control for application-specific layout optimizations
Create with layout hint
Inherit layout from parent directory
(Diagram: small files use RAID 1 mirroring; large files use RAID 5 striping; very large files use two-level RAID 5 striping.)
Clients are responsible for writing data and its parity
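A minimal sketch of the two ideas on this slide: the RAID level follows file size using the thresholds above, and the client itself computes the parity it writes. Stripe geometry and names are illustrative.

```python
KB, GB = 1024, 1024**3

def raid_level(file_size: int) -> str:
    """Pick the layout from file size, per the thresholds above."""
    if file_size <= 64 * KB:
        return "RAID 1"               # mirrored small file
    if file_size <= 1 * GB:
        return "RAID 5"               # single-level stripe
    return "two-level RAID 5"         # stripe of RAID-5 stripes

def client_write_stripe(data_units):
    """Client-side RAID-5: the client computes the parity unit itself,
    then issues one write per component object (data + parity)."""
    parity = bytearray(len(data_units[0]))
    for unit in data_units:
        for i, b in enumerate(unit):
            parity[i] ^= b
    return list(data_units) + [bytes(parity)]

assert raid_level(4 * KB) == "RAID 1"
assert raid_level(100 * 1024 * KB) == "RAID 5"   # a 100 MB file
```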
Slide 44 | October 2008 IDC, Panasas, Inc.
Declustered RAID
Files are striped across component objects on different StorageBlades
Component objects include file data and file parity for reconstruction
File attributes are replicated with two component objects
Declustered, randomized placement distributes RAID workload
(Diagram: a 2-shelf BladeSet with mirrored or 9-OSD parity stripes. On a blade failure, rebuild reads about half of each surviving OSD and writes a little to each OSD. Scales linearly.)
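A small sketch of declustered placement, assuming a pseudo-random choice of component-object locations per file: when one blade fails, the stripes that touch it are scattered across the whole BladeSet, so every survivor contributes to the rebuild.

```python
import random

def place_file(file_id: int, blades: range, stripe_width: int = 9):
    """Choose this file's component-object locations pseudo-randomly
    across the whole BladeSet (9-OSD stripes, per the diagram)."""
    return random.Random(file_id).sample(list(blades), stripe_width)

blades = range(20)                     # e.g., a 2-shelf BladeSet
files = [place_file(f, blades) for f in range(10_000)]

failed = 7
affected = [p for p in files if failed in p]
helpers = {b for p in affected for b in p if b != failed}
print(f"{len(affected)} files to repair, work spread over "
      f"{len(helpers)} surviving blades")   # all 19 survivors help
```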
Slide 45 | October 2008 IDC, Panasas, Inc.
Scalable RAID Rebuild
Shorter repair time in larger storage pools
Customers report 30-minute rebuilds for 800 GB in a 40+ shelf BladeSet
Variability at 12 shelves due to uneven utilization of DirectorBlade modules
Larger numbers of smaller files was better
Reduced rebuild at 8 and 10 shelves because of wider parity stripe
Rebuild bandwidth is the rate at which data is regenerated (writes)
Overall system throughput is N times higher because of the necessary reads
Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel
Declustering spreads disk I/O over more disk arms (StorageBlades)
(Chart: rebuild MB/sec, 0 to 140, vs. number of shelves, 0 to 14, with curves for one volume vs. N volumes and 1 GB vs. 100 MB files; rebuild rate grows with shelf count.)
Slide 46 | October 2008 IDC, Panasas, Inc.
Scalable rebuild is mandatory
Having more drives increases risk, just like having more light bulbs increases the odds one will be burnt out at any given time
Larger storage pools must mitigate their risk by decreasing repair times
The math says: if, for example, 100 drives are arranged as 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, the risk is the same as a single RAID set of 100 drives whose rebuild time is N/10.
Block-based RAID scales in the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read.
Only declustering provides scalable rebuild rates.
(Figure: risk as a function of total number of drives, drives per RAID set, and repair time.)
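The slide's risk argument as rough arithmetic (illustrative numbers; a real MTTDL calculation has more terms): exposure to a second failure during repair scales with the number of other drives in the redundancy set times the repair time.

```python
def loss_exposure(drives_per_set: int, repair_hours: float) -> float:
    """Relative risk that a second drive in the same RAID set fails
    while a repair is in flight (constant failure rate dropped)."""
    return (drives_per_set - 1) * repair_hours

print(loss_exposure(10, 8.0))    # 10 sets of 10, N = 8 h    -> 72.0
print(loss_exposure(100, 0.8))   # one 100-drive set, N/10   -> 79.2
# Nearly equal exposure: a 10x larger pool is only as safe if
# declustering also makes the rebuild ~10x faster.
```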
Slide 47 | October 2008 IDC, Panasas, Inc.
Deep Dive: pNFS
Standards-based parallel file systems: NFSv4.1
Slide 48 | October 2008 IDC, Panasas, Inc.
pNFS: Standard Storage Clusters
pNFS is an extension to the Network File System v4 protocol standard
Allows for parallel and direct access from Parallel Network File System clients to storage devices over multiple storage protocols
Moves the Network File System server out of the data path
(Diagram: pNFS clients move data directly to block (FC), object (OSD), or file (NFS) storage; metadata and control flow between clients and the NFSv4.1 server.)
Slide 49 | October 2008 IDC, Panasas, Inc.
pNFS Layouts
Client gets a layout from the NFS Server
The layout maps the file onto storage devices and addresses
The client uses the layout to perform direct I/O to storage
At any time the server can recall the layout
Client commits changes and returns the layout when it’s done
pNFS is optional, the client can always use regular NFSv4 I/O
(Diagram: clients obtain layouts from the NFSv4.1 server and perform I/O directly to storage.)
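The layout lifecycle above, sketched as pseudocode. LAYOUTGET, LAYOUTCOMMIT, and LAYOUTRETURN are real NFSv4.1 operations, but the client/server objects and method names here are hypothetical.

```python
def pnfs_write(client, server, path, data):
    fh = server.open(path)
    layout = server.layoutget(fh)            # map file -> devices, addresses
    if layout is None:                       # pNFS is optional:
        return server.nfs_write(fh, data)    # fall back to plain NFSv4 I/O
    for device, chunk in layout.split(data):
        client.write_direct(device, chunk)   # parallel I/O, server bypassed
    server.layoutcommit(fh, new_size=len(data))  # publish new size/mtime
    server.layoutreturn(fh)                  # the server may also recall it
```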
Slide 50 | October 2008 IDC, Panasas, Inc.
pNFS Client
Common client for different storage back ends
Wider availability across operating systems
Fewer support issues for storage vendors
(Diagram: client apps sit above a pNFS client with a pluggable layout driver: 1. SBC (blocks), 2. OSD (objects), 3. NFS (files), 4. PVFS2 (files), 5. future backends. The pNFS server fronts a cluster filesystem; layout metadata grant and revoke flows over NFSv4.1.)
Slide 51 | October 2008 IDC, Panasas, Inc.
pNFS is not…
Improved cache consistency
NFS has open-to-close consistency enforced by client polling of attributes
NFSv4.1 directory delegations can reduce polling overhead
Perfect POSIX semantics in a distributed file system
NFS semantics are good enough (or, all we’ll give you)
But note also the POSIX High End Computing Extensions Working Group
http://www.opengroup.org/platform/hecewg/
Clustered metadata
Not a server-to-server protocol for scaling metadata
But, it doesn’t preclude such a mechanism
Slide 52 | October 2008 IDC, Panasas, Inc.
Prototype pNFS Performance
pNFS iozone throughput: 8 clients, 1+10 SB1000, 5 GB files.
(Chart: aggregate MB/s, 0 to 450, vs. number of clients, 1 to 8, with write, re-write, read, and re-read curves.)
Slide 53 | October 2008 IDC, Panasas, Inc.
Prototype pNFS Performance
pNFS iozone throughput: 8 clients, 4+18 system, 5 GB files.
(Chart: aggregate MB/s, 0 to 700, vs. number of clients, 1 to 8, with write, re-write, read, and re-read curves.)
Slide 54 | October 2008 IDC, Panasas, Inc.
Is pNFS Enough?
Standard for out-of-band metadata: a great start toward avoiding the classic server bottleneck
NFS has already relaxed some semantics to favor performance
But there are certainly some workloads that will still hurt
Standard framework for clients of different storage backends: Files (Sun, NetApp, IBM/GPFS, GFS2)
Objects (Panasas, Sun?)
Blocks (EMC, LSI)
PVFS2
Your project… (e.g., dcache.org)
Slide 55 | October 2008 IDC, Panasas, Inc.
Storage Cluster Components
StorageBlade
Processor, Memory, 2 NICs, 2 spindles
Object storage system
Block management
DirectorBlade
Processor, Memory, 2 NICs
Distributed file system
File and Object management
Cluster management
NFS/CIFS re-export
Integrated hardware/software solution
11 blades per 4u shelf
Today: 1 to 100-shelf systems
Tomorrow: 1 to 500-shelf systems
Object-based Clustered File System
Smart, Commodity Hardware
Panasas ActiveScale Storage Cluster
Slide 56 | October 2008 IDC, Panasas, Inc.
Internal cluster management makes a large collection of blades work as a single system
(Diagram: up to 12,000 compute nodes run the PanFS client, speaking iSCSI/OSD directly to 1,000+ storage nodes running OSDFS; 100+ manager nodes run the SysMgr and NFS/CIFS gateways, reached via RPC.)
Out-of-band architecture with direct, parallel paths from clients to storage nodes.
Slide 57 | October 2008 IDC, Panasas, Inc.
Panasas Global Storage Model
(Diagram: client nodes reach Panasas System A and System B over a TCP/IP network. System A holds BladeSet 1, with volumes VolX, VolY, and VolZ, and BladeSet 2, with volumes delta and home; System B holds BladeSet 3, with volumes VolM, VolN, and VolL. A BladeSet is a physical storage pool; a volume is a logical quota tree. Paths such as /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38 combine a DNS name with the global name space.)