TRANSCRIPT
GPFS-FPO: A Cluster File System for Scale-Out Applications
Dinesh Subhraveti, IBM Research
GPFS-FPO: Motivation

Motivation: SANs are built for latency, not bandwidth.

Advantages of using GPFS:
- High scale (thousands of nodes, petabytes of storage), high performance, high availability, data integrity
- POSIX semantics, workload isolation, enterprise features (security, snapshots, backup/restore, archive, asynchronous caching and replication)

Challenges:
- Adapt GPFS to shared-nothing clusters
- Maximize application performance relative to cost
- Failure is common, and the network is a bottleneck
[Diagram: traditional architecture, where the SAN creates an I/O bottleneck, vs. a shared-nothing environment with scale-out performance: 1 server -> 1 GB/s, 100 servers -> 100 GB/s]
[Chart: throughput of gpfs-san vs. gpfs-snc for sequential write, sequential read, random read, random write, backward read, record rewrite, and stride read]
GPFS-FPO: Architecture

[Diagram: nodes 1..N, each with local or rack-attached disks, solid state disks, a cache, and its own data and metadata; a SAN tier alongside]

- High scale: 1000 nodes, PBs of data
- Replicated metadata: no single point of failure
- Multiple block sizes
- Data layout control
- Smart caching
- Multi-tiered storage with automated policy-based migration
- Disaster recovery
- POSIX, MapReduce, and cloud workloads (InfoSphere BigInsights)
Excellent Performance

- Hardware configuration: 4 nodes; per node: 16 cores, 96 GB RAM, 12 JBOD disks, 10 Gbps network, data disk cache enabled
- YCSB Workload A configuration:
  - YCSB: region heap size 6 GB, presplit 100, record count 1M, threads=8, 4 region servers
  - HDFS: 1 metadata replica, 3 data replicas, 48 data disks
  - GPFS: 3 metadata replicas, 3 data replicas, page pool 32 GB, system pool 256 KB block size, data pool 2 MB block size, data block group factor 128, 48 data disks
  - HBase: major compaction disabled; minor compaction threshold 3, server threads 50/node
- Similar results for YCSB Workloads B-F
Locality Awareness: Make applications aware of where data is located

- The scheduler assigns tasks to the nodes where data chunks reside, so it needs to know the location of each chunk
- The GPFS API maps each chunk to its node location
- The map is compacted to optimize its size
[Diagram: the GPFS connector passes locality info to the task engine's location map (N1:D1); the scheduler reads the location map and schedules F(D1) on node N1, which holds chunk D1]
[Chart: Grep application time (min) with locality vs. without locality]
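To make the mechanism concrete, here is a minimal sketch of locality-aware scheduling in Python. The names (build_location_map, schedule) are hypothetical; GPFS exposes chunk locations through its own connector API, not this one.

```python
# Hypothetical sketch of locality-aware scheduling: the file system
# reports which node holds each chunk, and the scheduler prefers to
# run a task on the node that already holds its input chunk.

def build_location_map(chunk_locations):
    """Compact (chunk, node) pairs into a chunk -> node map."""
    return {chunk: node for chunk, node in chunk_locations}

def schedule(tasks, location_map):
    """Assign each task to the node holding its input chunk, if known."""
    assignments = {}
    for task, chunk in tasks:
        # Local placement avoids reading the chunk over the network.
        assignments[task] = location_map.get(chunk, "any-free-node")
    return assignments

# Example: chunk D1 lives on node N1, so F(D1) is scheduled on N1.
location_map = build_location_map([("D1", "N1"), ("D2", "N2")])
print(schedule([("F(D1)", "D1"), ("F(D2)", "D2")], location_map))
# {'F(D1)': 'N1', 'F(D2)': 'N2'}
```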
Chunks: Allow applications to define their own logical block size

- Workloads make two types of accesses:
  - Small blocks (e.g., 512 KB) for index file lookups
  - Large blocks (e.g., 64 MB) for file scans
- Solution: multiple block sizes in the same file system
- New allocation scheme:
  - A block group factor is associated with the block size
  - Effective block size = block size * block group factor
  - File blocks are laid out based on the effective block size, so an application "chunk" spans a group of FS blocks
[Chart: Grep time (min) for (metablock factor, block size) pairs (512, 256 KB), (128, 1 MB), (32, 4 MB), (8, 16 MB), (2, 64 MB), all with an effective block size of 128 MB; caching and pre-fetching effects appear at the larger block sizes near the optimum]
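Each configuration in the chart multiplies out to the same 128 MB effective block size; a quick illustrative check:

```python
# Effective block size = block size * block group factor.
# Each (factor, block size) pair from the chart multiplies out to the
# same 128 MB effective block size, i.e., 128 MB of contiguous layout.
KB, MB = 1024, 1024 * 1024

configs = [(512, 256 * KB), (128, 1 * MB), (32, 4 * MB), (8, 16 * MB), (2, 64 * MB)]
for factor, block_size in configs:
    effective = factor * block_size
    print(f"factor={factor:4d}  block={block_size // KB:6d} KB  "
          f"effective={effective // MB} MB")
```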
Write Affinity: Allow applications to dictate layout

- Wide striping with multiple replicas:
  - No locality of data on a particular node
  - Bad for commodity architectures due to excessive network traffic
- Write affinity for a file or a set of files on a given node:
  - All relevant files are allocated on that node, near the node, or in the same rack
- Configurability of replicas: each replica can be configured for either wide striping or write affinity (see the sketch after this slide)
[Diagram: reading File 1 across nodes N1-N4 under three layouts: (1) wide-striped layout, reads go over the local disk plus the network; (2) write-affinity layout, reads go over only the local disk; (3) write affinity combined with wide-striped replicas, reads go over only the local disk while recovery proceeds in parallel from the wide-striped replica]
[Chart: TeraGen time (min), WriteAffinity vs. WideStriping]
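A minimal sketch of per-replica placement, assuming hypothetical names; GPFS-FPO expresses this through pool and policy settings, not this API:

```python
import random

NODES = ["N1", "N2", "N3", "N4"]

def place_replicas(writer_node, affinity=(True, False, False)):
    """Choose a node per replica: write-affinity replicas stay on the
    writer's node; wide-striped replicas spread over distinct peers."""
    others = [n for n in NODES if n != writer_node]
    striped = iter(random.sample(others, sum(1 for a in affinity if not a)))
    return [writer_node if local else next(striped) for local in affinity]

# Replica 1 lands on the writer for local reads; replicas 2 and 3 are
# wide-striped so a failed node can be rebuilt from many peers in parallel.
print(place_replicas("N1"))
```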
Hybrid allocation: Treat metadata and data differently

- Observation: data is suited for GPFS-FPO; metadata is suited for regular GPFS
- Make the allocation type a storage pool property (a sketch follows):
  - Metadata pool: no write affinity
  - Data pools: write affinity
- GPFS-FPO has distributed metadata:
  - No single point of failure, but too many metadata nodes increase the probability of any 3 metadata nodes going down -> loss of data (assuming replication=3)
- Typical metadata profile in GPFS-FPO:
  - One or two metadata nodes per rack (failure domain)
  - Random access patterns

[Diagram: racks 1-3, with metadata allocated without write affinity and data allocated with write affinity]
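A small sketch of the pool-property idea; the names and structures are illustrative, not the GPFS configuration syntax:

```python
# Allocation type as a storage-pool property: the allocator consults
# the pool instead of hard-coding a single placement strategy.
POOLS = {
    "system":   {"holds": "metadata", "write_affinity": False},  # wide-striped
    "datapool": {"holds": "data",     "write_affinity": True},   # node-local
}

def strategy(pool_name):
    return ("write affinity (local)" if POOLS[pool_name]["write_affinity"]
            else "wide striping")

for name, pool in POOLS.items():
    print(f"{name}: holds {pool['holds']}, allocated with {strategy(name)}")
```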
Pipelined Replication: Efficient replication of data
[Diagram: a client writes blocks B0, B1, B2 into a write pipeline client -> N1 -> N2 -> N3, across nodes 1-5, each with a local disk]
[Chart: TeraGen time (min), single-source vs. pipelined replication]
- A higher degree of replication benefits from pipelining for lower latency
- A single source does not utilize the bandwidth effectively
- Pipelined replication is able to saturate a 1 Gbps Ethernet link (see the model below)
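A minimal model of why pipelining helps, under simplifying assumptions (uniform link bandwidth, latency ignored): with single-source replication the writer's uplink carries every copy serially, while a pipeline lets each hop forward data as it arrives. This is an illustration, not the GPFS implementation.

```python
# Time to replicate one block of `size_mb` MB over links of `bw` MB/s.
def single_source_time(size_mb, bw, replicas):
    # The writer's single uplink carries every copy back to back.
    return replicas * size_mb / bw

def pipelined_time(size_mb, bw, replicas, packet_mb=1.0):
    # Each hop forwards packets as they arrive: the pipeline fills in
    # (replicas - 1) packet times, then streams at full link speed.
    return (size_mb + (replicas - 1) * packet_mb) / bw

size, bw, r = 64.0, 100.0, 3     # 64 MB block, ~1 Gbps link, 3 replicas
print(f"single-source: {single_source_time(size, bw, r):.2f} s")  # 1.92 s
print(f"pipelined:     {pipelined_time(size, bw, r):.2f} s")      # 0.66 s
```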
Fast Recovery: Recover quickly from disk failures

- Handling failures is policy driven:
  - Incorporated node-failure and disk-failure triggers
  - Policy-based recovery: restripe on a node or disk failure, or alternatively rebuild the disk when it is replaced
- Fast recovery:
  - Incremental recovery: keep track of changes during the failure and recover only what is needed (a sketch follows the chart below)
  - Distributed restripe: the restripe load is spread out over all the nodes, which quickly figure out which blocks need to be recovered when a node fails
[Diagram: a well-replicated file on N1-N3; N2 reboots, so new writes go to N1 and N3; when N2 comes back, it catches up by copying the missed writes (deltas) from N1 and N3 in parallel]
[Chart: Grep time (min) with 0, 1, and 2 failed nodes]
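A minimal sketch of incremental recovery, with hypothetical structures (GPFS tracks this in file-system metadata): while a replica is down, the surviving replicas record which blocks changed, and only those deltas are copied back.

```python
# While node N2 is down, surviving replicas log which blocks changed.
# When N2 returns, it copies only the missed blocks (the deltas) and
# can pull them from N1 and N3 in parallel.
class Replica:
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.dirty = set()        # blocks written while a peer was down

    def write(self, block_id, data, peer_down):
        self.blocks[block_id] = data
        if peer_down:
            self.dirty.add(block_id)

def catch_up(recovered, survivors):
    """Copy only the deltas from surviving replicas to the recovered one."""
    for peer in survivors:
        for block_id in peer.dirty:
            recovered.blocks[block_id] = peer.blocks[block_id]
        peer.dirty.clear()

n1, n2, n3 = Replica("N1"), Replica("N2"), Replica("N3")
n1.write("B7", "new data", peer_down=True)   # N2 is rebooting
n3.write("B7", "new data", peer_down=True)
catch_up(n2, [n1, n3])
print(n2.blocks)                             # {'B7': 'new data'}
```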
Comparison with HDFS and MapR

| File System | GPFS-FPO | HDFS | MapR |
| Robust | No single point of failure | NameNode vulnerability | No single point of failure |
| Data integrity | High | Evidence of data loss | New file system |
| Scale | Thousands of nodes | Thousands of nodes | Hundreds of nodes |
| POSIX compliance | Full (supports a wide range of applications) | Limited | Limited |
| Data management | Security, backup, replication | Limited | Limited |
| MapReduce performance | Good | Good | Good |
| Workload isolation | Supports disk isolation | No support | Limited support |
| Traditional application performance | Good | Poor performance with random reads and writes | Limited |
Roadmap

- 4Q12: GA
- 1H13:
  - WAN caching
  - Advanced write affinity
  - KVM support
- 1Q14:
  - Integration with the Platform Computing scheduler: DirectShuffle
  - Encryption
  - Disaster recovery