
November 8, 2013 Sam Siewert

CS A320 Operating Systems for Engineers

Lecture 12 – Block Driver Performance Considerations

Block Drivers

Efficiency and Performance

Sam Siewert 2

Sam Siewert 3

Linux Driver Writer Resources
– "Linux Device Drivers, 3rd Edition", J. Corbet, A. Rubini, G. Kroah-Hartman, O'Reilly, 2005 (ISBN 0-596-00590-3)
– "PCI System Architecture", Tom Shanley and Don Anderson, 4th Edition, MindShare, Inc., 1999 (ISBN 0-201-30974-2)

Digital Media Filesystems
Three Types of Media Storage
– Direct Attached Storage – e.g. SATA (Serial ATA)
– Network Attached Storage – e.g. NFS
– Storage Area Networks – e.g. SAS (Serial Attached SCSI), Fibre Channel

Flash / RAM-Based SSDs Still 10x or More Costly than Spinning Media
– Predictions for Demise of HDDs and RAID?
– Cost is the Driver

Fast Storage is Either SSD, RAID or Hybrid

Sam Siewert 4

RAID Operates on LBAs/Sectors (Sometimes Files)

SAN/DAS RAID
NAS – Filesystem on Top of RAID
RAID-10, RAID-50, RAID-60
– Stripe Over Mirror Sets
– Stripe Over RAID-5 XOR Parity Sets
– Stripe Over RAID-6 Reed-Solomon or Double-Parity Encoded Sets (EVENODD / Row-Diagonal Parity, Minimum Density Codes (Liberation), Reed-Solomon Codes)
– Generalized Erasure Codes: Cauchy Reed-Solomon, LDPC (Low Density Parity Codes), WEAVER/HoVer
– MDS (Maximum Distance Separable) – For Each Parity Device, Another Level of Fault Tolerance is Provided
– Larger Drives (Multi-Terabyte), Larger Arrays (100s of Drives), and Cost Reduction are Driving RAID-6 and Higher Levels

Sam Siewert 5

RAID-10

Sam Siewert 6

[Diagram: RAID-0 Striping Over RAID-1 Mirrors – three RAID-1 mirror pairs; each block A1–A12 is written to both drives of its mirror pair, with consecutive blocks striped across the pairs.]

RAID5,6 XOR Parity Encoding

MDS Encoding Can Achieve High Storage Efficiency: N/(N+1) for an N+1 Set, N/(N+2) for an N+2 Set

Sam Siewert 7

[Chart: Storage Efficiency (0–100%) vs. Number of Data Devices (1–20) for 1 XOR (RAID5) or 2 P,Q (RAID6) Encoded Devices; efficiency rises toward 100% as the number of data devices grows, with RAID6 slightly below RAID5 for the same number of data devices.]
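The plotted efficiencies follow directly from R = N/(N+m). A minimal C sketch (not part of the lecture materials) that prints the RAID5 (m = 1) and RAID6 (m = 2) curves for 1 to 20 data devices:

#include <stdio.h>

/* Storage efficiency R = n / (n + m) for n data devices and m parity
 * (coding) devices: m = 1 models RAID5 (N+1), m = 2 models RAID6 (N+2). */
static double efficiency(int n, int m)
{
    return (double)n / (double)(n + m);
}

int main(void)
{
    printf(" n   RAID5 (N+1)   RAID6 (N+2)\n");
    for (int n = 1; n <= 20; n++)
        printf("%2d      %5.1f%%        %5.1f%%\n",
               n, 100.0 * efficiency(n, 1), 100.0 * efficiency(n, 2));
    return 0;
}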

RAID-50

Sam Siewert 8

[Diagram: RAID-0 Striping Over RAID-5 Sets – two 5-disk RAID-5 sets (Disk1–Disk5); each row holds four data blocks plus one XOR parity block, e.g. P(ABCD), with the parity block rotated across the disks. Stripe order: A1,B1,C1,D1,A2,B2,C2,D2,E1,F1,G1,H1,…,Q2,R2,S2,T2.]

RAID-60 (Reed-Solomon Encoding)

[Diagram: RAID-0 Striping Over RAID-6 Sets – two 6-disk RAID-6 sets (Disk1–Disk6); each row holds four data blocks plus P and Q parity blocks, e.g. P(ABCD) and Q(ABCD), with the parity blocks rotated across the disks. Stripe order: A1,B1,C1,D1,A2,B2,C2,D2,E1,F1,G1,H1,…,Q2,R2,S2,T2.]

How RAID Relates to Erasure Codes

Erasure Codes Applied to Disk or SSD Devices

Sam Siewert 10

RAID is an Erasure Code

RAID-1 is an MDS EC (James Plank, U. of Tennessee)

Sam Siewert 11

Comparison of ECs

Data Devices = n, Coding Devices = m, Total = n+m
Storage Efficiency: R = n/(n+m)
– RAID-1 2-Way: R = 1/(1+1) = 50%, MDS = 1, 2x Read Speed-up, 1x Write
– RAID-1 3-Way: R = 1/(1+2) = 33%, MDS = 2, 3x Read, 1x Write
– RAID-10 with 10 Sets: R = 10/(10+10) = 50%, MDS = 1, 20x Read, 10x Write
– RAID-5 with 3+1 Set: R = 3/(3+1) = 75%, MDS = 1, 3x Read (Parity Check?), RMW Penalty, Striding Issues
– RAID-6 with 5+2 Set: R = 5/(5+2) = 71%, MDS = 2, 5x Read, Reed-Solomon Encode on Write and RMW Penalty
– Beyond RAID-6?
  Cauchy Reed-Solomon Scales, but Encode/Decode Complexity is High
  Low Density Parity Codes are Simpler, but Not MDS

Sam Siewert 12

Read-Modify-Write Penalty

Any Update that is Less than the Full RAID-5 or RAID-6 Set Requires:
1. Read Old Data and Parity – 2 Reads
2. Compute New Parity (From Old & New Data)
3. Write New Parity and New Data – 2 Writes
The Only Way to Remove the Penalty is a Write-Back Cache to Coalesce Updates and Always Perform Full-Set Writes

Sam Siewert 13

[Diagram: 5-disk RAID-5 set with rotated parity, as in the RAID-50 figure above.]

Write A1: P(ABCD)new = A1new xor A1old xor P(ABCD)old
(Parity is the bitwise XOR of the data blocks A, B, C, D, so only the changed block and the old parity are needed to recompute it.)
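A minimal C sketch of this small-write parity update (illustrative only, not the lab or driver code): the new parity is the old parity XORed bytewise with the old and new copies of the block being written.

#include <stddef.h>
#include <stdint.h>

/* RAID5 read-modify-write parity update for one strip of 'len' bytes:
 *   P(ABCD)new = A1new xor A1old xor P(ABCD)old
 * The old data and old parity must already have been read (the 2 reads);
 * the caller then writes the new data and updated parity (the 2 writes). */
static void rmw_parity_update(uint8_t *parity,          /* old parity in, new parity out */
                              const uint8_t *old_data,  /* A1old */
                              const uint8_t *new_data,  /* A1new */
                              size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

A full-set write needs none of this: the new parity is simply the XOR of all the new data blocks, which is why coalescing updates in a write-back cache into full-set writes removes the penalty.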

Conclusion
– Deeper Dive Into Erasure Codes (James Plank FAST Presentation)
– Lab 3 Discussion
– Linux RAID Demos
– Driver Discussion

Sam Siewert 14

Hiding IO Latency – Overlapping with Processing

Simple Design – Each Thread has READ, PROCESS, WRITE-BACK Execution

Frame Rate is Set by READ + PROCESS + WRITE-BACK Latency
– e.g. 10 fps for 100 milliseconds total per frame
– If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK is 20 msec, the predominant time is IO time, not processing
– A disk drive with a 100 MB/sec READ rate can only read 16 fps of ~6.25 MB frames (62.5 msec READ latency per frame)

Sam Siewert 15

[Timeline: a single thread runs READ F(1) → Process F(1) → Write-back F(1) → READ F(2) → … strictly sequentially.]

Hiding IO Latency

Schedule Multiple Overlapping Threads?
– Requires Nthreads = Nstages x Ncores (see the pthread sketch after the timeline below)
– 1.5 to 2x Number of Threads for SMT (Hyper-threading)
– For IO Stage Duration Similar to Processing Time
– More Threads if IO Time (Read+WB+Read) >> 3 x Processing Time

Sam Siewert 16

[Timeline: on each of Core #1 and Core #2, three threads run staggered READ F(n) → Process F(n) → Write-back F(n) pipelines (F1/F2/F3, then F4/F5/F6, …); after start-up, each core is processing continuously because one thread's processing overlaps the other threads' IO.]
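A minimal pthread sketch of the overlapped-thread design (the stage functions and their sleep times are hypothetical stand-ins for real frame IO and processing, using the 70/10/20 msec example above): Nstages x Ncores worker threads each run their own READ → PROCESS → WRITE-BACK loop on an interleaved subset of frames, so one thread's processing overlaps the other threads' IO. Build with gcc -pthread.

#include <pthread.h>
#include <unistd.h>

#define NCORES   2
#define NSTAGES  3                      /* READ, PROCESS, WRITE-BACK   */
#define NTHREADS (NSTAGES * NCORES)     /* Nthreads = Nstages x Ncores */
#define NFRAMES  120

/* Stand-ins for the real stages; the sleeps model the IO-bound example
 * (READ 70 msec, PROCESS 10 msec, WRITE-BACK 20 msec). */
static void read_frame(int f)      { (void)f; usleep(70000); }
static void process_frame(int f)   { (void)f; usleep(10000); }
static void writeback_frame(int f) { (void)f; usleep(20000); }

static void *pipeline_worker(void *arg)
{
    long id = (long)arg;

    /* Static interleave: thread 'id' handles frames id, id+NTHREADS, ...
     * While this thread blocks in IO, the other threads' process_frame()
     * calls keep the cores busy. */
    for (int f = (int)id; f < NFRAMES; f += NTHREADS) {
        read_frame(f);
        process_frame(f);
        writeback_frame(f);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, pipeline_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}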

Hiding Latency – Dedicated IO

Schedule Reads Ahead of Processing

Requires Nthreads = 2 + Ncores

Synchronize Frame Ready/Write-backs (see the reader/worker sketch after the timeline below)
Balance Stage Read/Write-Back Latency to Processing
1.5 to 2x Threads for SMT (Hyper-threading)

Sam Siewert 17

[Timeline: a dedicated Read thread streams frames F1…F8 ahead of the workers; after a start-up wait, Core #1 processes F1, F3, F5, … and Core #2 processes F2, F4, F6, … while a dedicated Write-back thread drains completed frames F1…F6 behind them – dual-core concurrent processing to completion.]
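A minimal sketch of the dedicated-IO pattern, with hypothetical frame counts, buffer-ring size, and stage functions (the write-back thread is omitted to keep it short, so this shows 1 + Ncores of the full 2 + Ncores threads): one dedicated reader thread stays ahead of the workers, and per-slot counting semaphores provide the frame-ready / buffer-free synchronization. Build with gcc -pthread.

#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

#define NCORES  2        /* worker threads (full design adds reader + writer)   */
#define NSLOTS  4        /* ring of in-memory frame buffers; kept a multiple of
                            NCORES so each slot is always claimed by one worker */
#define NFRAMES 60

static sem_t slot_free[NSLOTS];   /* reader waits: may this buffer be reused yet? */
static sem_t slot_ready[NSLOTS];  /* worker waits: has this frame been read yet?  */

/* Stand-ins for real device IO and the frame transform. */
static void read_frame_into_slot(int f, int s) { (void)f; (void)s; usleep(20000); }
static void process_slot(int s)                { (void)s; usleep(30000); }

static void *reader(void *arg)
{
    (void)arg;
    for (int f = 0; f < NFRAMES; f++) {
        int s = f % NSLOTS;
        sem_wait(&slot_free[s]);       /* never overwrite an unprocessed frame */
        read_frame_into_slot(f, s);
        sem_post(&slot_ready[s]);      /* frame f is ready for its worker      */
    }
    return NULL;
}

static void *worker(void *arg)
{
    long core = (long)arg;             /* core 0 takes F0,F2,…; core 1 takes F1,F3,… */
    for (int f = (int)core; f < NFRAMES; f += NCORES) {
        int s = f % NSLOTS;
        sem_wait(&slot_ready[s]);      /* wait until the reader is ahead of us */
        process_slot(s);
        sem_post(&slot_free[s]);       /* hand the buffer back to the reader   */
    }
    return NULL;
}

int main(void)
{
    pthread_t rd, wk[NCORES];

    for (int s = 0; s < NSLOTS; s++) {
        sem_init(&slot_free[s], 0, 1); /* all buffers start empty */
        sem_init(&slot_ready[s], 0, 0);
    }
    pthread_create(&rd, NULL, reader, NULL);
    for (long c = 0; c < NCORES; c++)
        pthread_create(&wk[c], NULL, worker, (void *)c);
    pthread_join(rd, NULL);
    for (int c = 0; c < NCORES; c++)
        pthread_join(wk[c], NULL);
    return 0;
}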

Processing Latency Alone
Write Code with Memory-Resident Frames (see the timing sketch below)
– Load Frames in Advance
– Process In-Memory Frames Over and Over
– Do No IO During Processing
– Provides a Baseline Measurement of Processing Latency per Frame Alone
– Provides a Method of Optimizing Processing Without IO Latency
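A minimal sketch of this baseline measurement (the frame size and the transform are hypothetical placeholders for the actual lab code): all frames are resident in memory before the timed loop, and clock_gettime() gives processing latency per frame with no IO in the measurement.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NFRAMES    30
#define FRAME_SIZE (1920 * 1080 * 3)   /* hypothetical RGB HD frame             */
#define ITERATIONS 100                 /* process resident frames over and over */

/* Hypothetical in-place transform standing in for the real processing. */
static void transform_frame(unsigned char *frame, size_t len)
{
    for (size_t i = 0; i < len; i++)
        frame[i] = (unsigned char)(255 - frame[i]);
}

int main(void)
{
    /* Load all frames in advance -- no IO happens inside the timed loop. */
    unsigned char *frames = malloc((size_t)NFRAMES * FRAME_SIZE);
    if (!frames)
        return 1;
    memset(frames, 0x5a, (size_t)NFRAMES * FRAME_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < ITERATIONS; it++)
        for (int f = 0; f < NFRAMES; f++)
            transform_frame(frames + (size_t)f * FRAME_SIZE, FRAME_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("processing latency = %.3f msec/frame (no IO)\n",
           1000.0 * sec / (double)(ITERATIONS * NFRAMES));

    free(frames);
    return 0;
}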

Sam Siewert 18

IO Latency Alone
Comment Out Frame Transformation Code or Call a Stubbed NULL Function
– Provides a Measurement of IO Frame Rate Alone
– Essentially a Zero-Latency Transform – No Change Between Input and Output Frames
– Allows for Tuning of the IO Scheduler and Threading

Sam Siewert 19

Tips for IO Scheduling

blockdev --getra /dev/sda
– Should return 256 (sectors), meaning reads read ahead up to 128 KB
– read/fread calls should request as much data as possible
– Check the actual bytes read and re-read as needed in a loop (see the read_full() sketch below)

blockdev --setra 16384 /dev/sda (8 MB read-ahead)

Switch CFQ to Deadline
– Use "lsscsi" to verify your disk is /dev/sda; substitute the block driver interface used for your filesystem if not sda
– cat /sys/block/sda/queue/scheduler
– echo deadline > /sys/block/sda/queue/scheduler
– Options are noop, cfq, deadline
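A minimal C sketch of the "check actual bytes read, re-read in a loop" advice, using the plain read(2) system call (fread users would loop on the returned item count the same way): a single read() can legally return less than was asked for, so loop until the buffer is full or EOF.

#include <errno.h>
#include <unistd.h>

/* Read exactly 'len' bytes from fd into buf, retrying short reads.
 * Returns the number of bytes actually read (len, or less only at EOF),
 * or -1 on error. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t total = 0;

    while (total < len) {
        ssize_t n = read(fd, (char *)buf + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            return -1;                 /* real error                     */
        }
        if (n == 0)
            break;                     /* EOF before the request was met */
        total += (size_t)n;            /* short read: ask for the rest   */
    }
    return (ssize_t)total;
}

Issuing large read_full() requests, together with the larger read-ahead and deadline scheduler settings above, lets the block layer stream the disk sequentially rather than in small bursts.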

Sam Siewert 20