
Page 1: DocumentM1

Version 3.10 LISA 2006 1

©1994-2006 Hal Stern, Marc Staveley

System & Network

Performance Tuning

Hal Stern, Sun Microsystems

Marc Staveley, SOMA Networks

This tutorial is copyright 1994-1999 by Hal L. Stern and 1998-2006 by Marc Staveley. It may not be used in whole or part for commercial purposes without the express written permission of Hal L. Stern and Marc Staveley.

Hal Stern is a Distinguished Systems Engineer at Sun Microsystems. He was the System Administration columnist for SunWorld from February 1992 until April 1997, and previous columns and commentary are archived at: http://www.sun.com/sunworldonline.

Hal can be reached at [email protected].

Marc Staveley is the Director of IT for SOMA Networks Inc. He is a frequent speaker on the topics of standards-based development, multi-threaded programming, system administration and performance tuning.

Marc can be reached at [email protected]

Some of the material in the Notes sections has been derived from columns and articles first appearing in SunWorld, Advanced Systems and SunWorld Online. Hal thanks IDG and Michael McCarthy for their flexibility in allowing him to retain the copyrights to these pieces.

Rough agenda:

9:00 - 10:30 AM Section 1
11:00 - 12:30 PM Section 2
1:30 - 3:00 PM Section 3
3:30 - 5:00 PM Section 4

Page 2: DocumentM1

Version 3.10 LISA 2006 2

©1994-2006 Hal Stern, Marc Staveley

Syllabus

· Tuning Strategies & Expectations

· Server Tuning

· NFS Performance

· Network Design, Capacity Planning & Performance

· Application Architecture

Some excellent books on the topic:

Raj Jain, The Art of Computer Systems Performance Analysis (Wiley)

Mike Loukides, System Performance Tuning (O'Reilly)

Adrian Cockcroft and Richard Pettit, Sun Performance and Tuning, Java and the

Internet (SMP/PH)

Craig Hunt, TCP/IP Network Administration (O'Reilly)

Brian Wong, Configuration and Capacity Planning for Solaris Servers

(SunSoft/PH)

Richard McDougall et al., Sun BluePrints: Resource Management (SMP/PH)

Some Web resources:

Solaris Tunable Parameters Reference Manual

(http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters/)

Solaris 2 - Tuning Your TCP/IP Stack and More (http://www.sean.de/Solaris)

Page 3: DocumentM1

Version 3.10 LISA 2006 3

©1994-2006 Hal Stern, Marc Staveley

Tutorial Structure

· Background and internals

- Necessary to understand user-visible symptoms

· How to play doctor

- Diagnosing problems

- Rules of thumb, upper and lower bounds

· Kernel tunable parameters

- Formulae for deriving where appropriate

If you take only two things from the whole tutorial, they should be:

- Disk configuration matters

- Memory matters

Page 4: DocumentM1

Version 3.10 LISA 2006 4

©1994-2006 Hal Stern, Marc Staveley

Tuning Strategies &

Expectations

Section 1

Page 5: DocumentM1

Version 3.10 LISA 2006 5

©1994-2006 Hal Stern, Marc Staveley

Topics

· Practical goals

- Terms & conditions

- Workload characterization

· Statistics and ratios

- Monitoring intervals

· Understanding diagnostic output

Page 6: DocumentM1

Version 3.10 LISA 2006 6

©1994-2006 Hal Stern, Marc Staveley

Practical Goals

Section 1.1

Page 7: DocumentM1

Version 3.10 LISA 2006 7

©1994-2006 Hal Stern, Marc Staveley

Why Is This Hard?

(Diagram: a business transaction decomposes into database transactions, handled by the transaction monitor, DBMS organization and SQL optimizer, which in turn consume system resources: CPU, network latency, disk I/O and user CPU. The diagram is annotated with two trends across these layers: increasing loss of correlation and decreasing ease of measurement.)

The problem with un-correlated inputs and measurements is akin to that of

driving a car while blindfolded: the passenger steers while elbowing you to

work the gas and brakes. When your reflexes are quick, you can manage, but if

you misinterpret a signal, you end up rebooting your car.

Correlating user work with system resources is what Sun's Dtrace and

FreeBSD's ktrace attempt to do.

Page 8: DocumentM1

Version 3.10 LISA 2006 8

©1994-2006 Hal Stern, Marc Staveley

Social Contract Of Administration

· Why bother tuning?

- Resource utilization, purchase plans, user outrage

· Users want 10x what they have today

- sound and video today, HDTV tomorrow

- Simulation and decision support capabilities

· Application developers should share

responsibility

- Who owns educational process?

- Performance and complexity trade-off

- Load, throughput and cost evaluations

System administrators today are playing a difficult game of perception

management. Hardware prices have declined to the point where most

managers believe you can get Tandem-like fault tolerance at PC prices with no

additional software, processes or disciplines. Much of this tutorial is about

acquiring, enforcing and insisting on discipline.

Page 9: DocumentM1

Version 3.10 LISA 2006 9

©1994-2006 Hal Stern, Marc Staveley

Tuning Potential

· Application architecture: 1,000x

- SQL, query optimizer, caching, system calls

· Server configuration: 100x

- Disk striping, eliminate paging

· Application fine-tuning: 2-10x

- Threads, asynchronous I/O

· Kernel tuning: less than 2x on tuned system

- If kernel bottleneck is present, then 10-100x

- Kernel can be a binary performance gate

Here is how some "laws" of the computing realm compare:

Moore's law predicts a doubling of CPU horsepower every 18 months, so that

gives us about a 16x improvement in 6 years.

If you look at reported transaction throughput for Unix database systems,

though, you'll see a 100x improvement in the past 6 years -- there's more than

just compute horsepower at work. What we've measured is the result of

operating systems, disks, parallelism, bus throughput and improved

applications.

An excellent discussion of "rules of thumb" as a consequence of Moore's Law is

found in Gray and Shenoy's Rules of thumb in data engineering, Microsoft

Research technical report MS-TR-99-100, Feb. 2000.

Page 10: DocumentM1

Version 3.10 LISA 2006 10

©1994-2006 Hal Stern, Marc Staveley

Practical Tuning Rules

· There is no "ideal" state in a fluid world

· Law of diminishing returns

- Early gains are biggest/best

- More work may not be cost-effective

· Negativism prevails

- Easy to say "This won't work"

- Hard to prove configuration can deliver on goals

· Headroom for well-tuned applications?

- Good tuning job introduces new demands

· Kaizen

Page 11: DocumentM1

Version 3.10 LISA 2006 11

©1994-2006 Hal Stern, Marc Staveley

Terminology: Bit Rates

· Bandwidth

- Peak of the medium, bus: what's available

- Easy to quote, hard to reach

· Throughput

- What you really get: useful data

- Protocol dependent

· Utilization

- How much you used

- Not just throughput/bandwidth

- 100% utilized with useless data: collisions

Bandwidth => Utilization => Throughput

Each measurement shows a slight (or sometimes great) loss over the previously

ordered metric.

Formal definitions:

Bandwidth: the maximum achievable throughput under ideal workload

conditions (nominal capacity)

Throughput: rate at which the requests can be serviced by the system.

Utilization: the fraction of time the resource is busy servicing requests.

Page 12: DocumentM1

Version 3.10 LISA 2006 12

©1994-2006 Hal Stern, Marc Staveley

Terminology: Time

· Latency

- How long you wait for something

· Response time

- What user sees: system as a black box

- Standard measures:
TPC-C: transactions per minute
TPC-D: queries per hour

· Bad Things

- Knee, wall, non-linear

(Graph: throughput vs. load, marking the knee capacity and the usable capacity)

Page 13: DocumentM1

Version 3.10 LISA 2006 13

©1994-2006 Hal Stern, Marc Staveley

Example

· Bandwidth to NYC

- 10 lanes x 5 cars/s x 4 people/car = 200 pps

· Throughput

- 1 person/car (bad protocol), 1-2 cars/s (congestion)

- Parking delays (latency)

· How to fix it

- Increase number of lanes (bandwidth)

- More people per vehicle (efficient protocol)

- Eliminate toll (congestion)

- Better parking lots (reduce latency)

Tolls add to latency (since you have to stop and pay them) and also to

congestion when traffic merges back into a few lanes. Congestion from traffic

merges is another form of increased latency.

Now consider this: You wire your office with 100baseT to the desktops, feeding

into 1000baseT switched Ethernet hubs. If you run 16 desktops into each

switch, you're merging 16 * 100 = 1600 Mbits/sec into a 1000 Mbits/sec

"tunnel".

Page 14: DocumentM1

Version 3.10 LISA 2006 14

©1994-2006 Hal Stern, Marc Staveley

Unit Of Work Paradox

· Unit of work is the typical "chunk size" for

- Network traffic

- Disk I/O

· Small units optimized for response time

- Network transfer latency, remote processing

· Large units optimized for protocol efficiency

- Compare ftp (~4% waste) & telnet (~90% waste)

- Ideal for large transfers like audio, video

· Where does ATM fit?

ATM uses fixed-size cells, making it well suited to latency-sensitive audio and video streams that need to be optimized for response time. Unfortunately, the cells are very small (48 bytes of payload), so ATM incurs a large processing overhead for bulk transfers of large files, such as stored audio or video clips.

Page 15: DocumentM1

Version 3.10 LISA 2006 15

©1994-2006 Hal Stern, Marc Staveley

Workload Characterization

· What are the users (processes) doing?

- Estimating current & future performance

- Understanding resource utilization

· Fixed workloads

- Easy to characterize & project

· Random workloads

- Take measurements, look at facilities over time

- Tools & measurements intervals

Page 16: DocumentM1

Version 3.10 LISA 2006 16

©1994-2006 Hal Stern, Marc Staveley

Completeness Counts

· Random or sequential access?

- Koan of this tutorial

· Don't say: 1,000 NFS requests/second

- Read/write and attribute browsing mix?

- Average file size and lifetime?

- Working set of files?

· Don't say: 400 transactions/second

- Lookup, insert, update mix?

- Indexes used?

- Checkpoints, logs, 2-phase commit?

Page 17: DocumentM1

Version 3.10 LISA 2006 17

©1994-2006 Hal Stern, Marc Staveley

Statistics & Ratios

Section 1.2

Page 18: DocumentM1

Version 3.10 LISA 2006 18

©1994-2006 Hal Stern, Marc Staveley

Useful Metrics

· Latency over utilization

- Loaded resources may be sufficient

- What does the user see?

· Peak versus average load

- How system reacts under crunch

- What are new failure modes at peaks?

· Time to:

- Recover, repair, rebuild from faults

- Accommodate new workflow

· Managing applications

Page 19: DocumentM1

Version 3.10 LISA 2006 19

©1994-2006 Hal Stern, Marc Staveley

Recording Intervals

· Instantaneous data rarely useful

- Computer and business transactions long-lived

- Smooth out spikes in small neighbourhoods

· Long-term averages aren't useful either

- Peak demand periods disappear

- Can't tie resources to user functions

· Combine intervals

- 5-10 seconds for fine-grain work (OLTP)

- 10-30 seconds for peak measurement

- 10-30 minutes for coarse-grain activity (NFS)

Page 20: DocumentM1

Version 3.10 LISA 2006 20

©1994-2006 Hal Stern, Marc Staveley

Nyquist Frequency

· Same total load between B and D

- Peaks are different at C

- Sampling frequency determines accuracy

· Nyquist frequency is >2x "peak cycle"

- Peaks every 5 min, sample every 2.5 min

(Graph: two load curves over time, sampled at points A through E)

The total area under the two curves is about the same from "B" to "D". If you

simply measure at these endpoints and take an average, you'll think the two

loads are the same, and miss the peaks. If you measure at twice the frequency of

the peaks -- "B", "C" and "D", you'll see that peak demand is greater than the

average on the green-lined system.

The Nyquist theorem: to reconstruct a sampled input signal accurately,

sampling rate must be greater than twice the highest frequency in the input

signal.

The Nyquist frequency: the sampling rate / 2
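
As a concrete illustration (not from the original slides): if demand peaks recur roughly every 5 minutes, sample at least every 2.5 minutes, for example with sar or iostat.

% sar -u 150 48       # CPU utilization every 150 seconds, 48 samples (2 hours)
% iostat -x 150 48    # disk service times at the same interval

The 150-second interval is just 2.5 minutes expressed in seconds; pick an interval no longer than half your shortest expected peak-to-peak cycle.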

Page 21: DocumentM1

Version 3.10 LISA 2006 21

©1994-2006 Hal Stern, Marc Staveley

Normal Values

· Maintain baselines

- "But it was faster on Tuesday!"

- Distinguish normal and threshold-crossing states

- Correlate to type of work being done (user model)

· Scalar proclamations aren't valuable

- CPU load without application knowledge

- Disk I/O traffic without access patterns

- Memory usage without cache hit data

Page 22: DocumentM1

Version 3.10 LISA 2006 22

©1994-2006 Hal Stern, Marc Staveley

Effective Ratios

· Find relationships between work and resources

- Units of work: NFS operations, DB requests

- Units of management: disks, memory, network

· Use correlated variables

- Or ratios are just randomly scaled samples

· Measure something to be improved

- Bad example: Bugs/lines of code

- Good example: collisions/packet size

· Confidence intervals

- Sensitivity of ratio & error bars (accuracy)

Be sure you can control granularity of the denominator. That is, you shouldn't

be able to cheat by increasing the denominator and lowering a cost-oriented

ratio, showing false improvement. Bugs per line of code is a miserable metric

because the code size can be inflated. Quality is the same but the metric says

you've made progress.

The accuracy of a ratio is multiplied by its sensitivity - a small understatement in

a ratio that grows superlinearly with its denominator turns into a large error.

When you multiply two inexact numbers, you also multiply their errors

together. Looking at 50 I/O operations per second, plus or minus 5 Iops is

reasonable, but 50 Iops plus or minus 45 Iops is the same as taking a guess.

The Arms index, named for Richard Arms, is sometimes called the TRIN

(Trading Index). It's a measure of correlation between the price and volume

movements of the NYSE. Instead of looking at up stocks/down stocks or up

volume/down volume, the Arms index computes

(up stocks/down stocks) / (up vol/down vol)

When the index is at 1.0, the up and down volumes reflect the number of issues

moving in each direction. An index of 0.5 means advancing issues have twice

the share volume of decliners (strong); an index over 2.0 means the decliners are

outpacing the gainers on a volume basis.

Page 23: DocumentM1

Version 3.10 LISA 2006 23

©1994-2006 Hal Stern, Marc Staveley

Understanding

Diagnostic Output

Section 1.3

Page 24: DocumentM1

Version 3.10 LISA 2006 24

©1994-2006 Hal Stern, Marc Staveley

General Guidelines

· Use whatever works for you

- Make sure you understand output format & scaling

- Know inconsistencies by platform & tool

· Ignore the first line of output

- Average since system was booted

- Interval data is more important

· Accounting system

- Source of accurate fine-grain data

- Must be turned on explicitly on most systems

Process accounting gives you detailed break-downs of resource utilization,

including the number of system calls, the amount of CPU used, and so on. This

adds at most a few percent to system overhead. While accounting can be about

5% in worst case, auditing (used for security and fine-grain access control) adds

between 10-20% overhead. Auditing tracks every operation from a user process

into the kernel.

If your system stays up for a long (100 days or more) period of time, you may

find some of the counters wrap around their 31-bit signed values, producing

negative reported values.

Page 25: DocumentM1

Version 3.10 LISA 2006 25

©1994-2006 Hal Stern, Marc Staveley

Standard UNIX System Tools

· vmstat, sar

- Memory, CPU and system (trap) activity

- sar has more detail, histories

- vmstat uses KB, sar uses pages

· iostat

- Disk I/O service time and operation workhorse

· nfsstat

- Client and server side data

· netstat

- TCP/IP stack internals

- pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx, pstop, prun, pwait, ptree, ptime (Solaris): display various pieces of information about process(es) in the system.

- mpstat (Solaris, Linux): per-processor statistics, e.g. faults, inter-processor cross-calls, interrupts, context switches, thread migrations, etc.

- top (all), prstat (Solaris): show an updated view of the processes in the system.

- memtool (Solaris <= 8): everything you ever wanted to know about the memory usage in a Solaris box [http://playground.sun.com/pub/memtool]

- mdb ::memstat (Solaris >= 9): same info as memtool

- lockstat, Dtrace (Solaris >= 10): what are the processes and kernel really doing?

- setoolkit (Solaris, and soon others): virtual performance experts [http://www.setoolkit.com]

- kstat (Solaris): display kernel statistics

- RRDB/ORCA/Cricket/MRTG/NRG/Smokeping/HotSaNIC/OpenNMS: performance graphing tools

- HP Perfview: part of OpenView

Page 26: DocumentM1

Version 3.10 LISA 2006 26

©1994-2006 Hal Stern, Marc Staveley

Accounting

· 7 processes running on a loaded system

- top or ps show "cycling" of processes on CPUs

- Which one is the pig in terms of user CPU, system calls, disk I/O initiated?

· Accounting data shows per-process info

- Memory

- CPU

- System calls

Turn on accounting. Mike Shapiro (Distinguished Engineer at Sun, and all-round kernel guru) claims that the overhead of accounting is low. The kernel always collects the data; you just pay the I/O overhead to write it to disk.
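
A hedged sketch of enabling System V process accounting on Solaris and then hunting for the pig; paths and options are from memory, so check acct(1M) and acctcom(1) on your release.

# /etc/init.d/acct start      # begin writing records to /var/adm/pacct
# acctcom -b | head -30       # most recent processes first: CPU time, chars, blocks
# acctcom -C 10               # only processes that used more than 10 CPU seconds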

Page 27: DocumentM1

Version 3.10 LISA 2006 27

©1994-2006 Hal Stern, Marc Staveley

Output Interpretation: vmstat

% vmstat 5

procs memory page disk faults cpu

r b w free re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id

1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41

3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18

- procs - running, blocked, swapped

- fre - free memory (not process, kernel, cache)

- re - reclaims, page freed but referenced

- at - attaches, page already in use (ie, shared library)

- pi/po - page in/out rates

- fr - page free rate

- sr - paging scan rate

Always, always drop the first line of output from system tools like vmstat. It

reflects totals/averages since the system was booted, and isn't really meaningful

data (certainly not for debugging).

You'll see the fre column start high - close to the total memory in the system -

and then sink to about 5% of the total memory over time, in systems like Solaris

(<= 2.6), Irix and other V.4 variants. This is due to file and process page caching,

and is perfectly normal.

Page 28: DocumentM1

Version 3.10 LISA 2006 28

©1994-2006 Hal Stern, Marc Staveley

Interpretation, Part 2

% vmstat 5
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18

- disk - disk operations/sec, use iostat -D

- in - interrupts/sec, use vmstat -i

- sy - system calls/sec

- cs - context switches/sec

- us - % CPU in user mode

- sy - % CPU in system mode

- id - % CPU idle time

- swap (Solaris) - amount of swap space used

- mf (Solaris) - minor fault, did not require page in (zero fill on demand, copy on write, segmentation or bus errors)

Zero fill on demand (ZFOD) pages are paged in from /dev/zero, and produce

(as you would expect) a page filled with zeros, quite useful for the initialized

data segment of a process

Page 29: DocumentM1

Version 3.10 LISA 2006 29

©1994-2006 Hal Stern, Marc Staveley

Example #1

procs memory page disk faults cpu

r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id

2 0 0 1788 0 1 36 0 0 0 0 6 0 0 0 42 45 297 97 2 1

3 0 0 2000 0 1 60 0 0 0 0 2 0 0 0 83 97 226 94 4 2

· High user time, little/no idle time

· Some page-in activity due to filesystem reads

· Application is CPU bound

Page 30: DocumentM1

Version 3.10 LISA 2006 30

©1994-2006 Hal Stern, Marc Staveley

Example #2

procs memory page disk faults cpu

r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id

3 11 0 1788 0 0 34 0 0 0 0 24 10 0 0 34 272 310 25 58 17

3 10 0 2000 0 0 30 0 0 0 0 14 12 0 0 35 312 340 26 55 19

· Heavy disk activity resulting from system calls

· Heavy system CPU utilization, but still some idle time

· Database or web server with badly tuned disks

· Lower system call rate implies NFS server, same problems

System calls can "cause" interrupts (when I/O operations complete), network

activity, and disk activity. A high volume of network inputs (such as NFS traffic

or http requests) can cause the same effects, so it's important to dig down

another level to find the source of the load.

Page 31: DocumentM1

Version 3.10 LISA 2006 31

©1994-2006 Hal Stern, Marc Staveley

Example #3

procs memory page disk faults cpu

r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id

3 0 0 1788 0 0 4 0 0 0 0 1 0 0 0 534 10 25 15 80 5

2 0 0 2000 0 0 3 0 0 0 0 1 0 0 0 515 12 30 15 83 2

· High interrupt rate without disk or system call

activity

· Implies network, serial port or PIO device

generating load

· Host acting as router, unsecured tty port or a

nasty token ring card

Page 32: DocumentM1

Version 3.10 LISA 2006 32

©1994-2006 Hal Stern, Marc Staveley

Example #4

procs memory page disk faults cpu

r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id

3 3 0 1788 0 12 54 30 60 0 100 53 0 0 0 64 110 105 15 10 75

2 4 0 2000 0 10 43 28 58 0 110 41 0 0 0 60 112 130 12 10 78

· Page-in/page-out and free rates indicate VM

system is busy

· High idle time from waiting on disk

· Paging/swapping to root disk (primary swap

area)

· Machine is memory starved

Page 33: DocumentM1

Version 3.10 LISA 2006 33

©1994-2006 Hal Stern, Marc Staveley

Server Tuning: A single machine

(works for desktops too)

Section 2

Page 34: DocumentM1

Version 3.10 LISA 2006 34

©1994-2006 Hal Stern, Marc Staveley

Topics

· CPU utilization

· Memory consumption & paging space

· Disk I/O

· Filesystem optimizations

· Backups & redundancy

Page 35: DocumentM1

Version 3.10 LISA 2006 35

©1994-2006 Hal Stern, Marc Staveley

Tuning Roadmap

· Eliminate or identify CPU shortfall

· Reduce paging and fix memory problems

· Balance disk load

- Volume management

- Filesystem tuning

- Backups & integrity planning

Do steps in this order.

Page 36: DocumentM1

Version 3.10 LISA 2006 36

©1994-2006 Hal Stern, Marc Staveley

CPU Utilization

Section 2.1

Page 37: DocumentM1

Version 3.10 LISA 2006 37

©1994-2006 Hal Stern, Marc Staveley

Where Do The Cycles Go?

· > 90% user time

- Tune application code, parallelize

· > 30% system time

- User-level processes: system programming

· Kernel-level work consumes system time

- NFS, DBMS calls, httpd calls, IP routing/filtering

· NIS, DNS (named), httpd are user-level

- High system-level CPU without corresponding user-level CPU is unusual in these configurations

Perhaps the best tool for quickly identifying CPU consumers is top.

top is a continuously updating, screen-oriented version of ps that runs on nearly every Unix variant.

A high system CPU % on an NIS or DNS server could indicate that the machine

is also acting as a router, or handling other network traffic.

Page 38: DocumentM1

Version 3.10 LISA 2006 38

©1994-2006 Hal Stern, Marc Staveley

Idle Time

· > 10% idle

- I/O bound, tune disks

- Input bound, tune network

· %wait, %wio are for disk I/O only

- Network I/O shows up as idle time

- RPC, NIS, NFS are not I/O waits

One possibility for high idle time is that the system is really doing nothing. This

is fine if you aren't running any jobs, but if you are expecting input and aren't

getting it, it's time to look away from the client/server and at the network. The

client trying to send on the network will show a variety of network contention &

latency problems, but the server will appear to be idle.

Page 39: DocumentM1

Version 3.10 LISA 2006 39

©1994-2006 Hal Stern, Marc Staveley

Multiprocessor Systems

· vmstat, sar show averages

· Example: 25% user time on 4-way host

- 4 CPUs at 25% each

- 2 CPUs at 50% each, 2 idle

- 1 CPU at 100%, 3 idle

· Apply rules on per-CPU basis

· System-specific tools for breakdown

- mpstat, psrinfo (Solaris 2.x)

Page 40: DocumentM1

Version 3.10 LISA 2006 40

©1994-2006 Hal Stern, Marc Staveley

A Puzzle

· Server system with framebuffer behaves well

(mostly)

· Periodically experiences major slowdown

- File service slows to crawl

- User and system CPU total near 100%

· Can never find problem on console; problem

disappears when monitoring begins

Page 41: DocumentM1

Version 3.10 LISA 2006 41

©1994-2006 Hal Stern, Marc Staveley

Controlling CPU Utilization

· Process "pinning"

- Maintain CPU cache warmth

- Cut down on MP bus/backplane traffic

- Unclear effects for multi-threaded processes

· Resource segregation

- Scheduler tables

- Process serialization

- Memory allocation

- E10K domains

· OS may do a better job than you do!

"pinning" in Solaris may be done with the "psr" commands: psrset, psrinfo.

Page 42: DocumentM1

Version 3.10 LISA 2006 42

©1994-2006 Hal Stern, Marc Staveley

Process Serialization

· Multiple user processes: good MP fit

- Memory, disk must be sufficient

· Resource problems

- # jobs > # CPUs

- sum(memory) > available memory

- Cache thrashing (VM or CPU)

· Resource management to the rescue

The key win of using a batch scheduler is that it controls usage of memory and

disk resources as well. Even if you're not CPU bound, a job scheduler can

eliminate contention for memory (discussed later) by controlling the total

memory footprint of jobs that are runnable at any one time. When you're short

on memory, 2 x 1/2 isn't 1; it's more like 0.5

Page 43: DocumentM1

Version 3.10 LISA 2006 43

©1994-2006 Hal Stern, Marc Staveley

Resource Management

· Job Scheduler: serialization

- Batch queue system

- Line up jobs for CPUs like bank tellers

- Manage total memory footprint

· Batch Scheduler: prioritization

- Modifies scheduler to only let some jobs run when system is idle

· Fair Share Scheduler: parallelization

- Gives groups of processes "shares" of memory and CPU

Your goal with the job scheduler is to reduce the average wait time for a job. If

the typical time to complete is 5 minutes for a job when, say, 5 jobs run in

parallel, then you should try getting the average completion time down into the

1 1/2 to 3 minute range by freeing up resources for each job to run as fast as

possible. Even though the jobs run serially, the expected time to completion is

lower when each job finishes more quickly.

A batch scheduler for Solaris is available from Sun PS's Customer Engineering

group

An example of one produced using the System V dispatch table is described in

SunWorld, July 1993

Page 44: DocumentM1

Version 3.10 LISA 2006 44

©1994-2006 Hal Stern, Marc Staveley

Context Switches

· What is a context switch? (cs or ctx)

- New runnable thread (kernel or user) gets CPU

- Rates vary on MP systems

· Causes

- Single running process yields to scheduler

- Interrupt makes another process runnable

- Process waits on I/O or event (signal)

· A symptom, not a problem

- With high interrupt rates: I/O activity

- With high system call rates: bad coding

Page 45: DocumentM1

Version 3.10 LISA 2006 45

©1994-2006 Hal Stern, Marc Staveley

Traps and System Calls

· What is a trap?

- User process requests operating system help

· Causes

- System call, page fault (common)

- Floating Point Exception

- Unimplemented instructions

- Real memory errors

· Less common traps are cause for alarm

- Wrong version of an executable

- Hardware troubles

Version mismatches:

SPARC V7 has no integer multiply/divide

SPARC V8 has imul/idiv, and optimized code uses it. When run on a SPARC

V7 machine, each imul generates an unimplemented instruction trap, which the

kernel handles through simulation, using the same user-level code the compiler

would have inserted for a V7 chip.

Symptoms of this problem: very high trap rate (on the order of thousands per

second, or about one per arithmetic operation) but no system calls. Normally, a

high trap rate is coupled with a high system call rate -- the system calls generate

traps to get the kernel's attention.

Page 46: DocumentM1

Version 3.10 LISA 2006 46

©1994-2006 Hal Stern, Marc Staveley

Memory Consumption

& Paging (Swap) Space

Section 2.2

Page 47: DocumentM1

Version 3.10 LISA 2006 47

©1994-2006 Hal Stern, Marc Staveley

Page Lifecycle

· Page creation: at boot time

· Page fills

- From file: executable, mmap()

- From process: exec()

- Zero Fill On Demand (zfod): /dev/zero

· Page uses

- Kernel and its data structures

- Process text, data, heap, stack

- File cache

· Pages backed by filesystem or paging (swap)

space

/dev/zero is the "backing store" for zero-filled pages. It produces an endless

stream of zeros -- you can map it, read it, or cat it, and you get zeros.

/dev/null is a bottomless sink – you write to it and the data disappears.

Reading from /dev/null produces an immediate end of file, not pages of

zeros.

Page 48: DocumentM1

Version 3.10 LISA 2006 48

©1994-2006 Hal Stern, Marc Staveley

Filesystem Cache

· System V.4 uses main memory

- Systems run with little free memory

- Available frames used for files

· Side effects

- Some page freeing normal

- All filesystem I/O is page in/out

· Solaris (>= 8)

- Uses the cyclic page cache for filesystem pages: the filesystem cache lives on the free list

Page 49: DocumentM1

Version 3.10 LISA 2006 49

©1994-2006 Hal Stern, Marc Staveley

Paging (Swap) Space & VM

· BSD world

- Total VM = swap plus shared text segments

- Must have swap at least as large as memory

- Can run out of swap before memory

· Solaris world

- Total VM = swap + physical memory - N pages

- Can run swap-less

- Swap = physical memory "overage"

· Running out of swap

- EAGAIN, no more memory, core dumps

If you run swapless, you cannot catch a core dump after a reboot (since there is

no space for the core dump to be written).
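
On Solaris you can watch swap reservation directly; a quick sketch (see swap(1M)):

% swap -l    # physical swap devices/files and how much of each is free
% swap -s    # total virtual swap: allocated + reserved vs. available

When the "available" figure from swap -s approaches zero, expect the EAGAIN and allocation failures described above.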

Page 50: DocumentM1

Version 3.10 LISA 2006 50

©1994-2006 Hal Stern, Marc Staveley

Estimating VM Requirements

· Look at output of ps command

- RSS: resident set size, how much is in memory; total the RSS field for a lower bound

- SZ: stack and data, backed by swap space; total the SZ field for an upper bound, a good first cut

· Memory leaks

- Processes grow: SZ increases

- Examine use of malloc()/free()

- Will exhaust paging space; may hang the system

Under Solaris, you can use the memtool package to estimate VM requirements

(http://playground.sun.com/pub/memtool).

Memory leaks are covered in more detail in Section 5, as an application problem.

Your first indication that you have an issue is when you notice VM problems,

which should point back to an application problem, so we mention it here first.
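
A rough sketch of the totals described above, using the Solaris ps keywords rss and vsz (both reported in KB); other platforms use different field names:

% ps -eo rss= | awk '{sum += $1} END {print sum/1024, "MB resident (lower bound)"}'
% ps -eo vsz= | awk '{sum += $1} END {print sum/1024, "MB virtual (upper bound)"}'

Shared pages are counted once per process, so these totals overstate real demand; treat them as bounds, not measurements.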

Page 51: DocumentM1

Version 3.10 LISA 2006 51

©1994-2006 Hal Stern, Marc Staveley

Paging Principles

· Reclaim pages when memory runs low

- Start running pagedaemon (pid 2)

· Crisis avoidance

- Guiding principle of VM system

· Page small groups on demand

- Keep small pool free

- Swap to make large pools available

- Compare 200M swapped out in one step to 64M paged out in 16,000 steps

Page 52: DocumentM1

Version 3.10 LISA 2006 52

©1994-2006 Hal Stern, Marc Staveley

VM Pseudo-LRU Analysis

· pagedaemon runs every 1/4 second

- Runs for 10 msec to "sweep a bit"

· Clock algorithm

- Pages arranged in logical circle

(Diagram: the fronthand and backhand of the clock sweep around the circle of pages, a fixed distance, handspreadpages, apart)

The hands of the "clock" sweep through the page structures at the same rate, at a fixed distance apart (handspreadpages).

If the fronthand encounters a page whose reference bit is on, it turns the bit off. When the backhand looks at the page later, it checks the bit. If the bit is still off, nothing referenced this page since the fronthand looked at it. The page may move onto the page freelist (or be written to swap).

The rate at which the hands sweep through the page structures varies linearly with the amount of free memory. If the amount of free memory is lotsfree, the hands move at a minimum scan rate, slowscan. As the amount of free memory approaches 0, the scan rate approaches fastscan.

handspreadpages determines the amount of time an application has to touch a page before it will be stolen for the free list.

Page 53: DocumentM1

Version 3.10 LISA 2006 53

©1994-2006 Hal Stern, Marc Staveley

VM Thresholds (Solaris >= 2.6)

· Lotsfree: defaults to 1/64 of memory

- Point at which paging starts

- Up to 30% of memory (not enforced)

· desfree: panic button for swapper

- ½ lotsfree

· minfree: unconditional swapping

- ½ desfree

- low water mark for free memory
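
A worked example, assuming a hypothetical 1 GB machine with 8 KB pages (131,072 pages):

lotsfree = 131072 / 64 = 2048 pages (16 MB)   paging starts here
desfree  = lotsfree / 2 = 1024 pages (8 MB)
minfree  = desfree / 2  =  512 pages (4 MB)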

Page 54: DocumentM1

Version 3.10 LISA 2006 54

©1994-2006 Hal Stern, Marc Staveley

VM Thresholds (cont.)

· cachefree

- Solaris 2.6 (patch 105181-10) and Solaris 7; not Solaris >= 8

- lotsfree * 2

- page scanner looks for unused pages that are not

claimed by executables (file system buffer cache pages)

· cachefree > lotsfree > desfree > minfree

- Strict ordering

- lotsfree-desfree gap should be big enough for a typical process creation or malloc(3) request.

If priority_paging=1 is set in /etc/system then cachefree is set to twice

lotsfree (otherwise cachefree == lotsfree), and slowscan moves to

cachefree (see next slide).

Priority paging gave desktop performance increases of 10% to 300%.

It is not clear whether it helps servers; that depends on the type. It is typically not good for file servers.

Page 55: DocumentM1

Version 3.10 LISA 2006 55

©1994-2006 Hal Stern, Marc Staveley

VM Thresholds in action

(Graph: page scan rate vs. free memory. The scan rate climbs from slowscan (100 pages/s) toward fastscan (8192 pages/s) as free memory falls from cachefree (32 MB) through lotsfree (16 MB), desfree (8 MB) and minfree (4 MB).)

minfree is needed to allow "emergency" allocation of kernel data structures such

as socket descriptors, stacks for new threads, or new memory/VM system

structures. If you dip below minfree, you may find you can't open up new

sockets (and you'll see EAGAIN errors at user level).

The speed at which you crash through lotsfree toward minfree is driven by the

demand for memory. The faster you consume memory, the more headroom you

need above minfree to allow the system to absorb the new demand.

Solaris >= 2.6

fastscan = min( ½ mem, 64 MB)

slowscan = min( 1/20 mem, 100 pages)

handspreadpages = fastscan

Therefore all of memory is scanned in 2 (20) seconds at fastscan (slowscan) and

an application has 1 (10) seconds to reference a page before it will be put on the

free list [for a 128MB machine, like they still exist...]

Page 56: DocumentM1

Version 3.10 LISA 2006 56

©1994-2006 Hal Stern, Marc Staveley

Sweep Times

· Time required to scan all of memory

- physmem/fastscan lower bound

- physmem/slowscan upper bound

· Shortest window for pages to be touched

- handspreadpages/fastscan

· Application-dependent tuning

- Increase handspread, especially on large memory machines

- Match LRU window (coarsely) to transaction duration

As an example of an upper bound on the scanning time: consider slowscan at

100 pages/second, and a 640M machine with a 4K pagesize. That's 160K pages,

meaning a full memory scan will take 1600 seconds. Crank up the value of

fastscan to reduce the round-trip scanning time if required

The output of vmstat -S shows you how many "revolutions" the clock hands

have made. If you find the system spinning the clock hands you may be

working too hard to free too few pages of memory.

Some tuning may help for large, scientific applications that have peculiar or

very well-understood memory traffic patterns. Sequential access, for example,

benefits from faster "free behind"

Servers (systems doing lots of filesystem I/O) should set fastscan large (131072

[8KB] pages = 1GB/second)
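
Following the note above, one possible /etc/system entry for a filesystem-heavy server (illustrative values, not universal recommendations; /etc/system changes take effect at the next reboot):

set fastscan=131072          * scan up to 1 GB/second with 8 KB pages
set handspreadpages=131072   * widen the LRU window to match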

Page 57: DocumentM1

Version 3.10 LISA 2006 57

©1994-2006 Hal Stern, Marc Staveley

Activity Symptoms

· Scan rate (sr), free rate (fr)

- Progress made by pagedaemon

· Pageouts (po)

- Page kicked out of memory pool, file write

· Pagein (pi)

- Page fault, filled from text/swap, file read

· Reclaim (re)

- Waiting to go to disk, brought back

· Attach (at)

- Found page already in cache (shared libraries)

If you see the scan rate (sr) and the free rate (fr) about equal, this means the

virtual memory system is releasing pages as fast as it's scanning them. Most

probably, the least-recently used algorithm has degenerated into "last scanned",

meaning that tuning the handspread or the scan rates may improve the page

selection process.

Page 58: DocumentM1

Version 3.10 LISA 2006 58

©1994-2006 Hal Stern, Marc Staveley

VM Problems

· Basic indicator: scanning and freeing

- Page in/out could be filesystem activity

· Swapping

- Large memory processes thrashing?

· Attaches/reclaims

- open/read/close loops on same file

· Kernel memory exhaustion

- sar -k 1 to observe

- lotsfree too close to minfree

- Will drop packets or cause malloc() failures

Chris Drake and Kimberley Brown's "Panic!" is a great reference, including a

host of kernel monitoring and sampling scripts.

Page 59: DocumentM1

Version 3.10 LISA 2006 59

©1994-2006 Hal Stern, Marc Staveley

Other Tunables

· maxpgio

- # swap disks * 40 (Solaris <= 2.6)

- # swap disks * 60 (Solaris == 9)

- # swap disks * 90 (Solaris >= 10)

· maxslp

- Solaris < 2.6: deadwood timer, 20 seconds; set to 0xff to disable pre-emptive swapping

- Solaris >= 2.6: swap out processes sleeping for more than maxslp seconds (20) if avefree < desfree

Tuning these values produces the best returns for your effort.

maxpgio (assumes one operation per revolution * 2/3)

# swap disks * 40 for 3,600 RPM disks

# swap disks * 80 for 7,200 RPM disks

# swap disks * 110 for 10,000 RPM disks

# swap disks * 167 for 15,000 RPM disks

[ 2/3 of the revolutions/second]

maxslp gained an additional meaning between Solaris 2.5.1 and 2.6: it is also the amount of time a process must be swapped out before it is considered a candidate to be swapped back in under low-memory conditions.
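
For example, using the multipliers in the note above for a hypothetical system with two 7,200 RPM swap disks, you might put in /etc/system:

set maxpgio=160    * 2 swap disks * 80 for 7,200 RPM spindles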

Page 60: DocumentM1

Version 3.10 LISA 2006 60

©1994-2006 Hal Stern, Marc Staveley

VM Diagnostics

· Add memory for fileservers

- Improve file cache hit rate

- Calculate expected/physical I/O rates

· Add memory for DBMS servers

- Configure DBMS to use it in cache

- Watch DBMS statistics for use/thrashing

- 100-300M is typical high water mark

· Add memory to eliminate scanning

Page 61: DocumentM1

Version 3.10 LISA 2006 61

©1994-2006 Hal Stern, Marc Staveley

Memory Mapped Files

· mmap() maps open file into address space

- Replaces open(), malloc(), read() cycles

- Improves memory profile for read-only data

- Used for text segments and shared data segments

· Mapped files page to/from the underlying filesystem

- Text segments paged from NFS server?

- Data files paged over network from server?

· When network performance matters...

- Use shared memory segments, paged locally

NFS-mounted executables produce sometimes unwanted effects due to the way

mmap() works over the network. When you start a Unix process (in SunOS 4.x,

or any SystemV.4/Solaris system), the executable is mapped into memory using

mmap() -- not copied into memory as in earlier BSD days. Once the executable

pages are loaded, you won't notice much difference, but if you free the pages

containing the text segment (due to paging/swapping), you're going to re-read

the data over the wire, not from the local swap device.
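
To see which files a process actually has mapped (and therefore where its text pages will be re-read from), the Solaris pmap command is handy; a sketch with a made-up PID:

% pmap -x 4721 | more    # per-mapping resident/shared breakdown, including mapped file names

An NFS path in the mapping list means those pages come back over the wire after they are freed.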

Page 62: DocumentM1

Version 3.10 LISA 2006 62

©1994-2006 Hal Stern, Marc Staveley

New VM System (Solaris >= 8)

· Page scanner is a bottleneck for the future

- new hardware supports > 512 GB of memory: 64M or more pages to scan!

· File system pressure on the VM

- high filesystem load depletes free memory list

- resulting high scan rates make applications suffer from excessive page steals

- A server with heavy I/O pages against itself!

· Priority paging (new scanner) is not enough

· Cyclic Page Cache is the current answer

- separate pool for regular file pages

- fs flush daemon becomes fs cache daemon

Page 63: DocumentM1

Version 3.10 LISA 2006 63

©1994-2006 Hal Stern, Marc Staveley

Disk I/O

Section 2.3

Page 64: DocumentM1

Version 3.10 LISA 2006 64

©1994-2006 Hal Stern, Marc Staveley

Disk Activity

· Paging and swapping

- Memory shortfalls

· Database requests

- Lookups, log writes, index operations

· Fileserver activity

- Read, read-ahead, write requests

Page 65: DocumentM1

Version 3.10 LISA 2006 65

©1994-2006 Hal Stern, Marc Staveley

Disk Problems

· Unbalanced activity

- "Hot spot" contention

· Unnecessary activity

- Hit disk instead of memory

· Disks and networks are sources of greatest

gains in tuning

Page 66: DocumentM1

Version 3.10 LISA 2006 66

©1994-2006 Hal Stern, Marc Staveley

Diagnosing Disk Problems

· iostat -D: disk ops/second

% iostat -D 5
          sd0           sd1
rps wps util  rps wps util
  8   0 22.0   40   0 90.0

- Look for excessive number of ops/disk

- Unbalanced across disks?

· iostat -x: service time (svc_t)

- Long service times (>100 msec) imply queues

- Similar to disk imbalance

- Could be disk overload (20-40 msec)

The typical seek/rotational delays on a disk are 8-15 msec. A typical transfer

takes about 20 msec. If the disk service times are consistently around 20 msec,

the disk is almost always busy. When the service times go over 20 msec, it

means that requests are piling up on the spindle: an average service time of 50

msec means that the queue is about 2.5 requests (50/20) long.

Note that for low I/O volumes, the service times are likely to be inaccurate and

on the high side. Use the service times as a thermometer for disk activity when

you're seeing a steady 10 I/O operations (iops) per second or more.
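
A sketch of the extended form; the -n flag (descriptive device names instead of sd numbers) exists on newer Solaris releases:

% iostat -xn 5

Watch the per-device service-time and percent-busy columns against the 20-100 msec rules of thumb above, and compare devices to spot imbalance.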

Page 67: DocumentM1

Version 3.10 LISA 2006 67

©1994-2006 Hal Stern, Marc Staveley

Disk Basics

· Physical things

· Disk performance

- Sequential transfer rate: 5 - 40 MBytes/s

- Theoretical max: nsect * 512 * rpm / 60

- 50-100 operations/s random access

- 6-12 msec seek, 3-6 msec rotational delay

- Track-to-track versus long seeks

· Seek/rotational delays

- Access inefficiencies

While nsect * 512 * rpm tells you how fast the spinning disk platter can deliver

data, it's not completely accurate for the zone-bit recorded (ZBR) disks that are

common today. ZBR SCSI disks fudge the nsect value in the disk description, providing only an average number of sectors per cylinder. In reality, the

first 70% of the disk is quite fast and the last 30% has a lower transfer rate.
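
A worked instance of the formula above, with illustrative (made-up) geometry:

nsect = 300 sectors/track, rpm = 10,000
300 * 512 * 10000 / 60 = 25,600,000 bytes/s, roughly 25 MBytes/s

On a ZBR disk, treat this as an average across zones rather than a constant rate.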

Page 68: DocumentM1

Version 3.10 LISA 2006 68

©1994-2006 Hal Stern, Marc Staveley

SCSI Bus Basics

· SCSI 1 (5MHz clock rate)

- 8, 16-bit (wide), or 32-bit (fat)

- Synchronous operation yields 5 Mbyte/sec

· SCSI 2 - Fast (10MHz clock rate)

- 10 Mbytes/s with 8-bit bus

- 20 Mbytes/s with 16-bit (wide) bus

· Ultra (20MHz clock rate)

- Ultra/wide = 40MB/sec

· Ultra 2 (40MHz clock rate)

· Ultra 3 (80MHz clock rate)

If devices from different standards exist on the same SCSI bus then the clock rate

of all devices is the clock rate of the slowest device.

Ultra 3 is sometimes called Ultra 160.

Page 69: DocumentM1

Version 3.10 LISA 2006 69

©1994-2006 Hal Stern, Marc Staveley

SCSI Cabling Basics

· Single Ended

- 6m for SCSI 1

- 3m for SCSI 2

- 3m for Ultra up to 4 devices

- 1.5m for Ultra > 4 devices

· Differential

- 25m cabling

· Low Voltage Differential (LVD)

- 12m cabling

- used by Ultra 2 and 3

Differential signaling is used to suppress noise over long distances. If you ask a

friend to signal you with a lantern, it's easy to distinguish high (1) from low (0).

If the friend is now standing on a boat, which introduces noise (waves), it's

much harder to differentiate high and low. Instead, give your friend two

lanterns, and define "high" as "lanterns apart" and "low" as "lanterns together".

The noise affects both lanterns, but measuring the difference between them removes the noise from the resulting signal.

If Single Ended and LVD exist on the same bus then the cabling lengths are the

minimum of the two.

Page 70: DocumentM1

Version 3.10 LISA 2006 70

©1994-2006 Hal Stern, Marc Staveley

Fibre Channel and iSCSI

· Industry standard at the frame level

- FC-AL: fiber channel arbitrated loop

- 100 Mbytes/sec typical

- Use switches and daisy chains to build storage networks

· Vendors layer SCSI protocol on top

- SCSI disk physics still apply

- But you can pack a lot of disks on the fiber

· Ditto iSCSI over GigE

Page 71: DocumentM1

Version 3.10 LISA 2006 71

©1994-2006 Hal Stern, Marc Staveley

The I/O Bottleneck

· When can't a 72GB disk hold a 500MB DB?

- When you need more than 100 I/Os per second

· How do you get > 40MByte/s file access?

- Gang disks together to "add" transfer rates

· Key info nugget #1: Access pattern

- Sequential or random, read-only or read-write

· Key info nugget #2: Access size

- 2K-8K DBMS, but varies widely

- 8K NFS v2, 32K NFS v3

- 4K-64K filesystem

Realize that when you're bound by random I/O rates, you're not moving that

much data -- the bottleneck is the physical disk arm moving across the disk

surface.

At 100 I/O operations/sec, and 8 KBytes/operations, a SCSI disk moves only

800 KBytes/sec at maximum random I/O load.

The same disk will source 40 MBytes/sec in sequential access mode, where the

disk speed and interface are the limiting factors.

Page 72: DocumentM1

Version 3.10 LISA 2006 72

©1994-2006 Hal Stern, Marc Staveley

Disk Striping

· Combine multiple disks into single logical disk

with new properties

- Better transfer rate

- Better average seek time

- Large capacity

· Terminology

- Block size: chunk of data on each disk in stripe

- Interleave: number of disks in stripe

- Stripe size: block size * interleave

Page 73: DocumentM1

Version 3.10 LISA 2006 73

©1994-2006 Hal Stern, Marc Staveley

Volume Management

· Striping done at physical (raw) level

- Run raw access processes on stripe (DBMS)

- Can build filesystem on volume, after striping

- Host (SW) or disk array (HW) solutions

· Some DBMSs do striping internally

· Bottleneck: multiple writes

- Stripe over multiple controllers, SCSI busses

Page 74: DocumentM1

Version 3.10 LISA 2006 74

©1994-2006 Hal Stern, Marc Staveley

Striping For Sequential I/O

· Each request hits all disks in parallel

· Add transfer rates to "lock heads"

· Block size = access size/interleave

· Examples:

- 64K filesystem access, 4 disks, 16K/disk

- 8K filesystem access, 8 disks, 1K/disk

· Can get 3-3.5x single disk

- On a 4-6 way stripe
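
As a sketch, the first example on the slide above (64K accesses over 4 disks, 16K per disk) might be built with Solaris Volume Manager / Solstice DiskSuite, using hypothetical device names; check metainit(1M):

# metainit d10 1 4 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 -i 16k

The -i value is the interlace (block size per disk); "1 4" means one stripe that is four components wide.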

Page 75: DocumentM1

Version 3.10 LISA 2006 75

©1994-2006 Hal Stern, Marc Staveley

Striping For Random I/O

· Each request should hit a different disk

· Random requests use all disks

- Force scattering of I/O

· Reduce average seek time with "independent

heads"

· Block size = access size

· Examples:

- 8K NFS access on 6 disks, 48K stripe size

- 2K DBMS access on 4 disks, 8K stripe size

Page 76: DocumentM1

Version 3.10 LISA 2006 76

©1994-2006 Hal Stern, Marc Staveley

Transaction Modeling

· Types: read, write, modify, insert

· Meta data structure impact

- Filesystem structures: inodes, cylinder groups, indirect blocks

- Logs and indexes for DBMS

Insert operation is R-M-W on index, W on data, W on log

Insert/update on DBMS touches data, index, log

Page 77: DocumentM1

Version 3.10 LISA 2006 77

©1994-2006 Hal Stern, Marc Staveley

Cache Effects

· Not every logical write I/O hits disk

- DB write clustering

- NFS, UFS dirty page clustering

- Hardware arrays may cache operations

· Reads can be cached

- DB page/block cache (Oracle SGA, e.g.)

- File/data caching in memory

· Locality of reference

- Cache can help or hurt performance

Page 78: DocumentM1

Version 3.10 LISA 2006 78

©1994-2006 Hal Stern, Marc Staveley

Simple DBMS Example

· Medium sized database on a busy day

- 200 users, 8 Gbyte database, 1 request/10 sec

- 50% updates, 20% inserts, 30% lookups, 4 tables, 1 index on each

· Disk I/O rate calculation

- 0.5 * 4 (update) + 0.2 * 3 (insert) + 0.3 * 2 (lookup) = 3.2 I/Os per table

- 12.8 I/O per transaction, ~10 with caching?

· Arrival rate

- 200 users * 1 / 10 secs = 20/sec

- Demand: 200 I/Os/sec, peak to 220 or more

The sample disk I/O rates are derived as follows:

Updates have to do a read, an update to an index and an update to a data block,

as well as a log write (4 transactions)

Inserts do an index and data block write, and a log write (3 transactions)

Lookups read from the index and data blocks (2 transactions)

Page 79: DocumentM1

Version 3.10 LISA 2006 79

©1994-2006 Hal Stern, Marc Staveley

Haste Needs Waste

· Using a single disk is a disaster

- Disk can only do 50-60 op/s, response time ~ 10/s

· 4 disks barely do the job

- Provides 200-240 I/Os/sec

- DBMS uses 90% of I/O rate capacity

· 6 disks would be better

- Waste most of the available space

Page 80: DocumentM1

Version 3.10 LISA 2006 80

©1994-2006 Hal Stern, Marc Staveley

Filesystem Optimization

Section 2.4

Page 81: DocumentM1

Version 3.10 LISA 2006 81

©1994-2006 Hal Stern, Marc Staveley

UNIX Filesystem

· Filesystem construction

- Each file identified by inode

- Inode holds permissions, modification/access times

- Points to 12 direct (data) blocks and indirect blocks

· Indirect block contains block pointers to data

blocks

· Double indirect blocks contain pointers to

blocks that contain pointers to data blocks

Page 82: DocumentM1

Version 3.10 LISA 2006 82

©1994-2006 Hal Stern, Marc Staveley

UFS Inode

(Diagram: a UFS inode holds the mode, times, owners and other metadata, 12 direct block pointers to data blocks, an indirect block pointer and a double indirect block pointer. An indirect block holds 2048 slots pointing to data blocks; a double indirect block holds 2048 slots pointing to further indirect blocks, each of which points to data blocks.)

- Direct blocks up to 100 KBytes

- Indirect blocks up to 100 MBytes

- Double indirect blocks up to 1 TByte

Page 83: DocumentM1

Version 3.10 LISA 2006 83

©1994-2006 Hal Stern, Marc Staveley

Filesystem Mechanics

· Inodes are small and of fixed size

- Fast access, easy to find

· File writes flushed every 30 seconds

- Sync or update daemon

- UNIX writes are asynchronous to process

- Watch for large bursts of disk activity

· Filesystem metadata

- Create redundancy for repair after crash

- Cylinder groups, free lists, inode pointers

- fsck: scan every block for "rollback"

The fact that write() doesn't complete synchronously can cause bizarre failures.

Most code doesn't check the value of errno after a close(), but it should. Any

outstanding writes are completed synchronously when close() is called.

If any errors occurred during those writes, the error is reported back through

close(). This can cause a variety of problems when quotas are exceeded or disks

fill up (over NFS, where the server notices the disk full condition).

More details: SunWorld Online, System Administration, October 1995

http://www.sun.com/sunworldonline

Page 84: DocumentM1

Version 3.10 LISA 2006 84

©1994-2006 Hal Stern, Marc Staveley

The BSD Fast Filesystem

· Original UNIX filesystem

- All inodes at the beginning of the disk

- open() followed by read() always seeks

· BSD FFS improvements

- Cylinder groups keep inodes and data together

- Block placement strategy minimizes rotational delays

- Inode/cylinder group ratio governs file density

· Minfree: default 10%, safe to use 1% on 1+G

disks

McKusick, Leffler, Quarterman and Karels, "Design and Implementation of the

4.3 BSD Operating System"

mkfs and newfs always look at the # bytes per inode parameter (fixed). To

change the inode density, you need to change the number of cylinders in a

group by adjusting the number of sectors/track.

Filesystems for large files (like CAD parts files) usually have more bytes per

inode; filesystems for newsgroups should have fewer bytes per inode (with the

exception of the filesystem for alt.binaries.*)

Page 85: DocumentM1

Version 3.10 LISA 2006 85

©1994-2006 Hal Stern, Marc Staveley

Fragmentation & Seeks

· Fragments occur in last block of file

- Frequently less than 1% internal fragmentation

- 10% free space reduces external fragmentation

· Block placement strategy breaks down

- Avoid filling disk to > 90-95% of capacity

- Introduces rotational delays

· File ordering affects performance

- Seeking across large disk for related files

Page 86: DocumentM1

Version 3.10 LISA 2006 86

©1994-2006 Hal Stern, Marc Staveley

Large Files

· Reading

- Read inode, indirect block, double indirect block, data block

- Sequential access should do read-ahead

· Writing

- Update inode, (double) indirect, data blocks

- Can be up to 4 read-modify-write operations

· Large buffer sizes are more efficient

- Single access for "window" of metadata

Page 87: DocumentM1

Version 3.10 LISA 2006 87

©1994-2006 Hal Stern, Marc Staveley

Turbocharging Tricks

· Striping

· Journaling (logging)

- Write meta data updates to log, like DBMS

- Eliminate fsck, simply replay log

- Ideal for synchronous writes, large files

- logging option (Solaris >= 7)

· Extents

- Cluster blocks together and do read-ahead

- Eliminate more rotational delays

- Can add 2-3x performance improvement

McVoy and Kleiman, "Extent-like Performance From The UNIX Filesystem",

Winter USENIX Proceedings, 1991.

Linux also has the ext2 filesystem, which has its own block-placement policies.

Journaling and logging are often used interchangeably. Logging filesystems and

log-based filesystems, however, are not the same thing. A logging filesystem

bolts a log device onto the UNIX filesystem to accelerate writes and recovery. A

log-based filesystem is a new (non-BSD FFS) structure, based on a log of write

records. There is a long and exacting description of the differences in Margo

Seltzer's PhD thesis from UC-Berkeley.
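
Enabling the Solaris (>= 7) UFS logging option is a mount-time switch; a sketch with a hypothetical device:

# mount -F ufs -o logging /dev/dsk/c0t2d0s6 /export/home

or add logging to the options field of the filesystem's /etc/vfstab entry so it survives reboots.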

Page 88: DocumentM1

Version 3.10 LISA 2006 88

©1994-2006 Hal Stern, Marc Staveley

Access Patterns

· Watch actual processes at work

- What are they doing?

nfswatch: NFS operations on the wire

truss (Solaris, SysV.4), strace (Linux, HP-UX), ktrace (*BSD)

dtrace (Solaris >= 10)

· Application write size should match filesystem

block size.

· Use a Filesystem benchmark

- Are the disks well balanced, is the filesystem well tuned?

filebench, bonnie

More details on using these tools: SunWorld Online, System Administration,

September 1995

http://www.sun.com/sunworldonline

Don't use process tracing for performance-sensitive issues, because turning on

system call trapping (used by the strace/truss facility) slows the process down

to a snail's pace.

Solaris Dtrace (Solaris >= 10) is more light weight.

Bonnie (http://www.textuality.com/bonnie) is a good all-round Unix

filesystem benchmark tool

Filebench is an extensible system to simulate many different types of workloads:

http://sourceforge.net/projects/filebench/

http://www.opensolaris.org/os/community/performance/filebench/
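
For a one-shot look at what a process is doing, a syscall-count summary is much cheaper than a full trace; a sketch with a made-up PID:

% truss -c -p 4721    # attach, then interrupt with Ctrl-C to print per-syscall counts and times

Even the counting form slows the target, so keep the attachment short on production processes.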

Page 89: DocumentM1

Version 3.10 LISA 2006 89

©1994-2006 Hal Stern, Marc Staveley

Resource Optimization

· Optimize disk volumes by type of work

- Sequential versus random access filesystems

- Read-only versus read-write data

· Eliminate synchronous writes

- File locking or semaphores more efficient

- Journaling filesystem faster

· Watch use of symbolic links

- Often causes disk read to get link target

· Don't update the file access time for read-only

volumes

Don't update the file access time (for news and mail spools, etc.)

-o noatime

Delay updating file access time (Solaris >= 9)

-o dfratime
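
A sketch of a vfstab entry for a read-mostly spool with access-time updates disabled (hypothetical devices and mount point):

/dev/dsk/c0t1d0s6  /dev/rdsk/c0t1d0s6  /var/spool/news  ufs  2  yes  noatime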

Page 90: DocumentM1

Version 3.10 LISA 2006 90

©1994-2006 Hal Stern, Marc Staveley

Non-Volatile Memory

· Battery backed memory

- RAM in disk array controller (array cache)

- disk cache

· Synchronous writes at memory write speed

Page 91: DocumentM1

Version 3.10 LISA 2006 91

©1994-2006 Hal Stern, Marc Staveley

Inode Cache

· Inode cache for metadata only

- Data blocks cached in VM or buffer pool

· Buffer pool for inode transit

  - vmstat -b

  - sar -b 5 10

- Watch %rcache (read cache) hit rate

- Lower rate means more disk I/O for inodes

· Set high water mark

  - set bufhwm=8000 in Solaris /etc/system
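
A quick way to check the hit rate and raise the limit (the 8000 KB figure is just the slide's example, not a recommendation):

# watch the buffer cache; %rcache much below 90 means extra disk I/O for inodes
sar -b 5 10

# /etc/system (Solaris), takes effect at the next reboot
set bufhwm=8000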

Page 92: DocumentM1

Version 3.10 LISA 2006 92

©1994-2006 Hal Stern, Marc Staveley

Directory Name Lookup Cache

· Name to inode mapping cache

· Must be large for file/user server

- Low hit rate causes disk I/O to read directories

· vmstat -s to observe

- Aim for > 90% hit rate

· Causes of low hit rates:

- File creation automatically misses

- Names > {14,32} characters not inserted

  - Long names are not efficient

Solaris >= 2.6

- uses the ncsize parameter to set the DNLC size.

- handles long filenames in the DNLC

Solaris >= 8

- can use the kstat -n dnlcstats command to determine how well the DNLC is doing
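
For example (the ncsize value is illustrative, not a recommendation):

# DNLC hit rate since boot
vmstat -s | grep 'name lookups'

# detailed DNLC counters (Solaris >= 8)
kstat -n dnlcstats

# enlarge the cache on a busy file server -- /etc/system, reboot required
set ncsize=65536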

Page 93: DocumentM1

Version 3.10 LISA 2006 93

©1994-2006 Hal Stern, Marc Staveley

Filesystem Replication

· Replicate popular read-only data

- Automounter or "workgroups" to segregate access

- Define update and distribution policies

· 200 coders chasing 4 class libraries

- Replicate libraries to increase bandwidth

· Hard to synchronize writeable data

- Similar to DBMS 2-phase commit problem

- Andrew filesystem (AFS)

- Stratus/ISIS rNFS, Uniq UPFS from Veritas

The ISIS Reliable NFS product is now owned by Stratus Computer,

Marlborough MA

Uniq Consulting Corp has a similar product that does N-way mirroring of NFS

volumes. Contact Kevin Sheehan at [email protected], or your local Veritas

sales rep, since Veritas is now reselling (and supporting) this product

Page 94: DocumentM1

Version 3.10 LISA 2006 94

©1994-2006 Hal Stern, Marc Staveley

Depth vs. Breadth

· Avoid large files if possible

- Break large files into smaller chunks

- Don't backup a 200M file for a 3-byte change

- Files > 100M require multiple I/Os per operation

· Directory search is linear

- Avoid broad directories

· Name lookup is per-component

- Avoid deep directories

- Use hash tables if required

Page 95: DocumentM1

Version 3.10 LISA 2006 95

©1994-2006 Hal Stern, Marc Staveley

Tuning Flush Rates

· Dirty buffers flushed every 30 seconds

- Causes large disk I/O burst

- May overload single disk

· Balance load if requests < 30s apart

- Generic update daemon:

  while :; do sync; sync; sleep 15; done

· Solaris tunables

- autoup: time to cover all memory

- tune_t_fsflushr: rate to flush

autoup (default 30 seconds) is the oldest a dirty buffer can get before it is flushed; tune_t_fsflushr is the interval at which the flush daemon (fsflush) runs.

All of memory will be covered in autoup seconds.

tune_t_fsflushr/autoup is the fraction of memory covered by each pass of the flush daemon.

Increase autoup, or cut the flush rate, to space out the bursts

Extremely large disk service times (in excess of 100 msec) can be caused by large

bursts from the flush daemon causing a long disk queue. If the filesystem flush

sends 20 requests to a single disk, it's likely there will be some seeking between

writes, so the 20 requests will average 20 msec each to complete. Since all disk

requests are scheduled in a single pass by fsflush, the service time for the last

one will be nearly 400 msec, while the first few will finish in around 20 msec,

yielding an average service time of 200 msec!
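
A sketch of spreading the bursts out via /etc/system (the values are illustrative; measure before and after changing them):

* cover all of memory in 240 seconds, waking fsflush every 5 seconds
set autoup=240
set tune_t_fsflushr=5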

Page 96: DocumentM1

Version 3.10 LISA 2006 96

©1994-2006 Hal Stern, Marc Staveley

Backups & Redundancy

Section 2.5

Page 97: DocumentM1

Version 3.10 LISA 2006 97

©1994-2006 Hal Stern, Marc Staveley

Questions of Integrity

· Backups are total loss insurance

- Lose a disk

- Lose a brain: egregious rm *

· Disk integrity is inter-backup insurance

- Preserve data from high-update environment

- Time to restore backup is unacceptable

- Doesn't help with intra-day deletes

· Disaster recovery is a separate field

Page 98: DocumentM1

Version 3.10 LISA 2006 98

©1994-2006 Hal Stern, Marc Staveley

Disk Redundancy

· Snapshots

- Copy data to another disk or machine

- tar, dump, rdist, rsync

- Prone to failure, network load problems

· Disk mirroring (RAID 1)

- Highest level of reliability and cost

- Some small performance gains

· RAID arrays (RAID 5 and others)

- Cost/performance issues

- VLDB byte capacity

RAID = Redundant Array of Inexpensive Disks.

When the RAID levels were created (at UC-Berkeley), the popular disk format

was SMD (as in Storage Modular Device, not Surface Mounted Device).

10" platters weighed nearly 100 pounds and held 500 MB, while SCSI disks

topped out at 70 MB but cost significantly less (and were easier to lift and install)

Page 99: DocumentM1

Version 3.10 LISA 2006 99

©1994-2006 Hal Stern, Marc Staveley

RAID 1: Mirrored Disks

· 100% data redundancy

- Safest, most reliable

- Historically rejected due to disk count, cost

· Best performance (of all RAID types)

- Round-robin or geometric reads: like striping

- Writes at 5-10% hit

· Combine mirroring and striping

- Stripe mirrors (1+0) to survive interleave failures

- Mirror stripes (0+1) for safety with minimal overhead

RAID 0 = striping

Few systems can do 1+0

1+0 allows multi-disk failures as long as at least one mirror disk per stripe

survives.

Page 100: DocumentM1
Page 101: DocumentM1

Version 3.10 LISA 2006 101

©1994-2006 Hal Stern, Marc Staveley

RAID 5: Parity Disks

· Stripe parity and data over disks

· No single "parity hot spot"

· Performance degrades with more writes

- R-M-W on parity disk cuts 60%

- Similar to single-disk for reads

· Ideal for large DSS/DW databases

- If size >> performance, RAID 5 wins

- Best path to large, safe disk farm

· 20-40% cost savings

Page 102: DocumentM1

Version 3.10 LISA 2006 102

©1994-2006 Hal Stern, Marc Staveley

RAID 5 Tuning

· Tunables

- Array width (interleave) - sometimes

- Block size - required

· Count parity operations in I/O demand

- Read = 1 I/O

- Write = 4 I/O

· Ensure parity data is not a bottleneck

- Parity reads and writes, averaged across the array, are still subject to the per-disk 50-60 IOP/second limit

RAID 5 write:

- read original block

- read parity block

- xor original block with parity block

- xor new block with parity block

- write new block

- write parity block
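
A back-of-the-envelope check of RAID 5 I/O demand, using made-up workload numbers (700 reads/s, 300 writes/s, 6 disks):

reads=700 writes=300
echo "back-end IOPS = $((reads + 4 * writes))"          # 700 + 1200 = 1900
echo "per-disk IOPS = $(((reads + 4 * writes) / 6))"    # ~316, far above the 50-60 IOP/second guideline

A result like this says the workload needs more spindles, NVRAM in front of the array, or a different RAID level.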

Page 103: DocumentM1

Version 3.10 LISA 2006 103

©1994-2006 Hal Stern, Marc Staveley

Backup Performance

· Derive rough transfer rate needs

- 100 GB/hour = 30 MB/second

- 5MB/s for DLT, 10MB/s for SDLT

- 15MB/s for LTO, 35MB/s for LTO-2, 60MB/s for LTO-3

- 6MB/s for AIT, 24MB/s for AIT-4

- 80MB/sec over quiet Ethernet (GigE)

· Multiple devices increase transfer rate

- Stackers grow volume

- Drives increase bulk transfer rate

- Careful of “shoe shining”

When designing the backup system, also take into consideration the length of

time you must keep the data around. Some industries, such as financial

services, require at least a 7 year history of data for SEC or arbitration hearings.

Drug companies and health-care firms must keep patient data near-line until the

patient dies; if a drug pre-dated a patient by 5 years then you're looking at the

better part of a century.

Media types in vogue today decay. Magnetic media loses its bits; CD-ROMs

may decay after a long storage period. How will you read your backups in the

future? If you've struggled with 1600bpi tapes lately you know the feeling of

having data in your hand that's not convertible into on-line form.

Final warning: dump isn't portable! If you change vendors, make sure you can

dump and reload your data.
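
A quick feasibility check (the 2 TB size and 8-hour window are assumed numbers):

# 2 TB in an 8-hour backup window
awk 'BEGIN { printf "required rate = %.0f MB/s\n", 2 * 1024 * 1024 / (8 * 3600) }'
# about 73 MB/s: more than one LTO-3 drive (60 MB/s each) running in parallel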

Page 104: DocumentM1

Version 3.10 LISA 2006 104

©1994-2006 Hal Stern, Marc Staveley

Backup to Disk

· Rdiff-backup

- incremental backups to disk with easy restore

· BackupPC

- incremental and full backups to disk with a web front end for scheduling and restores

- good for backing up MS Windows clients to a Unix server

· Snapshots

· Offsite replicas

Is this familiar:

- Secure the individual systems

- Run aggressive password checkers

- Restrict NFS, or use NFS with Kerberos or DCE/DFS to encrypt file access

- Prevent network logins in the clear (use ssh)

- BUT: do backups over the network! Exposing the data over the network

during the backup un-does much of the effort in the other precautions.

Page 105: DocumentM1

Version 3.10 LISA 2006 105

©1994-2006 Hal Stern, Marc Staveley

NFS Performance Tuning

Section 3

Page 106: DocumentM1

Version 3.10 LISA 2006 106

©1994-2006 Hal Stern, Marc Staveley

Topics

· NFS internals

· Diagnosing server problems

· Client improvements

- Client-side caching & tuning

- NFS over WANs

Page 107: DocumentM1

Version 3.10 LISA 2006 107

©1994-2006 Hal Stern, Marc Staveley

NFS Internals

Section 3.1

Page 108: DocumentM1

Version 3.10 LISA 2006 108

©1994-2006 Hal Stern, Marc Staveley

NFS Request Execution

stat()

getattr()

nfs_getattr()

kernel RPC

port 2049 hardcoded

nfsd

getattr()

UFS stat()

HSFS stat()

% ls -l

Page 109: DocumentM1

Version 3.10 LISA 2006 109

©1994-2006 Hal Stern, Marc Staveley

NFS Characterization

· Small and large operations

- Small: getattr, lookup, readdir

- Large: read/write, create, readdir

· Response time matters

- Clamped at 50 msec for "reasonable" server

- Users notice 20 msec to 50 msec dropoff

· Scalability is still a concern

- Usually network limited, hard to reach capacity

- Flat response time is best measure

- Client-side demand management

Page 110: DocumentM1

Version 3.10 LISA 2006 110

©1994-2006 Hal Stern, Marc Staveley

NFS over TCP

· NFS/TCP is a win for:

- Wide-area networks, with higher bit error rates

- Routed networks

- Data-transfer oriented environments

- Large MTU networks, like GigE with jumbo frames

· Advantages

- Better error recovery, without complete retransmit

- Fewer retransmissions and duplicate requests

· Disadvantage

- Connection setup at mount time

Page 111: DocumentM1

Version 3.10 LISA 2006 111

©1994-2006 Hal Stern, Marc Staveley

NFS Version 3

· Improved cache consistency

- Attributes returned with most calls

- "access" RPC mimics permission checking of local system open() call

· Improved correctness with NFS/TCP

· Performance enhancements

- Asynchronous write operations, with logging

- Larger buffer sizes, up to 32KBytes

NFS v3 uses a 64-byte (not bit) file handle, with the actual size used per mount

negotiated between the client and server.

Page 112: DocumentM1

Version 3.10 LISA 2006 112

©1994-2006 Hal Stern, Marc Staveley

Diagnosing Server Problems

Section 3.2

Page 113: DocumentM1

Version 3.10 LISA 2006 113

©1994-2006 Hal Stern, Marc Staveley

Indicators

· Usual server tuning applies

- Don't worry about CPU utilization

- Client response time is early warning system

- Some NFS specific details

· Server isn't always the limiting factor

- Typical Ethernet supports 300-350 LADDIS ops

- To get 2,000 LADDIS: 7-8 Ethernets

LADDIS stands for Legato, Auspex, Digital, Data General, Interphase and Sun,

the 6 companies that helped produce the SPEC standard for NFS benchmarks.

LADDIS is now formally known as SPEC NFS and is reported as a number of

ops/sec, at 50 msec response time or less.

More info: [email protected]

Keith, Bruce. LADDIS: The Next Generation in NFS Server Benchmarking. spec

newsletter. March 1993. Volume 5, Issue 1.

Watson, Andy, et.al. LADDIS: A Multi-Vendor and Vendor-Neutral SPEC

NFS Benchmark. Proceedings of the LISA VI Conference, October 1992. pp. 17-

32.

Wittle, Mark, and Bruce Keith. LADDIS: The Next Generation in NFS File

Server Benchmarking. Proceedings of the Summer 1993 USENIX Conference.

July 1993. pp. 111-128.

Page 114: DocumentM1

Version 3.10 LISA 2006 114

©1994-2006 Hal Stern, Marc Staveley

Request Queue Depth

· nfsd daemons/threads

- One request per nfsd daemon

- Lack of nfsds makes server drop requests

- May show up as UDP socket overflows (netstat -s)

· Guidelines

- Daemons: 24-32 per server, more for many disks

- Kernel threads (Solaris): 500-2000

  - No penalty for setting this high (threads are created on demand)

- Add more for browsing environment
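
To see whether the server is short of nfsds, look for UDP overflows; the thread count shown is only an example:

# a non-zero, growing count means requests are being dropped
netstat -s | grep -i udpInOverflows

# Solaris: nfsd takes the maximum number of concurrent requests as an argument,
# e.g. /usr/lib/nfs/nfsd -a 1024   (check your release's startup script for the exact syntax)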

Page 115: DocumentM1

Version 3.10 LISA 2006 115

©1994-2006 Hal Stern, Marc Staveley

Attribute Hammering

· Use nfsstat -s to view server statistics

· getattr > 40%

- Increase client attribute cache lifetime

- Consolidate read-only filesystems

· readlink > 5%

- Replace links with mount points

· writes > 5%

- NVRAM situation

% nfsstat -s

null getattr setattr root lookup readlink read

32 0% 527178 33% 9288 0% 0 0% 449726 28% 189466 12% 188665 15%

wrcache write create remove rename link symlink

0 0% 134797 8% 13799 0% 15826 1% 2725 0% 4388 0% 74 0%

mkdir rmdir readdir fsstat

1575 0% 1532 0% 23898 1% 242 0%

On an NFS V3 client, you'll see entries for cached writes, access calls, and other

extended RPC types.

Page 116: DocumentM1

Version 3.10 LISA 2006 116

©1994-2006 Hal Stern, Marc Staveley

Transfer Oriented Environments

· Ensure adequate CPU

- 1 CPU per 3-4 100BaseT Ethernets

- 1 CPU per 1.5 ATM networks at 155 Mb/s

- 1 CPU per 1000BaseT Ethernet (GigE)

· Disk balancing is critical

- Optimize for random I/O workload

· Large memory may not help

- What is working set/file lifecycle?

Page 117: DocumentM1

Version 3.10 LISA 2006 117

©1994-2006 Hal Stern, Marc Staveley

Client Improvements

Section 3.3

Page 118: DocumentM1

Version 3.10 LISA 2006 118

©1994-2006 Hal Stern, Marc Staveley

Client Tuning Overview

· Eliminate end to end problems

- Request timeouts are call to action

- 700 msec timeout versus 50 msec "pain" level

· Reduce demand with improved caching

· Adjust for line speed

- < Ethernet links

- Uncontrollable congestion

- Routers or multiple hops

· Application tuning rules apply

Page 119: DocumentM1

Version 3.10 LISA 2006 119

©1994-2006 Hal Stern, Marc Staveley

Client Retransmission (UDP only)

· Unanswered RPC request is retransmitted

- Repeated forever for hard mounts

- Up to 5 times (retrans) for soft mounts

· What can go wrong?

- Severe congestion (storms)

- Server dropping requests/packets

- Network losing requests or sections of them

- One lost packet kills entire request

Page 120: DocumentM1

Version 3.10 LISA 2006 120

©1994-2006 Hal Stern, Marc Staveley

Measuring Client Performance

· Client-side performance is what user sees

· nfsstat -m OK for NFS over UDP

- Shows average service time for lookup, read and write requests

· iostat -n with extended service times

· NFS over TCP harder to measure

- Stream-oriented, difficult to match requests and replies

- tcpdump, snoop to match XIDs in NFS header

  - wireshark (ethereal) does this

Page 121: DocumentM1

Version 3.10 LISA 2006 121

©1994-2006 Hal Stern, Marc Staveley

Client Impatience (NFS over UDP only)

· Use nfsstat -rc

· timers > 0

- Server slower than expected

- nfsstat -m: shows the expected response time

[Figure: request/retransmit timeline - each request increments calls; if the reply takes longer than the expected ~120 msec the timer pops (timers++); retransmissions go out at roughly 700 msec and again at 1400 msec (retrans++ each time)]

% nfsstat -rc

Client rpc:

calls badcalls retrans badxid timeout wait newcred timers

224978 487 64 263 549 0 0 696

% nfsstat -m

/home/thud from thud:/home/thud (Addr 192.151.245.13)

Lookups: srtt = 7 (17 ms), dev=4 (20ms), cur=2 (40ms)

Reads: srtt=14 (35 ms), dev=3 (15ms), cur=3 (60ms)

Note that the NFS backoff and retransmit scheme is not used for NFS over TCP,

since TCP's congestion control and restart algorithms properly fit the connection

oriented model of TCP traffic. The NFS mechanism is used for UDP mounts,

and the timers used for adjusting the buffer sizes and transmit intervals are

shown with nfsstat -m. On an NFS/TCP client, the timers will be zero.

· badcalls > 0

· Soft NFS mount failures

· Operation interrupted (application failure)

· Data loss or application failures

· Should never see these

Page 122: DocumentM1

Version 3.10 LISA 2006 122

©1994-2006 Hal Stern, Marc Staveley

Client's Network View (NFS over UDP only)

· retrans > 5%

- Requests not reaching server or not serviced

· badxid close to 0

- Network is dropping requests

- Reduce rsize, wsize on mount

· badxid > 0

- Duplicate request cache isn't consolidating retransmissions

- Tune server, partition network

Using NFS/TCP or NFS Version 3, you'll be hard-pressed to see badxid counts

above zero. Using TCP, the NFS client doesn't have to retransmit the whole

request, only the part that was lost to the server. As a result, there should rarely

be completely retransmitted requests. NFS V3 implementations also tend to be

more "correct" than V2 implementations, since fewer requests that are not

actually idempotent (like rmdir or remove) are retransmitted.

Page 123: DocumentM1

Version 3.10 LISA 2006 123

©1994-2006 Hal Stern, Marc Staveley

Client Caches

· Caching critical for performance

- If data exists, don't go over the wire

- Dealing with stale data

· Cached items

- Data pages in memory: default

- Data pages on disk: eNFS, CacheFS

- File attributes: in memory

- Directory attributes: in memory

- DNLC: local name lookup cache

Page 124: DocumentM1

Version 3.10 LISA 2006 124

©1994-2006 Hal Stern, Marc Staveley

Attribute Caching

· getattr requests can be > 40% of total

- May hit server disk

· Read-only filesystems

- Increase actimeo to 600 or more

- "Slow start" when data really changes

· Rapidly changing filesystem (mail)

- Try noac for no caching

· File locking disables attribute and data caching

When a file is locked on the client system, that client begins to read and write

the file without any buffering. If your application calls

read(fd, buf, 128);

you'll read exactly 128 bytes over the wire from the NFS server, bypassing the

attribute cache and the local memory cache to be sure you fetch the latest copy

of the data from the server.

If file locking and strict ordering of writes are an issue, consider using a

database.
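
For example (server and path names are hypothetical):

# read-only tools tree: cache attributes for 10 minutes
mount -o ro,actimeo=600 server:/export/tools /tools

# rapidly changing mail spool: turn attribute and data caching off
mount -o noac server:/var/mail /var/mail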

Page 125: DocumentM1

Version 3.10 LISA 2006 125

©1994-2006 Hal Stern, Marc Staveley

CacheFS Tips

· Read-mostly, fixed size working set

- Re-use files after loading into cache

- Write-once files are worst case

- Growing or sparse working set causes thrashing

· Watch size of cache using df

· Multi-homed hosts

- CacheFS creates one cache per host name

- Make client bindings persistent, not random

- Penalty for cold cache less than that for no server

Using CacheFS solves the page-over-the-network problem where a process' text

segment is paged from the NFS server, not from a local disk. When using large

executables (some CAD applications, FORTRAN with many common blocks),

CacheFS may improve paging performance by keeping traffic on the local host.

Page 126: DocumentM1

Version 3.10 LISA 2006 126

©1994-2006 Hal Stern, Marc Staveley

Buffer Sizing

· Default of 8KB good for Ethernet speeds

- At 56 Kb requires > 1 second to transmit

- Remarkably anti-social behavior

- Even worse for NFSv3 (32KB packets)

· Reduce read/write sizes on slow links

- In vfstab, automounter

  - rsize=1024,wsize=2048

- Match to line speed and other uses

- 256 bytes is lower limit

  - readdir breaks with smaller buffer

Line Speed      rsize        Per Packet Latency    Time to Read 1 KByte File

56 kbaud        128 bytes    20 msec               430 msec
56 kbaud        256 bytes    40 msec               310 msec
224 kbaud       256 bytes    10 msec               150 msec
T1 (1.5 Mbit)   1024 bytes   1 msec                42 msec
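
A sketch of how the smaller transfer sizes might be set in an automounter map for clients on the far side of a slow link (names are hypothetical):

# auto_home entry
wanuser    -rw,rsize=1024,wsize=1024    homeserver:/export/home/wanuser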

Page 127: DocumentM1

Version 3.10 LISA 2006 127

©1994-2006 Hal Stern, Marc Staveley

Network Design and Capacity

Planning

Section 4

Page 128: DocumentM1

Version 3.10 LISA 2006 128

©1994-2006 Hal Stern, Marc Staveley

Topics

· Network protocol operation

· Naming services

· Network topology

· Network latency

· Routing architecture

Network reliability colors end to end performance. If your network is delaying

traffic or losing packets, or if you suffer multiple network hops each with long

latency, you will impact what the user sees. The worst possible example is the

Internet: you get variable response time depending upon how many people are

downloading images, what current events have users flocking to the major sites,

and the time of day/day of the week.

Page 129: DocumentM1

Version 3.10 LISA 2006 129

©1994-2006 Hal Stern, Marc Staveley

Network Protocol

Operation

Section 4.1

Page 130: DocumentM1

Version 3.10 LISA 2006 130

©1994-2006 Hal Stern, Marc Staveley

Up & Down Protocol Stacks

ICMP

ARP:update cache

copy into kernel

TCP slow startTCP segmentation

IP: locate route/interfaceIP: MTU fragmentation

IP: find MAC address

Eth: send packet

RIP update

route tables

ARP: get IP mapping

backoff/re-xmit

collision

Eth: accept frame

TCP/UDP: valid port?IP re-assembly

IP: match local IP?

read() on socketwrite() on socket

Solaris exposes nearly every tunable parameter in the TCP, UDP, IP, ICMP and

ARP protocols using the ndd tool.

Find a description of the tunable parameters and their upper/lower bounds on

Richard Steven's web page containing the Appendix to his latest TCP/IP books:

http://www.kohala.com/start/tcpipiv1.appe.update1.ps

Also on docs.sun.com at http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters

Solaris 2 - Tuning Your TCP/IP Stack and More

http://www.sean.de/Solaris

Page 131: DocumentM1

Version 3.10 LISA 2006 131

©1994-2006 Hal Stern, Marc Staveley

Naming Services

Section 4.2

Page 132: DocumentM1

Version 3.10 LISA 2006 132

©1994-2006 Hal Stern, Marc Staveley

Round-Robin DNS

· Use several servers, in parallel, that have unique

IP addresses

- DNS will return all of the IP addresses in response to queries for www.blahblah.com

· Clients resolving the name get the IP addresses

in round-robin fashion

- When DNS cache entry times out, new one is requested

- Clients will wait up to DNS entry lifetime for a retry

Be sure to set the DNS server entry's Time To Live (TTL) to zero or a few

seconds, such that successive requests for the IP address of the named host get

new DNS entries

Name Servers that do Round-Robin:

- BIND 8

- djbdns

- lbnamed (true load balancer written in perl)
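
A minimal zone-file sketch (names and addresses are placeholders): several A records for the same name with a short TTL, so resolvers rotate through the servers and notice changes quickly:

; fragment of the blahblah.com zone
www    30   IN   A   192.0.2.10
www    30   IN   A   192.0.2.11
www    30   IN   A   192.0.2.12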

Page 133: DocumentM1

Version 3.10 LISA 2006 133

©1994-2006 Hal Stern, Marc Staveley

Round-Robin DNS, cont'd

· The Good

- No real failure management, it "just works"

- Scales very well; just add hardware and mix

- Only 1/N clients affected, on average, for N server farm (for a server failure)

· The Bad

- Clients can see minutes of "downtime" as DNS entries expire, if TTL is too long

- Can cheat with multiple A records per host, but not all clients sort them correctly

· The Ugly

- None, if done correctly

Page 134: DocumentM1

Version 3.10 LISA 2006 134

©1994-2006 Hal Stern, Marc Staveley

IP director

Web server

Web server

Web server

Web server

192.9.230.1

192.9.231.0

x.x.x.1

x.x.x.4

x.x.x.2

x.x.x.3

192.9.232.0

IP Redirection

IP director

www.blah.com

Page 135: DocumentM1

Version 3.10 LISA 2006 135

©1994-2006 Hal Stern, Marc Staveley

IP Redirection Mechanics

· Front-end boxes handle IP address resolution

- Public IP address shows up in DNS maps

- Internal (private) IP networks used to distribute load

- Can have multiple networks, with multiple servers

· Improvement over DNS load balancing

- All round-robin choices made at redirector, so client's DNS configuration (or caches) don't matter

- Redirector can be made redundant

- Hosts could be redundant, too

· Cisco NetDirector, Hydra HydraWeb

Page 136: DocumentM1

Version 3.10 LISA 2006 136

©1994-2006 Hal Stern, Marc Staveley

Network Topology

Section 4.3

Page 137: DocumentM1

Version 3.10 LISA 2006 137

©1994-2006 Hal Stern, Marc Staveley

Switch Trunking (802.3ad)

· Multiple connections to host from single switch

· Improves input and output bandwidth

- Eliminates impedance mismatch between switch and network connection

- Spread out input load on server side

· Warnings:

- Trunk can be a SPOF

- Assumes switch can handle mux/demux of traffic at peak loads

Solaris requires the SUNWtrku package (Sun Trunking Software)

Page 138: DocumentM1

Version 3.10 LISA 2006 138

©1994-2006 Hal Stern, Marc Staveley

Latency and Collisions

· Collisions

- CSMA/CD works "late", node backs off, tries again

- Fills Ethernet with malformed frames

· Defers

- CSMA/CD works "early"

- Not counted, but adds to latency

- Collisions "become" defers as more nodes share load

- Use netstat -k (Solaris >= 2.4) or kstat (Solaris >= 7) to see defers and other errors

· 802.11

Page 139: DocumentM1

Version 3.10 LISA 2006 139

©1994-2006 Hal Stern, Marc Staveley

Dealing With Collisions

· Rate = collisions/output packets

· Collisions counted on transmit only

- Monitor on several hosts, especially busy ones

- Use netstat -i or LANalyzer to observe

- Collision rate can exceed 100% per host

· Thresholds

- Should decrease with number of nodes on net

- >5% is clear warning sign

- Usually 1% is a problem

- Correlate to network bandwidth utilization

Most Ethernet chip drivers understate the collision rate. In addition to only

counting collisions in which the station was an active participant, the chip may

report 0, 1 or "more than 1" collision. Most driver implementations take "more than 1" to mean 2, when in fact it could be up to 16 consecutive collisions.

Page 140: DocumentM1

Version 3.10 LISA 2006 140

©1994-2006 Hal Stern, Marc Staveley

Collisions and Switches

· Switched Ethernet cannot have collisions (*)

- Each device talks to switch independently

- No shared media worries

· Still get contention at switch under load

- Ability of switch to forward packets to right interface for output

- Ability to handle input under high loads

· Look for dropped/lost packets on switch

- Results in NFS retransmission, RPC failure, NIS timeouts, dismal TCP throughput

Page 141: DocumentM1

Version 3.10 LISA 2006 141

©1994-2006 Hal Stern, Marc Staveley

Collisions, Really Now

· Full versus Half Duplex

- Full Duplex: each node has a home run and no contention for either path to/from switch

- Half Duplex: you can still see collisions, in rare cases

· What makes switch-host collide?

- Many small packets, in steady streams

- Large segments probably are OK

Page 142: DocumentM1

Version 3.10 LISA 2006 142

©1994-2006 Hal Stern, Marc Staveley

Switches and Routers

· Bridges, Switches

- Very low latency, single IP network or VLAN

- One input pipe per server

· Routers

- Higher latency, load dependent

- Multiple pipes per server

Page 143: DocumentM1

Version 3.10 LISA 2006 143

©1994-2006 Hal Stern, Marc Staveley

Switched Architectures

· Switches offer "home run" wiring

- Each station has dedicated, bidirectional port

- Reduce contention for media (collisions = 0)

- Construct virtual LANs on switch, if needed

- "Smooth out" variations in load

- Only broadcast & multicast normally cross between network segments

· Watch for impedance mismatch at switch

- 80 clients @ 100 Mb/s swamps a 2 Gb/s backplane

Page 144: DocumentM1

Version 3.10 LISA 2006 144

©1994-2006 Hal Stern, Marc Staveley

Network Partitioning

· Switches & bridges for physical partitioning

- Corral traffic on each side of bridge

- Primary goal: reduce contention

· Routing for protocol isolation

- Non-IP traffic (NetWare)

- Broadcast isolation (NetBIOS, vocal applications)

- Non-trusted traffic (use a firewall, too)

- VLAN capability on switches for creating geographically difficult wiring schemes

Page 145: DocumentM1

Version 3.10 LISA 2006 145

©1994-2006 Hal Stern, Marc Staveley

Network Latency

Section 4.4

Page 146: DocumentM1

Version 3.10 LISA 2006 146

©1994-2006 Hal Stern, Marc Staveley

Trickle Of Data?

· Serious fragmentation at router or host

· TCP retransmission interval too short

· Real-live network loading problem

· Handshakes not completing quickly

- Nagle algorithm and TCP slow start

- PCs often get this wrong

- Set tcp_slow_start_initial=2 to send two segments, not just one: dramatically improves web server performance from PC's view

- tcp_slow_start_after_idle=2 as well

"inhibit the sending of new TCP segments when new outgoing data arrives from the user if any previously transmitted data on the connection remains unacknowledged." - John Nagle (RFC 896)

Page 147: DocumentM1

Version 3.10 LISA 2006 147

©1994-2006 Hal Stern, Marc Staveley

Longer & Fatter Pipes

· Fat networks (ATM, GigE, 10GigE)

- Benefit versus cost trade-offs

- Backbone or desktop connections?

· Longer networks (WAN)

- Guaranteed capacity, grade of service?

- End to end security and integrity?

· Latency versus throughput

- Still 20 msec coast to coast

- GigE jumbo frames >> Ethernet in latency, loses for small packets

Page 148: DocumentM1

Version 3.10 LISA 2006 148

©1994-2006 Hal Stern, Marc Staveley

Long Fat Networks

[Figure: a long fat network with 40 msec one-way latency. The sender pushes 4 KB of data in 3 msec over a T1, then waits 70+ msec for acknowledgements before it can send more, producing gaps in the transmit stream. The first bits arrive after 40 msec and the last bit after 43 msec; the receiver acks as fast as it can, but the pipe sits idle most of the time.]

Bad TCP/IP implementations retransmit too much: they treat the high latency as a sign that the packet never arrived, because the retransmit timer is set too small.

Page 149: DocumentM1

Version 3.10 LISA 2006 149

©1994-2006 Hal Stern, Marc Staveley

Tuning For LFNs

· Set the sender and receiver buffer size high

water marks

- Usually an ndd or kernel option, but resist temptation to make "global fix"

- Set using setsockopt() in application to avoid running out of kernel memory

· Buffer depth = 2 * bandwidth * delay product

- or bandwidth * RTT (ping)

- 1.54 Mbit/sec network (T1) with 25 msec delay = 10 KB buffer

- 155 Mbit/sec network (OC3) with 25 msec delay = 1 MB buffer

Solaris:

# increase max tcp window (maximum socket buffer size)

# max_buf = 2 x cwnd_max (congestion window)

ndd -set /dev/tcp tcp_max_buf 4194304

ndd -set /dev/tcp tcp_cwnd_max 2097152

# increase default SO_SNDBUF/SO_RCVBUF size.

ndd -set /dev/tcp tcp_xmit_hiwat 65536

ndd -set /dev/tcp tcp_recv_hiwat 65536

Linux (>= 2.4):

echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem

echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem

See http://www-didc.lbl.gov/tcp-wan.html and

http://www.psc.edu/networking/perf_tune.html for a longer explanation.

A list of tools to help determine the bandwidth of a link can be found at

http://www.caida.org/tools/taxonomy/.
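
A quick bandwidth-delay calculation matching the OC3 example above (numbers taken from the slide):

# 155 Mbit/s link, 25 msec delay; double the product for the buffer high-water mark
awk 'BEGIN { bdp = 155e6 * 0.025 / 8; printf "BDP = %.0f KB, buffer = %.0f KB\n", bdp/1024, 2*bdp/1024 }'
# prints roughly: BDP = 473 KB, buffer = 946 KB -- call it 1 MB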

Page 150: DocumentM1

Version 3.10 LISA 2006 150

©1994-2006 Hal Stern, Marc Staveley

Routing Architecture

Section 4.5

Page 151: DocumentM1

Version 3.10 LISA 2006 151

©1994-2006 Hal Stern, Marc Staveley

IP Routing

· IP is a "gas station" protocol

- Knows how to find next hop

- Makes best effort to deliver packets

· Kernel maintains routing tables

- route command adds entries

- So does routed

- Dynamic updates: ICMP redirects

Page 152: DocumentM1

Version 3.10 LISA 2006 152

©1994-2006 Hal Stern, Marc Staveley

What Goes Wrong?

· Unstable route tables (lies)

- Machines have wrong netmask or broadcast addresses

- Servers route by accident (multiple interfaces)

· Incorrect or missing routes

- Lost packets

  - nfs_server: bad sendreply

· Asymmetrical routes

- Performance skews for in/outbound traffic

Page 153: DocumentM1

Version 3.10 LISA 2006 153

©1994-2006 Hal Stern, Marc Staveley

RIP Updates

· Routers send RIP packets every 30 seconds

- Each router increases costs metric (cap of 15)

- Active/passive gateway notations

- /etc/gateways to seed behavior

· Default routes

- Chosen when no host or network route matches

- May produce ICMP redirects

- /etc/defaultrouter has initial value

Page 154: DocumentM1

Version 3.10 LISA 2006 154

©1994-2006 Hal Stern, Marc Staveley

Routing Architecture

· Default router or dynamic discovery

- One router or several?

- Dynamic recovery

- RDISC (RFC 1256)

· Multiple default routers

· Recovery time

- Function of network radix

Page 155: DocumentM1

Version 3.10 LISA 2006 155

©1994-2006 Hal Stern, Marc Staveley

Tips & Tricks

· Watch for IP routing on servers

- netstat -s shows IP statistics

- Consumes server CPU, network input bandwidth

· Name service dependencies

- Broken routing affects name service

- If netstat -r hangs, try netstat -rn

Page 156: DocumentM1

Version 3.10 LISA 2006 156

©1994-2006 Hal Stern, Marc Staveley

ICMP Redirects

· Packet forwarded over interface on which it

arrived

- ICMP redirect sent to transmitting host

- Sender should update routing tables

· Impact on default routes

- Implies a better choice is available

· Ignore or "fade" on host if incorrect� ndd -set /dev/ip ip_ignore_redirect 1

� ndd -set /dev/ip ip_ire_redirect_interval 15000

· Turn off to appease non-listeners� ndd -set /dev/ip ip_send_redirects 0

Page 157: DocumentM1

Version 3.10 LISA 2006 157

©1994-2006 Hal Stern, Marc Staveley

MTU Discovery

· Sending frames larger than the path MTU gives routers fragmentation work

- Increases latency

- Do work on send side if you know MTU

· RFC 1191 - MTU discovery

- Send packet with "don't fragment" bit set

- Router returns ICMP error if too big

- Repeat with smaller frame size

- Disable with:

  - ndd -set /dev/ip ip_path_mtu_discovery 0

This RFC, like all others, may be found in one of the RFC repositories:

www.rfc-editor.org, www.ietf.org, www.faqs.org/rfcs

Page 158: DocumentM1

Version 3.10 LISA 2006 158

©1994-2006 Hal Stern, Marc Staveley

ARP Cache Management

· ARP cache maintains IP:MAC mappings

· May want to discard quickly

- Mobile IP addresses with multiple hardware addresses, or DHCP with rapid re-attachment

- Network equipment reboots

- HA failover when MAC address doesn't move

· Combined route/ARP entries at IP level

  - ndd -set /dev/ip ip_ire_cleanup_interval 30000

  - ndd -set /dev/ip ip_ire_flush_interval 90000

· Local net ARP entries explicitly aged

  - ndd -set /dev/arp arp_cleanup_interval 60000

See SunWorld Online, February and April 1997

Page 159: DocumentM1

Version 3.10 LISA 2006 159

©1994-2006 Hal Stern, Marc Staveley

Application Architecture

Appendix A

Page 160: DocumentM1

Version 3.10 LISA 2006 160

©1994-2006 Hal Stern, Marc Staveley

Topics

· System programming

· Network programming & tuning

· Memory management

· Real-time design & data management

· Reliable computing

Page 161: DocumentM1

Version 3.10 LISA 2006 161

©1994-2006 Hal Stern, Marc Staveley

System Programming

Section A.1

Page 162: DocumentM1

Version 3.10 LISA 2006 162

©1994-2006 Hal Stern, Marc Staveley

What Can Go Wrong?

· Poor use of system calls

· Polling I/O

· Locking/semaphore operations

· Inefficient memory allocation or leaks

Page 163: DocumentM1

Version 3.10 LISA 2006 163

©1994-2006 Hal Stern, Marc Staveley

System Call Costs

· System calls are traps: serviced like page faults

· Easily abused with small buffer sizes

· Example

- read() and write() on pipe

Buffer size   sy     cs     us   sy   id

10 KBytes     271    41     4    12   84

1 KByte       595    319    5    33   62

1 Byte        3733   2178   11   89   0
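
A simple way to see the effect yourself (a sketch; the absolute times will vary by machine):

# same 10 MB of data, very different system-call counts
time dd if=/dev/zero of=/dev/null bs=8k count=1280
time dd if=/dev/zero of=/dev/null bs=1k count=10240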

Page 164: DocumentM1

Version 3.10 LISA 2006 164

©1994-2006 Hal Stern, Marc Staveley

Using strace/truss

· Shows system calls and return values

- Locate calls that make process hang

- Debug permission problems

- Determine dynamic system call usage

  % strace ls /

  lstat("/", 0xf77ffbb8) = 0

  open("/", 0, 0) = 3

  brk(0xf210) = 0

  fcntl(3, 02, 0x1) = 0

  getdents(3, 0x9268, 8192) = 716

Using strace or truss greatly slows a process down. You're effectively putting a

kernel trap on every system call.

Collating Results

tracestat:

#!/bin/sh

awk '{

if ( $1 == "-" )

print $2

else

print $1

}' | sort | uniq -c

% strace xprocess | tracestat

13 close

87 getitimer

2957 ioctl

13 open

228 read

117 setitimer

582 sigblock

Page 165: DocumentM1

Version 3.10 LISA 2006 165

©1994-2006 Hal Stern, Marc Staveley

Synchronous Writes

· write() system call waits until disk is done

- Often 20 msec or more disk latency

- Reduces buffering/increases disk traffic

· Caused by

- Explicit flag in open()

- sync/update operation, or NFS write

- Closing file with outstanding writes (news spool)

· Typical usage

- Ensuring data delivery to disk, for strict ordering

- Side-effects

close(2) is synchronous: it waits for pending write(2)s to complete, and fails if:

- quota exceeded (EDQUOT)

- filesystem full (ENOSPC)

Check the return value!

Page 166: DocumentM1

Version 3.10 LISA 2006 166

©1994-2006 Hal Stern, Marc Staveley

Eliminating Sync Writes

· NFS v3 or async mode in NFS v2

· Use file locking or semaphores

- Application ensures order of operations, not filesystem

- Better solution for multiple writers of same file

· Avoid open()-write()-close() loops

- Use syslog-like process for logging events

- Use database for preferences, history, environment

Page 167: DocumentM1

Version 3.10 LISA 2006 167

©1994-2006 Hal Stern, Marc Staveley

Network Programming

Section A.2

Page 168: DocumentM1

Version 3.10 LISA 2006 168

©1994-2006 Hal Stern, Marc Staveley

TCP/IP Buffering

· Segment sizes negotiated at connection

- Receiver advertises buffer up to 64K (48K)

- Sender can buffer more/less data

· Determine ideal buffer size on per-application

basis

- Global changes are harmful, can consume resources� setsockopt(..SO_RCVBUF..)

� setsockopt(..SO_SNDBUF..)

The global parameters on Solaris systems are set via ndd(1M):

tcp_xmit_hiwat, udp_xmit_hiwat for sending buffers

tcp_recv_hiwat, udp_recv_hiwat for receiving buffers

Global parameters in /sys/netinet/in_proto.c for BSD systems are:

tcp_sendspace, udp_sendspace

tcp_recvspace, udp_recvspace

Page 169: DocumentM1

Version 3.10 LISA 2006 169

©1994-2006 Hal Stern, Marc Staveley

TCP Transmit Optimization

· Small packets buffered on output

- Nagle algorithm buffers 2nd write until 1st is acknowledged

- Will delay up to 50 msec

  - setsockopt(..TCP_NODELAY..)

· Retransmit timer for PPP/dial-up nets

  - tcp_rexmit_interval_min: default of 100, up to 1500

  - tcp_rexmit_interval_initial: default of 200, up to 2500
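
A sketch of raising the retransmit timers for a dial-up client population, using the upper values from the slide (tune only after measuring):

ndd -set /dev/tcp tcp_rexmit_interval_min 1500
ndd -set /dev/tcp tcp_rexmit_interval_initial 2500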

Page 170: DocumentM1

Version 3.10 LISA 2006 170

©1994-2006 Hal Stern, Marc Staveley

Connection Management

· High request volume floods connection queue

- BSD had implied limit of 5 connections

- Now tunable in most implementations

· Connection requires 3 host-host trips

- Client sends request to server

- Server sends packet to client

- Client completes connection

· Longer network latencies (Internet) require

deeper queue

Page 171: DocumentM1

Version 3.10 LISA 2006 171

©1994-2006 Hal Stern, Marc Staveley

Connections, Part II

· Change listen(5) to listen(20) or more

- 20-32000+ ideal for popular services like httpd

  - ndd -set /dev/tcp tcp_conn_req_max 100

· Socket addresses live on for 2 * MSL

- Database process crashes and restarts

- Can't bind to pre-defined address

- setsockopt(..SO_REUSEADDR..) doesn't help

· Decrease management timers

  - tcp_keepalive_interval (msec)

  - tcp_close_wait_interval (msec) [Solaris < 2.6]

  - tcp_time_wait_interval (msec) [Solaris >= 2.6]

Determine the average backlog using a simple queuing theory formula: average

wait in a queue = service time * arrival rate

With an arrival rate of 150 requests a second, and a round trip handshake time

of 100 msec, you'll need a queue 15 requests deep. Note that 100 msec is just

about the latency of a packet from New York to California and back again.

Once you've fixed the connection and timeout problems, make sure you aren't

running out of file descriptors for key processes like inetd.
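
The backlog arithmetic from the note above, spelled out (150 requests/second and 100 msec are the example numbers, not measurements):

awk 'BEGIN { printf "average backlog = %.0f connections\n", 150 * 0.100 }'
# 15 -- so listen(5) is far too small; listen(100) or tcp_conn_req_max=100 leaves headroom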

Page 172: DocumentM1

Version 3.10 LISA 2006 172

©1994-2006 Hal Stern, Marc Staveley

Memory Management

Section A.3

Page 173: DocumentM1

Version 3.10 LISA 2006 173

©1994-2006 Hal Stern, Marc Staveley

Address Space Layout

· Static areas: text, data

· Initialized data, including globals

· Uninitialized data (BSS)

· Growth

- Stack: grows down from top

- mmap: grows down from below stack limit

- Heap: grows up from top of BSS

Page 174: DocumentM1

Version 3.10 LISA 2006 174

©1994-2006 Hal Stern, Marc Staveley

Stack Management

· Local variables, parameters go on stack

· Don't put large data structures on stack

- Use malloc()

- Can damage window overlaps

Page 175: DocumentM1

Version 3.10 LISA 2006 175

©1994-2006 Hal Stern, Marc Staveley

Dynamic Memory Management

· malloc() and friends, free()

- Library calls on top of brk()

- Don't mix brk() and malloc()

- free() never shrinks heap, SZ is high-water mark

· Cell management

- malloc() puts cell size at beginning of block

- Allocates more than size requested

· Time- or space-optimized variants

- Try standard cell sizes

Page 176: DocumentM1

Version 3.10 LISA 2006 176

©1994-2006 Hal Stern, Marc Staveley

Typical Problems

· Memory leaks: SZ grows monotonically

· Address space fragmentation: MMU thrashing

· Data stride

- Access size matches cache size

- Repeatedly use 1 cache line

- Fix: move globals, resize arrays

· Use mmap() for sparsely accessed files

- More efficient than reading entire file into memory

Page 177: DocumentM1

Version 3.10 LISA 2006 177

©1994-2006 Hal Stern, Marc Staveley

mmap() or Shared Memory?

· Memory mapped files:

- Process coordination through file name

- Backed by filesystem, including NFS

- No swap space usage

- Writes may cause large page flush, better for reading

· Shared memory

- More set-up and coordination work with keys

- Backed by swap, not filesystem

- Need to explicitly write to disk

Page 178: DocumentM1

Version 3.10 LISA 2006 178

©1994-2006 Hal Stern, Marc Staveley

Real-Time Design

Section A.4

Page 179: DocumentM1

Version 3.10 LISA 2006 179

©1994-2006 Hal Stern, Marc Staveley

Why Worry About Real-Time?

· New computing problems

- Customer service with live transfer

- Real-time expectations of customers

- Web-based access

- If a user's in front of it, it's real time

· Predictable response times

- High volume transaction environment

- Threaded/concurrent programming models

· Things to learn from device drivers

Page 180: DocumentM1

Version 3.10 LISA 2006 180

©1994-2006 Hal Stern, Marc Staveley

System V Real-Time Features

· Real-time scheduling class

- Kernel pre-emption, including system calls

- Process promotion to avoid blocking chains

· No real-time network or filesystem code

· Resource allocation and "nail down"

- mlock(), plock() to lock memory/processes

· Move process into real-time class with priocntl

Page 181: DocumentM1

Version 3.10 LISA 2006 181

©1994-2006 Hal Stern, Marc Staveley

Real-Time Methodology

· Processes run for short periods

- Same model used by Windows

- Must allow scheduler to run: sleep or wait

- CPU-bound jobs will lock system

· Time quanta inversely proportional to priority

- Minimize latency to schedule key jobs

- Ship low-priority work to low-priority thread

· No filesystem/network dependencies

Page 182: DocumentM1

Version 3.10 LISA 2006 182

©1994-2006 Hal Stern, Marc Staveley

Summary

Page 183: DocumentM1

Version 3.10 LISA 2006 183

©1994-2006 Hal Stern, Marc Staveley

Parting Shots, Part 1

· Be greedy

- Solve for biggest gains first

- Don't over-tune or over-analyze

· Don't trust too much

- 3rd party code, libraries, blanket statements

- Verify information given to you by users

· Bottlenecks are converted

- Add network pipes, reduce latency, hurt server

- Fixing one problem creates 3 new ones

- Some speedups are superlinear

Page 184: DocumentM1

Version 3.10 LISA 2006 184

©1994-2006 Hal Stern, Marc Staveley

Parting Shots, Part 2

· Today's hot technology is tomorrow's capacity

headache

- Web browser caches, PCN, streaming video

- But taxing use leads to insurrection

· Rules change with each new release

- New features, new algorithms

- RFC compliance is creative art

· Nobody thanks you for being pro-active

- But you should be!