TRANSCRIPT
Version 3.10 LISA 2006 1
©1994-2006 Hal Stern, Marc Staveley
System & Network
Performance Tuning
Hal Stern, Sun Microsystems
Marc Staveley, SOMA Networks
This tutorial is copyright 1994-1999 by Hal L. Stern and 1998-2006 by Marc Staveley. It may not be used in whole or part for commercial purposes without the express written permission of Hal L. Stern and Marc Staveley.
Hal Stern is a Distinguished Systems Engineer at Sun Microsystems. He was the System Administration columnist for SunWorld from February 1992 until April 1997, and previous columns and commentary are archived at: http://www.sun.com/sunworldonline.
Hal can be reached at [email protected].
Marc Staveley is the Director of IT for SOMA Networks Inc. He is a frequent speaker on the topics of standards-based development, multi-threaded programming, system administration and performance tuning.
Marc can be reached at [email protected].
Some of the material in the Notes sections has been derived from columns and articles first appearing in SunWorld, Advanced Systems and SunWorld Online. Hal thanks IDG and Michael McCarthy for their flexibility in allowing him to retain the copyrights to these pieces.
Rough agenda:
9:00 - 10:30 AM Section 1
11:00 - 12:30 PM Section 2
1:30 - 3:00 PM Section 3
3:30 - 5:00 PM Section 4
Syllabus
· Tuning Strategies & Expectations
· Server Tuning
· NFS Performance
· Network Design, Capacity Planning &
Performance
· Application Architecture
Some excellent books on the topic:
Raj Jain, Computer System Performance (Wiley)
Mike Loukides, System Performance Tuning (O'Reilly)
Adrian Cockcroft and Richard Pettit, Sun Performance and Tuning, Java and the
Internet (SMP/PH)
Craig Hunt, TCP/IP Network Administration (O'Reilly)
Brian Wong, Configuration and Capacity Planning for Solaris Servers
(SunSoft/PH)
Richard McDougall et al., Sun Blueprints: Resource Management (SMP/PH)
Some Web resources:
Solaris Tunable Parameters Reference Manual
(http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters/)
Solaris 2 - Tuning Your TCP/IP Stack and More (http://www.sean.de/Solaris)
Tutorial Structure
· Background and internals
- Necessary to understand user-visible symptoms
· How to play doctor
- Diagnosing problems
- Rules of thumb, upper and lower bounds
· Kernel tunable parameters
- Formulae for deriving where appropriate
If you take only two things from the whole tutorial, they should be:
- Disk configuration matters
- Memory matters
Tuning Strategies &
Expectations
Section 1
Topics
· Practical goals
- Terms & conditions
- Workload characterization
· Statistics and ratios
- Monitoring intervals
· Understanding diagnostic output
Practical Goals
Section 1.1
Why Is This Hard?
Business
Transaction
Database
Transaction
Transaction Monitor
DBMS Organization
SQL Optimizer
System
CPU
Network
Latency
Disk
I/O
User
CPU
increasing
loss of
correlation
decreasing
ease of
measurement
The problem with un-correlated inputs and measurements is akin to that of
driving a car while blindfolded: the passenger steers while elbowing you to
work the gas and brakes. When your reflexes are quick, you can manage, but if
you misinterpret a signal, you end up rebooting your car.
Correlating user work with system resources is what Sun's DTrace and
FreeBSD's ktrace attempt to do.
Social Contract Of Administration
· Why bother tuning?
- Resource utilization, purchase plans, user outrage
· Users want 10x what they have today
- sound and video today, HDTV tomorrow
- Simulation and decision support capabilities
· Application developers should share
responsibility
- Who owns the educational process?
- Performance and complexity trade-off
- Load, throughput and cost evaluations
System administrators today are playing a difficult game of perception
management. Hardware prices have declined to the point where most
managers believe you can get Tandem-like fault tolerance at PC prices with no
additional software, processes or disciplines. Much of this tutorial is about
acquiring, enforcing and insisting on discipline.
Tuning Potential
· Application architecture: 1,000x
- SQL, query optimizer, caching, system calls
· Server configuration: 100x
- Disk striping, eliminate paging
· Application fine-tuning: 2-10x
- Threads, asynchronous I/O
· Kernel tuning: less than 2x on tuned system
- If kernel bottleneck is present, then 10-100x
- Kernel can be a binary performance gate
Here are some "laws" of the computing realm compared:
Moore's law predicts a doubling of CPU horsepower every 18 months, so that
gives us about a 16x improvement in 6 years.
If you look at reported transaction throughput for Unix database systems,
though, you'll see a 100x improvement in the past 6 years -- there's more than
just compute horsepower at work. What we've measured is the result of
operating systems, disks, parallelism, bus throughput and improved
applications.
An excellent discussion of "rules of thumb" as a consequence of Moore's Law is
found in Gray and Shenoy's Rules of thumb in data engineering, Microsoft
Research technical report MSR-TR-99-100, Feb. 2000.
Practical Tuning Rules
· There is no "ideal" state in a fluid world
· Law of diminishing returns
- Early gains are biggest/best
- More work may not be cost-effective
· Negativism prevails
- Easy to say "This won't work"
- Hard to prove configuration can deliver on goals
· Headroom for well-tuned applications?
- Good tuning job introduces new demands
· Kaizen
Terminology: Bit Rates
· Bandwidth
- Peak of the medium, bus: what's available
- Easy to quote, hard to reach
· Throughput
- What you really get: useful data
- Protocol dependent
· Utilization
- How much you used
- Not just throughput/bandwidth
- 100% utilized with useless data: collisions
Bandwidth => Utilization => Throughput
Each measurement shows a slight (or sometimes great) loss over the previously
ordered metric.
Formal definitions:
Bandwidth: the maximum achievable throughput under ideal workload
conditions (nominal capacity)
Throughput: rate at which the requests can be serviced by the system.
Utilization: the fraction of time the resource is busy servicing requests.
Terminology: Time
· Latency
- How long you wait for something
· Response time
- What user sees: system as a black box
- Standard measures
  TPC-C: transactions per minute
  TPC-D: queries per hour
· Bad Things
- Knee, wall, non-linear
[Figure: throughput vs. load curve, marking the knee capacity and usable capacity]
Example
· Bandwidth to NYC
- 10 lanes x 5 cars/s x 4 people/car = 200 pps
· Throughput
- 1 person/car (bad protocol), 1-2 cars/s (congestion)
- Parking delays (latency)
· How to fix it
- Increase number of lanes (bandwidth)
- More people per vehicle (efficient protocol)
- Eliminate toll (congestion)
- Better parking lots (reduce latency)
Tolls add to latency (since you have to stop and pay them) and also to
congestion when traffic merges back into a few lanes. Congestion from traffic
merges is another form of increased latency.
Now consider this: You wire your office with 100baseT to the desktops, feeding
into 1000baseT switched Ethernet hubs. If you run 16 desktops into each
switch, you're merging 16 * 100 = 1600 Mbits/sec into a 1000 Mbits/sec
"tunnel".
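The arithmetic in the highway example and the switch scenario above is simple enough to sketch; the function names here are invented for illustration:

```python
# Back-of-envelope versions of the two calculations above. Function names
# are invented for illustration.
def highway_bandwidth(lanes, cars_per_sec, people_per_car):
    """Peak capacity under ideal conditions (the 'bandwidth')."""
    return lanes * cars_per_sec * people_per_car

def oversubscription(ports, port_mbits, uplink_mbits):
    """Offered load divided by uplink capacity."""
    return ports * port_mbits / uplink_mbits

print(highway_bandwidth(10, 5, 4))      # 200 people/sec into NYC
print(oversubscription(16, 100, 1000))  # 1.6: a 1600 -> 1000 Mbit/s merge
```

Any oversubscription ratio above 1.0 means the uplink becomes the congestion point whenever all ports transmit at once.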
Unit Of Work Paradox
· Unit of work is the typical "chunk size" for
- Network traffic
- Disk I/O
· Small units optimized for response time
- Network transfer latency, remote processing
· Large units optimized for protocol efficiency
- Compare ftp (~4% waste) & telnet (~90% waste)
- Ideal for large transfers like audio, video
· Where does ATM fit?
ATM uses fixed-size cells (53 bytes, 48 of them payload), making it well suited
to traffic that must be optimized for response time, such as interactive voice
and video. Unfortunately, because the cells are so small, ATM incurs a large
processing overhead on bulk transfers of large files, such as stored audio or
video clips.
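A rough sketch of where waste figures like these come from, assuming about 58 bytes of Ethernet/IP/TCP header per packet and ATM's 5-byte cell header; the exact numbers vary with options and media:

```python
# Back-of-envelope waste per packet, assuming ~58 bytes of Ethernet/IP/TCP
# header per full-size (1460-byte) segment, and ATM's 5-byte header on a
# 48-byte payload. Header sizes vary with options; these are illustrative.
def waste(payload, overhead):
    return overhead / (payload + overhead)

print(f"{waste(1460, 58):.1%}")  # 3.8%: bulk ftp-style transfer
print(f"{waste(48, 5):.1%}")     # 9.4%: ATM cell header alone
```

A one-byte interactive payload (telnet) against the same per-packet overhead wastes well over 90% of each frame; the exact figure depends on which overheads you count.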
Workload Characterization
· What are the users (processes) doing?
- Estimating current & future performance
- Understanding resource utilization
· Fixed workloads
- Easy to characterize & project
· Random workloads
- Take measurements, look at facilities over time
- Tools & measurement intervals
Completeness Counts
· Random or sequential access?
- Koan of this tutorial
· Don't say: 1,000 NFS requests/second
- Read/write and attribute browsing mix?
- Average file size and lifetime?
- Working set of files?
· Don't say: 400 transactions/second
- Lookup, insert, update mix?
- Indexes used?
- Checkpoints, logs, 2-phase commit?
Statistics & Ratios
Section 1.2
Useful Metrics
· Latency over utilization
- Loaded resources may be sufficient
- What does the user see?
· Peak versus average load
- How system reacts under crunch
- What are new failure modes at peaks?
· Time to:
- Recover, repair, rebuild from faults
- Accommodate new workflow
· Managing applications
Recording Intervals
· Instantaneous data rarely useful
- Computer and business transactions long-lived
- Smooth out spikes in small neighbourhoods
· Long-term averages aren't useful either
- Peak demand periods disappear
- Can't tie resources to user functions
· Combine intervals
- 5-10 seconds for fine-grain work (OLTP)
- 10-30 seconds for peak measurement
- 10-30 minutes for coarse-grain activity (NFS)
Nyquist Frequency
· Same total load between B and D
- Peaks are different at C
- Sampling frequency determines accuracy
· Nyquist frequency is >2x "peak cycle"
- Peaks every 5 min, sample every 2.5 min
[Figure: two load curves sampled at points A, B, C, D, E]
The total area under the two curves is about the same from "B" to "D". If you
simply measure at these endpoints and take an average, you'll think the two
loads are the same, and miss the peaks. If you measure at twice the frequency of
the peaks -- "B", "C" and "D", you'll see that peak demand is greater than the
average on the green-lined system.
The Nyquist theorem: to reconstruct a sampled input signal accurately,
sampling rate must be greater than twice the highest frequency in the input
signal.
The Nyquist frequency: the sampling rate / 2
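A toy illustration of the sampling argument, using a synthetic load with a spike every 5 minutes; all the numbers here are invented:

```python
import math

# Synthetic load: a baseline of 10 with a sharp spike to 50 every 300 s.
def load(t):
    return 10 + 40 * max(0.0, math.sin(2 * math.pi * t / 300)) ** 8

def observed_peak(interval_s, duration_s=1800):
    return max(load(t) for t in range(0, duration_s, interval_s))

# Sampling once per peak period aliases the spikes away entirely;
# sampling well above the Nyquist rate (here every 25 s) catches them.
print(observed_peak(300))  # 10.0: every sample lands between spikes
print(observed_peak(25))   # 50.0: the true peak is seen
```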
Normal Values
· Maintain baselines
- "But it was faster on Tuesday!"
- Distinguish normal and threshold-crossing states
- Correlate to type of work being done (user model)
· Scalar proclamations aren't valuable
- CPU load without application knowledge
- Disk I/O traffic without access patterns
- Memory usage without cache hit data
Effective Ratios
· Find relationships between work and resources
- Units of work: NFS operations, DB requests
- Units of management: disks, memory, network
· Use correlated variables
- Or ratios are just randomly scaled samples
· Measure something to be improved
- Bad example: Bugs/lines of code
- Good example: collisions/packet size
· Confidence intervals
- Sensitivity of ratio & error bars (accuracy)
Be sure you can control granularity of the denominator. That is, you shouldn't
be able to cheat by increasing the denominator and lowering a cost-oriented
ratio, showing false improvement. Bugs per line of code is a miserable metric
because the code size can be inflated. Quality is the same but the metric says
you've made progress.
The accuracy of a ratio is multiplied by its sensitivity - a small understatement in
a ratio that grows superlinearly with its denominator turns into a large error.
When you multiply two inexact numbers, you also multiply their errors
together. Looking at 50 I/O operations per second, plus or minus 5 Iops is
reasonable, but 50 Iops plus or minus 45 Iops is the same as taking a guess.
The Arms index, named for Richard Arms, is sometimes called the TRIN
(Trading Index). It's a measure of correlation between the price and volume
movements of the NYSE. Instead of looking at up stocks/down stocks or up
volume/down volume, the Arms index computes
(up stocks/down stocks) / (up vol/down vol)
When the index is at 1.0, the up and down volumes reflect the number of issues
moving in each direction. An index of 0.5 means advancing issues have twice
the share volume of decliners (strong); an index over 2.0 means the decliners are
outpacing the gainers on a volume basis.
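The Arms/TRIN ratio as defined above, with invented sample figures:

```python
# The Arms/TRIN index described above. Inputs are invented sample figures.
def arms_index(up_stocks, down_stocks, up_volume, down_volume):
    return (up_stocks / down_stocks) / (up_volume / down_volume)

# Equal breadth, but advancers carry twice the volume: a "strong" 0.5.
print(arms_index(1500, 1500, 800e6, 400e6))  # 0.5
```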
Understanding
Diagnostic Output
Section 1.3
General Guidelines
· Use whatever works for you
- Make sure you understand output format & scaling
- Know inconsistencies by platform & tool
· Ignore the first line of output
- Average since system was booted
- Interval data is more important
· Accounting system
- Source of accurate fine-grain data
- Must be enabled on most systems
Process accounting gives you detailed breakdowns of resource utilization,
including the number of system calls, the amount of CPU used, and so on. This
adds at most a few percent to system overhead. While accounting can be about
5% in worst case, auditing (used for security and fine-grain access control) adds
between 10-20% overhead. Auditing tracks every operation from a user process
into the kernel.
If your system stays up for a long (100 days or more) period of time, you may
find some of the counters wrap around their 31-bit signed values, producing
negative reported values.
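A hypothetical illustration of that wrap: a counter held as a signed 32-bit integer goes negative once the running total passes 2**31 - 1.

```python
# Hypothetical illustration of the wrap: a counter held in a signed 32-bit
# int appears negative once the running total exceeds 2**31 - 1.
def as_signed32(total):
    v = total & 0xFFFFFFFF
    return v - 2**32 if v >= 2**31 else v

print(as_signed32(2**31 - 1))  # 2147483647: the last positive value
print(as_signed32(2**31))      # -2147483648: the counter appears negative
```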
Standard UNIX System Tools
· vmstat, sar
- Memory, CPU and system (trap) activity
- sar has more detail, histories
- vmstat uses KB, sar uses pages
· iostat
- Disk I/O service time and operation workhorse
· nfsstat
- Client and server side data
· netstat
- TCP/IP stack internals
· pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx, pstop, prun, pwait,
ptree, ptime: (Solaris) display various pieces of information about process(es)
in the system.
· mpstat (Solaris, Linux): per-processor statistics, e.g. faults, inter-processor
cross-calls, interrupts, context switches, thread migrations etc.
· top (all), prstat (Solaris): show an updated view of the processes in the system.
· memtool (Solaris <=8): everything you ever wanted to know about the
memory usage in a Solaris box [http://playground.sun.com/pub/memtool]
· mdb ::memstat (Solaris >=9): same info as memtool
· lockstat, DTrace (Solaris >=10): what are the processes and kernel really
doing?
· setoolkit (Solaris, and soon others): virtual performance experts
[http://www.setoolkit.com]
· kstat (Solaris): display kernel statistics
· RRDB/ORCA/Cricket/MRTG/NRG/Smokeping/HotSaNIC/OpenNMS:
performance graphing tools
· HP Perfview: part of OpenView
Accounting
· 7 processes running on a loaded system
- top or ps show "cycling" of processes on CPUs
- Which one is the pig in terms of user CPU, system calls, disk I/O initiated?
· Accounting data shows per-process info
- Memory
- CPU
- System calls
Turn on accounting - Mike Shapiro (Distinguished Engineer at Sun, and all-round kernel guru) claims that the overhead of accounting is low. The kernel always collects the data; you just pay the I/O overhead to write it to disk.
Output Interpretation: vmstat
% vmstat 5
procs memory page disk faults cpu
r b w free re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18
- procs - running, blocked, swapped
- fre - free memory (not process, kernel, cache)
- re - reclaims, page freed but referenced
- at - attaches, page already in use (i.e., shared library)
- pi/po - page in/out rates
- fr - page free rate
- sr - paging scan rate
Always, always drop the first line of output from system tools like vmstat. It
reflects totals/averages since the system was booted, and isn't really meaningful
data (certainly not for debugging).
You'll see the fre column start high - close to the total memory in the system -
and then sink to about 5% of the total memory over time, in systems like Solaris
(<= 2.6), Irix and other V.4 variants. This is due to file and process page caching,
and is perfectly normal.
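When post-processing vmstat output with a script, the same rule applies. This sketch uses a simplified, hypothetical column set (real vmstat output repeats field names like "sy", which a dict would silently clobber):

```python
# Sketch of the "drop the first line" rule when scripting around vmstat.
# The column set here is simplified and hypothetical.
SAMPLE = """\
r b w free re at pi po fr de sr us sy id
1 0 0 1788  0  1 36  0  0 16  0 45 14 41
3 0 0 2000  0  1 60  0  0  0  0 38 45 18
"""

def parse_vmstat(text):
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = [dict(zip(header, map(int, line.split()))) for line in lines[1:]]
    return rows[1:]   # discard row one: averages since boot, not interval data

for row in parse_vmstat(SAMPLE):
    print(row["us"], row["sy"], row["id"])  # 38 45 18
```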
Interpretation, Part 2
% vmstat 5
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18
- disk - disk operations/sec, use iostat -D
- in - interrupts/sec, use vmstat -i
- sy - system calls/sec
- cs - context switches/sec
- us - % CPU in user mode
- sy - % CPU in system mode
- id - % CPU idle time
- swap (Solaris) - amount of swap space used
- mf (Solaris) - minor fault, did not require page in (zero fill on demand, copy on write, segmentation or bus errors)
Zero fill on demand (ZFOD) pages are paged in from /dev/zero, and produce
(as you would expect) a page filled with zeros, quite useful for the uninitialized
data (bss) segment of a process.
Example #1
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
2 0 0 1788 0 1 36 0 0 0 0 6 0 0 0 42 45 297 97 2 1
3 0 0 2000 0 1 60 0 0 0 0 2 0 0 0 83 97 226 94 4 2
·High user time, little/no idle time
·Some page-in activity due to filesystem reads
·Application is CPU bound
Example #2
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 11 0 1788 0 0 34 0 0 0 0 24 10 0 0 34 272 310 25 58 17
3 10 0 2000 0 0 30 0 0 0 0 14 12 0 0 35 312 340 26 55 19
·Heavy disk activity resulting from system calls
·Heavy system CPU utilization, but still some idle
time
·Database or web server with badly tuned disks
·Lower system call rate implies NFS server, same
problems
System calls can "cause" interrupts (when I/O operations complete), network
activity, and disk activity. A high volume of network inputs (such as NFS traffic
or http requests) can cause the same effects, so it's important to dig down
another level to find the source of the load.
Example #3
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 0 0 1788 0 0 4 0 0 0 0 1 0 0 0 534 10 25 15 80 5
2 0 0 2000 0 0 3 0 0 0 0 1 0 0 0 515 12 30 15 83 2
· High interrupt rate without disk or system call
activity
· Implies network, serial port or PIO device
generating load
· Host acting as router, unsecured tty port or a
nasty token ring card
Example #4
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 3 0 1788 0 12 54 30 60 0 100 53 0 0 0 64 110 105 15 10 75
2 4 0 2000 0 10 43 28 58 0 110 41 0 0 0 60 112 130 12 10 78
· Page-in/page-out and free rates indicate VM
system is busy
· High idle time from waiting on disk
· Paging/swapping to root disk (primary swap
area)
· Machine is memory starved
Server Tuning
A single machine (works for desktops too)
Section 2
Topics
· CPU utilization
· Memory consumption & paging space
· Disk I/O
· Filesystem optimizations
· Backups & redundancy
Tuning Roadmap
· Eliminate or identify CPU shortfall
· Reduce paging and fix memory problems
· Balance disk load
- Volume management
- Filesystem tuning
- Backups & integrity planning
Do steps in this order.
CPU Utilization
Section 2.1
Where Do The Cycles Go?
· > 90% user time
- Tune application code, parallelize
· > 30% system time
- User-level processes: system programming
· Kernel-level work consumes system time
- NFS, DBMS calls, httpd calls, IP routing/filtering
· NIS, DNS (named), httpd are user-level
- High system-level CPU without corresponding user-level CPU is unusual in these configurations
Perhaps the best tool for quickly identifying CPU consumers is top.
top is a continuously updating, full-screen version of ps that runs on nearly every Unix variant.
A high system CPU % on an NIS or DNS server could indicate that the machine
is also acting as a router, or handling other network traffic.
Idle Time
· > 10% idle
- I/O bound, tune disks
- Input bound, tune network
· %wait, %wio are for disk I/O only
- Network I/O shows up as idle time
- RPC, NIS, NFS are not I/O waits
One possibility for high idle time is that the system is really doing nothing. This
is fine if you aren't running any jobs, but if you are expecting input and aren't
getting it, it's time to look away from the client/server and at the network. The
client trying to send on the network will show a variety of network contention &
latency problems, but the server will appear to be idle.
Multiprocessor Systems
· vmstat, sar show averages
· Example: 25% user time on 4-way host
- 4 CPUs at 25% each
- 2 CPUs at 50% each, 2 idle
- 1 CPU at 100%, 3 idle
· Apply rules on per-CPU basis
· System-specific tools for breakdown
- mpstat, psrinfo (Solaris 2.x)
A Puzzle
· Server system with framebuffer behaves well
(mostly)
· Periodically experiences major slowdown
- File service slows to crawl
- User and system CPU total near 100%
· Can never find problem on console; problem
disappears when monitoring begins
Controlling CPU Utilization
· Process "pinning"
- Maintain CPU cache warmth
- Cut down on MP bus/backplane traffic
- Unclear effects for multi-threaded processes
· Resource segregation
- Scheduler tables
- Process serialization
- Memory allocation
- E10K domains
· OS may do a better job than you do!
"pinning" in Solaris may be done with the "psr" commands: psrset, psrinfo.
Process Serialization
· Multiple user processes: good MP fit
- Memory, disk must be sufficient
· Resource problems
- # jobs > # CPUs
- sum(memory) > available memory
- Cache thrashing (VM or CPU)
· Resource management to the rescue
The key win of using a batch scheduler is that it controls usage of memory and
disk resources as well. Even if you're not CPU bound, a job scheduler can
eliminate contention for memory (discussed later) by controlling the total
memory footprint of jobs that are runnable at any one time. When you're short
on memory, two jobs each given half the resources don't add up to one job's
worth of throughput; it's more like half.
Resource Management
· Job Scheduler: serialization
- Batch queue system
- Line up jobs for CPUs like bank tellers
- Manage total memory footprint
· Batch Scheduler: prioritization
- Modifies scheduler to only let some jobs run when system is idle
· Fair Share Scheduler: parallelization
- Gives groups of processes "shares" of memory and CPU
Your goal with the job scheduler is to reduce the average wait time for a job. If
the typical time to complete is 5 minutes for a job when, say, 5 jobs run in
parallel, then you should try getting the average completion time down into the
1 1/2 to 3 minute range by freeing up resources for each job to run as fast as
possible. Even though the jobs run serially, the expected time to completion is
lower when each job finishes more quickly.
A batch scheduler for Solaris is available from Sun PS's Customer Engineering
group. An example of one produced using the System V dispatch table is
described in SunWorld, July 1993.
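A toy model of the serialization argument above: five jobs, each needing one minute of CPU, on a single processor; the numbers are invented for illustration.

```python
# Toy model: five jobs, each needing 1 minute of CPU, on one processor.
def parallel_completion(n_jobs, cpu_minutes):
    # Ideal round-robin: every job gets 1/n of the CPU, all finish together.
    return [n_jobs * cpu_minutes] * n_jobs

def serial_completion(n_jobs, cpu_minutes):
    # Batch queue: jobs run one at a time, finishing at 1, 2, ... n minutes.
    return [(i + 1) * cpu_minutes for i in range(n_jobs)]

def avg(xs):
    return sum(xs) / len(xs)

print(avg(parallel_completion(5, 1)))  # 5.0: everyone waits 5 minutes
print(avg(serial_completion(5, 1)))    # 3.0: average completion drops
```

The total work is identical in both cases; only the average time-to-completion changes, which is exactly the metric users perceive.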
Context Switches
· What is a context switch? (cs or ctx)
- New runnable thread (kernel or user) gets CPU
- Rates vary on MP systems
· Causes
- Single running process yields to scheduler
- Interrupt makes another process runnable
- Process waits on I/O or event (signal)
· A symptom, not a problem
- With high interrupt rates: I/O activity
- With high system call rates: bad coding
Traps and System Calls
· What is a trap?
- User process requests operating system help
· Causes
- System call, page fault (common)
- Floating Point Exception
- Unimplemented instructions
- Real memory errors
· Less common traps are cause for alarm
- Wrong version of an executable
- Hardware troubles
Version mismatches:
SPARC V7 has no integer multiply/divide
SPARC V8 has imul/idiv, and optimized code uses it. When run on a SPARC
V7 machine, each imul generates an unimplemented instruction trap, which the
kernel handles through simulation, using the same user-level code the compiler
would have inserted for a V7 chip.
Symptoms of this problem: very high trap rate (on the order of thousands per
second, or about one per arithmetic operation) but no system calls. Normally, a
high trap rate is coupled with a high system call rate -- the system calls generate
traps to get the kernel's attention.
Memory Consumption
& Paging (Swap) Space
Section 2.2
Page Lifecycle
· Page creation: at boot time
· Page fills
- From file: executable, mmap()
- From process: exec()
- Zero Fill On Demand (zfod): /dev/zero
· Page uses
- Kernel and its data structures
- Process text, data, heap, stack
- File cache
· Pages backed by filesystem or paging (swap)
space
/dev/zero is the "backing store" for zero-filled pages. It produces an endless
stream of zeros -- you can map it, read it, or cat it, and you get zeros.
/dev/null is a bottomless sink – you write to it and the data disappears.
Reading from /dev/null produces an immediate end of file, not pages of
zeros.
Filesystem Cache
· System V.4 uses main memory
- Systems run with little free memory
- Available frames used for files
· Side effects
- Some page freeing normal
- All filesystem I/O is page in/out
· Solaris (>= 8)
- Uses the cyclic page cache for filesystem pages
  (filesystem cache lives on the free list)
Paging (Swap) Space & VM
· BSD world
- Total VM = swap plus shared text segments
- Must have swap at least as large as memory
- Can run out of swap before memory
· Solaris world
- Total VM = swap + physical memory - N pages
- Can run swap-less
- Swap = physical memory "overage"
· Running out of swap
- EAGAIN, no more memory, core dumps
If you run swapless, you cannot catch a core dump after a reboot (since there is
no space for the core dump to be written).
Estimating VM Requirements
· Look at output of ps command
- RSS: resident set size, how much is in memory
  Total RSS field for lower bound
- SZ: stack and data, backed by swap space
  Total SZ field for upper bound, good first cut
· Memory leaks
- Processes grow: SZ increases
- Examine use of malloc()/free()
- Will exhaust paging space; may hang system
Under Solaris, you can use the memtool package to estimate VM requirements
(http://playground.sun.com/pub/memtool).
Memory leaks are covered in more detail in Section 5, as an application problem.
Your first indication that you have an issue is when you notice VM problems,
which should point back to an application problem, so we mention it here first.
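A hypothetical sketch of totalling those two ps fields; the column layout here is invented, so check your platform's ps(1) flags and units (often pages, not KB):

```python
# Hypothetical sketch: totalling RSS (lower bound) and SZ (upper bound)
# from ps output. The column layout is invented for illustration.
PS_OUTPUT = """\
  PID    SZ   RSS COMMAND
  101  2048  1024 nfsd
  102  8192  3072 oracle
  103  1024   512 sendmail
"""

def vm_bounds(text):
    rows = text.strip().splitlines()[1:]      # skip the header line
    sz = sum(int(r.split()[1]) for r in rows)
    rss = sum(int(r.split()[2]) for r in rows)
    return rss, sz                            # (lower bound, upper bound)

print(vm_bounds(PS_OUTPUT))  # (4608, 11264)
```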
Paging Principles
· Reclaim pages when memory runs low
- Start running pagedaemon (pid 2)
· Crisis avoidance
- Guiding principle of VM system
· Page small groups on demand
- Keep small pool free
- Swap to make large pools available
- Compare 200M swapped out in one step to 64M paged out in 16,000 steps
The hands of the "clock" sweep through the page structures at the same rate,
at a fixed distance apart (handspreadpages).
If the fronthand encounters a page whose reference bit is on, it turns the bit
off. When the backhand looks at the page later, it checks the bit. If the bit is
still off, nothing referenced this page since the fronthand looked at it. The
page may move onto the page freelist (or be written to swap).
The rate at which the hands sweep through the page structures varies
linearly with the amount of free memory. If the amount of free memory is
lotsfree, the hands move at a minimum scan rate, slowscan. As the
amount of free memory approaches 0, the scan rate approaches fastscan.
Handspreadpages – determines the amount of time an application has to
touch a page before it will be stolen for the free list.
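A minimal sketch of the two-handed clock described above. Pages are modelled as a circular list of reference bits; the layout and sizes are invented.

```python
# Minimal sketch of the two-handed clock: the fronthand clears each page's
# reference bit; the backhand, handspread positions behind, frees any page
# whose bit is still clear when it arrives.
def sweep(ref_bits, start, npages, handspread):
    """Advance both hands npages positions; return indices of pages freed."""
    size = len(ref_bits)
    freed = []
    for i in range(npages):
        front = (start + handspread + i) % size
        back = (start + i) % size
        ref_bits[front] = 0          # fronthand clears the reference bit
        if ref_bits[back] == 0:      # still untouched when the backhand arrives
            freed.append(back)       # candidate for the free list (or swap)
    return freed

pages = [1, 0, 1, 1, 0, 1, 0, 1]     # 1 = referenced since the last sweep
print(sweep(pages, start=0, npages=4, handspread=3))  # [1, 3]
```

A larger handspread gives applications a longer window to re-reference a page before the backhand reaches it.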
VM Pseudo-LRU Analysis
· pagedaemon runs every 1/4 second
- Runs for 10 msec to "sweep a bit"
· Clock algorithm
- Pages arranged in logical circle
[Figure: clock algorithm - pages in a circle, with the fronthand and backhand separated by "handspread"]
VM Thresholds (Solaris >= 2.6)
· lotsfree: defaults to 1/64 of memory
- Point at which paging starts
- Up to 30% of memory (not enforced)
· desfree: panic button for swapper
- ½ lotsfree
· minfree: unconditional swapping
- ½ desfree
- low water mark for free memory
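The default threshold ladder above can be sketched numerically, in pages:

```python
# The Solaris >= 2.6 default threshold ladder described above, in pages.
def vm_thresholds(physmem_pages):
    lotsfree = physmem_pages // 64   # paging starts below this
    desfree = lotsfree // 2          # swapper's panic button
    minfree = desfree // 2           # unconditional swapping
    return lotsfree, desfree, minfree

# e.g. 512 MB of 8 KB pages = 65536 pages:
print(vm_thresholds(65536))  # (1024, 512, 256)
```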
VM Thresholds (cont.)
· cachefree
- Solaris 2.6 (patch 105181-10) and Solaris 7
  (not Solaris >= 8)
- lotsfree * 2
- page scanner looks for unused pages that are not
claimed by executables (file system buffer cache pages)
· cachefree > lotsfree > desfree > minfree
- Strict ordering
- lotsfree-desfree gap should be big enough for a typical process creation or malloc(3) request.
If priority_paging=1 is set in /etc/system then cachefree is set to twice
lotsfree (otherwise cachefree == lotsfree), and slowscan moves to
cachefree (see next slide).
Desktop performance increases of 10% to 300% have been reported.
It is not clear whether it helps servers; that depends on the type, and it is typically not good for file servers.
VM Thresholds in action
[Figure: scan rate vs. free memory - the rate ramps linearly from slowscan (100) at lotsfree/cachefree up to fastscan (8192) as free memory falls toward minfree; thresholds minfree, desfree, lotsfree and cachefree marked at 4MB, 8MB, 16MB and 32MB]
minfree is needed to allow "emergency" allocation of kernel data structures such
as socket descriptors, stacks for new threads, or new memory/VM system
structures. If you dip below minfree, you may find you can't open up new
sockets (and you'll see EAGAIN errors at user level).
The speed at which you crash through lotsfree toward minfree is driven by the
demand for memory. The faster you consume memory, the more headroom you
need above minfree to allow the system to absorb the new demand.
Solaris >= 2.6
fastscan = min( ½ mem, 64 MB)
slowscan = min( 1/20 mem, 100 pages)
handspreadpages = fastscan
Therefore all of memory is scanned in 2 (20) seconds at fastscan (slowscan) and
an application has 1 (10) seconds to reference a page before it will be put on the
free list [for a 128MB machine, like they still exist...]
Sweep Times
· Time required to scan all of memory
  · physmem/fastscan lower bound
  · physmem/slowscan upper bound
· Shortest window for pages to be touched
  · handspreadpages/fastscan
· Application-dependent tuning
- Increase handspread, especially on large memory machines
- Match LRU window (coarsely) to transaction duration
As an example of an upper bound on the scanning time: consider slowscan at
100 pages/second, and a 640M machine with a 4K pagesize. That's 160K pages,
meaning a full memory scan will take 1600 seconds. Crank up the value of
fastscan to reduce the round-trip scanning time if required.
The output of vmstat -S shows you how many "revolutions" the clock hands
have made. If you find the system spinning the clock hands you may be
working too hard to free too few pages of memory.
Some tuning may help for large, scientific applications that have peculiar or
very well-understood memory traffic patterns. Sequential access, for example,
benefits from faster "free behind"
Servers (systems doing lots of filesystem I/O) should set fastscan large (131072
[8KB] pages = 1GB/second)
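The sweep-time bounds above are simple arithmetic; a sketch using the 640 MB / 4 KB-page example from the notes:

```python
# Sweep-time bounds from the slide: physmem/fastscan is the lower bound,
# physmem/slowscan the upper bound (both in seconds).

def sweep_bounds(physmem_pages, fastscan, slowscan):
    return physmem_pages / fastscan, physmem_pages / slowscan

# 640 MB machine with 4 KB pages -> 163,840 pages (the "160K pages" above)
pages = 640 * 1024 // 4
lower, upper = sweep_bounds(pages, fastscan=8192, slowscan=100)
print(lower, round(upper))  # 20.0 1638 -- i.e. the ~1600 seconds in the notes
```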
Activity Symptoms
· Scan rate (sr), free rate (fr)
- Progress made by pagedaemon
· Pageouts (po)
- Page kicked out of memory pool, file write
· Pagein (pi)
- Page fault, filled from text/swap, file read
· Reclaim (re)
- Waiting to go to disk, brought back
· Attach (at)
- Found page already in cache (shared libraries)
If you see the scan rate (sr) and the free rate (fr) about equal, this means the
virtual memory system is releasing pages as fast as it's scanning them. Most
probably, the least-recently used algorithm has degenerated into "last scanned",
meaning that tuning the handspread or the scan rates may improve the page
selection process.
VM Problems
· Basic indicator: scanning and freeing
- Page in/out could be filesystem activity
· Swapping
- Large memory processes thrashing?
· Attaches/reclaims
- open/read/close loops on same file
· Kernel memory exhaustion
- sar -k 1 to observe
- lotsfree too close to minfree
- Will drop packets or cause malloc() failures
Chris Drake and Kimberley Brown's "Panic!" is a great reference, including a
host of kernel monitoring and sampling scripts.
Other Tunables
· maxpgio
- # swap disks * 40 (Solaris <= 2.6)
- # swap disks * 60 (Solaris == 9)
- # swap disks * 90 (Solaris >= 10)
· maxslp
- Solaris < 2.6
  · Deadwood timer: 20 seconds
  · Set to 0xff to disable pre-emptive swapping
- Solaris >= 2.6
  · swap out processes sleeping for more than maxslp seconds (20) if avefree < desfree
Tuning these values produces the best returns for your effort.
maxpgio (assumes one operation per revolution * 2/3)
# swap disks * 40 for 3,600 RPM disks
# swap disks * 80 for 7,200 RPM disks
# swap disks * 110 for 10,000 RPM disks
# swap disks * 167 for 15,000 RPM disks
[ 2/3 of the revolutions/second]
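The rule of thumb above (2/3 of one operation per revolution, per swap disk) can be written out directly; the small differences from the slide's 110/167 figures are just rounding:

```python
# maxpgio rule of thumb: swap disks * revolutions/sec * 2/3

def maxpgio(swap_disks, rpm):
    return int(swap_disks * (rpm / 60) * 2 / 3)

for rpm in (3600, 7200, 10000, 15000):
    print(rpm, maxpgio(1, rpm))
# 3600 -> 40, 7200 -> 80, 10000 -> 111 (~110), 15000 -> 166 (~167)
```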
maxslp added meaning between Solaris 2.5.1 and 2.6, it is also used as the
amount of time that a process must be swapped out before being considered a
candidate to be swapped back in, in low memory conditions.
VM Diagnostics
· Add memory for fileservers
- Improve file cache hit rate
- Calculate expected/physical I/O rates
· Add memory for DBMS servers
- Configure DBMS to use it in cache
- Watch DBMS statistics for use/thrashing
- 100-300M is typical high water mark
· Add memory to eliminate scanning
Memory Mapped Files
· mmap() maps open file into address space
- Replaces open(), malloc(), read() cycles
- Improves memory profile for read-only data
- Used for text segments and shared data segments
· Mapped files pages to underlying filesystem
- Text segments paged from NFS server?
- Data files paged over network from server?
· When network performance matters...
- Use shared memory segments, paged locally
NFS-mounted executables produce sometimes unwanted effects due to the way
mmap() works over the network. When you start a Unix process (in SunOS 4.x,
or any SystemV.4/Solaris system), the executable is mapped into memory using
mmap() -- not copied into memory as in earlier BSD days. Once the executable
pages are loaded, you won't notice much difference, but if you free the pages
containing the text segment (due to paging/swapping), you're going to re-read
the data over the wire, not from the local swap device.
New VM System (Solaris >= 8)
· Page scanner is a bottleneck for the future
- new hardware supports > 512GB of memory (16M-64M pages to scan!)
· File system pressure on the VM
- high filesystem load depletes free memory list
- resulting high scan rates makes applications suffer from excessive page steals
- A server with heavy I/O pages against itself!
· Priority paging (new scanner) is not enough
· Cyclic Page Cache is the current answer
- separate pool for regular file pages
- fs flush daemon becomes fs cache daemon
Disk I/O
Section 2.3
Disk Activity
· Paging and swapping
- Memory shortfalls
· Database requests
- Lookups, log writes, index operations
· Fileserver activity
- Read, read-ahead, write requests
Disk Problems
· Unbalanced activity
- "Hot spot" contention
· Unnecessary activity
- Hit disk instead of memory
· Disks and networks are sources of greatest
gains in tuning
Diagnosing Disk Problems
· iostat -D: disk ops/second
  % iostat -D 5
          sd0              sd1
  rps wps util     rps wps util
    8   0 22.0      40   0 90.0
- Look for excessive number of ops/disk
- Unbalanced across disks?
· iostat -x: service time (svc_t)
- Long service times (>100 msec) imply queues
- Similar to disk imbalance
- Could be disk overload (20-40 msec)
The typical seek/rotational delays on a disk are 8-15 msec. A typical transfer
takes about 20 msec. If the disk service times are consistently around 20 msec,
the disk is almost always busy. When the service times go over 20 msec, it
means that requests are piling up on the spindle: an average service time of 50
msec means that the queue is about 2.5 requests (50/20) long.
Note that for low I/O volumes, the service times are likely to be inaccurate and
on the high side. Use the service times as a thermometer for disk activity when
you're seeing a steady 10 I/O operations (iops) per second or more.
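Given the ~20 msec per-operation figure from these notes, an observed service time translates into a rough queue-length estimate:

```python
# Rough queue length behind a spindle, assuming each physical operation
# takes about 20 msec (seek + rotation + transfer).

def queue_length(svc_t_ms, per_op_ms=20):
    return svc_t_ms / per_op_ms

print(queue_length(50))   # 2.5 requests, as in the 50-msec example above
print(queue_length(200))  # 10.0 -- a badly overloaded spindle
```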
Disk Basics
· Physical things
· Disk performance
- sequential transfer rate
  · 5-40 MBytes/s
  · Theoretical max: nsect * 512 * rpm / 60
- 50-100 operations/s random access
- 6-12 msec seek, 3-6 msec rotational delay
- Track-to-track versus long seeks
· Seek/rotational delays
- Access inefficiencies
While nsect * 512 * rpm tells you how fast the spinning disk platter can deliver
data, it's not completely accurate for the zone-bit recorded (ZBR) disks that are
common today. ZBR SCSI disks only fudge the nsect value in the disk
description, providing an average number of sectors per cylinder. In reality, the
first 70% of the disk is quite fast and the last 30% has a lower transfer rate.
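The slide's theoretical maximum (nsect * 512 * rpm / 60) is easy to evaluate; the 400 sectors/track used below is a made-up average for illustration, since ZBR disks only report an averaged nsect:

```python
# Theoretical peak media rate: sectors/track * 512 bytes * revolutions/sec.

def peak_transfer_mb_s(nsect, rpm):
    return nsect * 512 * (rpm / 60) / 1e6   # decimal MB/s

# Hypothetical 10,000 RPM disk averaging 400 sectors/track:
print(round(peak_transfer_mb_s(400, 10000), 1))  # 34.1
```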
SCSI Bus Basics
· SCSI 1 (5MHz clock rate)
- 8-bit, 16-bit (wide), or 32-bit (fat)
- Synchronous operation yields 5 Mbyte/sec
· SCSI 2 - Fast (10MHz clock rate)
- 10 Mbytes/s with 8-bit bus
- 20 Mbytes/s with 16-bit (wide) bus
· Ultra (20MHz clock rate)
- Ultra/wide = 40MB/sec
· Ultra 2 (40MHz clock rate)
· Ultra 3 (80MHz clock rate)
If devices from different standards exist on the same SCSI bus then the clock rate
of all devices is the clock rate of the slowest device.
Ultra 3 is sometimes called Ultra 160.
SCSI Cabling Basics
· Single Ended
- 6m for SCSI 1
- 3m for SCSI 2
- 3m for Ultra up to 4 devices
- 1.5m for Ultra > 4 devices
· Differential
- 25m cabling
· Low Voltage Differential (LVD)
- 12m cabling
- used by Ultra 2 and 3
Differential signaling is used to suppress noise over long distances. If you ask a
friend to signal you with a lantern, it's easy to distinguish high (1) from low (0).
If the friend is now standing on a boat, which introduces noise (waves), it's
much harder to differentiate high and low. Instead, give your friend two
lanterns, and define "high" as "lanterns apart" and "low" as "lanterns together".
The noise affects both lanterns, but measuring the difference between them
removes the noise from the resulting signal.
If Single Ended and LVD exist on the same bus then the cabling lengths are the
minimum of the two.
Fibre Channel and iSCSI
· Industry standard at the frame level
- FC-AL: fiber channel arbitrated loop
- 100 Mbytes/sec typical
- Use switches and daisy chains to build storage networks
· Vendors layer SCSI protocol on top
- SCSI disk physics still apply
- But you can pack a lot of disks on the fiber
· Ditto iSCSI over GigE
The I/O Bottleneck
· When can't a 72GB disk hold a 500MB DB?
- When you need more than 100 I/Os per second
· How do you get > 40MByte/s file access?
- Gang disks together to "add" transfer rates
· Key info nugget #1: Access pattern
- Sequential or random, read-only or read-write
· Key info nugget #2: Access size
- 2K-8K DBMS, but varies widely
- 8K NFS v2, 32K NFS v3
- 4K-64K filesystem
Realize that when you're bound by random I/O rates, you're not moving that
much data -- the bottleneck is the physical disk arm moving across the disk
surface.
At 100 I/O operations/sec, and 8 KBytes/operations, a SCSI disk moves only
800 KBytes/sec at maximum random I/O load.
The same disk will source 40 MBytes/sec in sequential access mode, where the
disk speed and interface are the limiting factors.
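The contrast in these notes is worth making concrete (using the 100 ops/sec and 40 MB/s figures above):

```python
# Random vs. sequential throughput for the same disk.

ops_per_sec = 100    # random I/O limit for one spindle
io_size_kb = 8       # 8 KB per operation

random_kb_s = ops_per_sec * io_size_kb   # 800 KB/s at full random load
sequential_kb_s = 40 * 1024              # 40 MB/s when streaming

print(random_kb_s)                       # 800
print(sequential_kb_s // random_kb_s)    # 51 -- sequential is ~50x faster
```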
Disk Striping
· Combine multiple disks into single logical disk
with new properties
- Better transfer rate
- Better average seek time
- Large capacity
· Terminology
- Block size: chunk of data on each disk in stripe
- Interleave: number of disks in stripe
- Stripe size: block size * interleave
Volume Management
· Striping done at physical (raw) level
- Run raw access processes on stripe (DBMS)
- Can build filesystem on volume, after striping
- Host (SW) or disk array (HW) solutions
· Some DBMSs do striping internally
· Bottleneck: multiple writes
- Stripe over multiple controllers, SCSI busses
Striping For Sequential I/O
· Each request hits all disks in parallel
· Add transfer rates to "lock heads"
· Block size = access size/interleave
· Examples:
- 64K filesystem access, 4 disks, 16K/disk
- 8K filesystem access, 8 disks, 1K/disk
· Can get 3-3.5x single disk
- On a 4-6 way stripe
Striping For Random I/O
· Each request should hit a different disk
· Random requests use all disks
- Force scattering of I/O
· Reduce average seek time with "independent
heads"
· Block size = access size
· Examples:
- 8K NFS access on 6 disks, 48K stripe size
- 2K DBMS access on 4 disks, 8K stripe size
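The two sizing rules (sequential on the previous slide, random on this one) can be sketched side by side; the examples reproduce the ones on the slides:

```python
# Stripe sizing rules of thumb from the slides.

def sequential_block_size(access_kb, ndisks):
    # Sequential: every request spans all disks, so
    # block size = access size / interleave.
    return access_kb / ndisks

def random_stripe_size(access_kb, ndisks):
    # Random: each request should land on one disk, so block size =
    # access size and stripe size = block size * interleave.
    return access_kb * ndisks

print(sequential_block_size(64, 4))  # 16.0 KB per disk
print(random_stripe_size(8, 6))      # 48 KB stripe for 8K NFS on 6 disks
```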
Transaction Modeling
· Types: read, write, modify, insert
· Meta data structure impact
- Filesystem structures: inodes, cylinder groups, indirect blocks
- Logs and indexes for DBMS
Insert operation is R-M-W on index, W on data, W on log
Insert/update on DBMS touches data, index, log
Cache Effects
· Not every logical write I/O hits disk
- DB write clustering
- NFS, UFS dirty page clustering
- Hardware arrays may cache operations
· Reads can be cached
- DB page/block cache (Oracle SGA, e.g.)
- File/data caching in memory
· Locality of reference
- Cache can help or hurt performance
Simple DBMS Example
· Medium sized database on a busy day
- 200 users, 8 Gbyte database, 1 request/10 sec
- 50% updates, 20% inserts, 30% lookups, 4 tables, 1 index on each
· Disk I/O rate calculation
- .5 * 4/U + .2 * 3/I + .3 * 2/L = 3.2 I/O per table
- 12.8 I/O per transaction, ~10 with caching?
· Arrival rate
- 200 users * 1 / 10 secs = 20/sec
- Demand: 200 I/Os/sec, peak to 220 or more
The sample disk I/O rates are derived as follows:
Updates have to do a read, an update to an index and an update to a data block,
as well as a log write (4 transactions)
Inserts do an index and data block write, and a log write (3 transactions)
Lookups read from the index and data blocks (2 transactions)
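The slide's arithmetic, reproduced step by step:

```python
# Disk I/O demand for the sample database (mix and figures from the slide).

per_table = 0.5 * 4 + 0.2 * 3 + 0.3 * 2   # updates, inserts, lookups
per_txn = per_table * 4                    # 4 tables touched
arrival = 200 * (1 / 10)                   # 200 users, 1 request/10 sec

print(round(per_table, 1))        # 3.2 I/Os per table
print(round(per_txn, 1))          # 12.8 I/Os per transaction
print(round(per_txn * arrival))   # 256 raw; ~200/sec if caching trims
                                  # each transaction to ~10 I/Os
```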
Haste Needs Waste
· Using a single disk is a disaster
- Disk can only do 50-60 ops/s, response time ~10 sec
· 4 disks barely do the job
- Provides 200-240 I/Os/sec
- DBMS uses 90% of I/O rate capacity
· 6 disks would be better
- Waste most of the available space
Filesystem Optimization
Section 2.4
UNIX Filesystem
· Filesystem construction
- Each file identified by inode
- Inode holds permissions, modification/access times
- Points to 12 direct (data) blocks and indirect blocks
· Indirect block contains block pointers to data
blocks
· Double indirect blocks contain pointers to
blocks that contain pointers to data blocks
UFS Inode
[Diagram: inode (mode, time, owners, etc.) pointing to 12 direct data blocks, to an indirect block of 2048 slots of data-block pointers, and to a double indirect block whose 2048 slots each point to an indirect block of 2048 data-block slots]
- Direct blocks up to 100 KBytes
- Indirect blocks up to 100 MBytes
- Double indirect blocks up to 1 TByte
Filesystem Mechanics
· Inodes are small and of fixed size
- Fast access, easy to find
· File writes flushed every 30 seconds
- Sync or update daemon
- UNIX writes are asynchronous to process
- Watch for large bursts of disk activity
· Filesystem metadata
- Create redundancy for repair after crash
- Cylinder groups, free lists, inode pointers
- fsck: scan every block for "rollback"
The fact that write() doesn't complete synchronously can cause bizarre failures.
Most code doesn't check the value of errno after a close(), but it should. Any
outstanding writes are completed synchronously when close() is called.
If any errors occurred during those writes, the error is reported back through
close(). This can cause a variety of problems when quotas are exceeded or disks
fill up (over NFS, where the server notices the disk full condition).
More details: SunWorld Online, System Administration, October 1995
http://www.sun.com/sunworldonline
The BSD Fast Filesystem
· Original UNIX filesystem
- All inodes at the beginning of the disk
- open() followed by read() always seeks
· BSD FFS improvements
- Cylinder groups keep inodes and data together
- Block placement strategy minimizes rotational delays
- Inode/cylinder group ratio governs file density
· Minfree: default 10%, safe to use 1% on 1+G
disks
McKusick, Leffler, Quaterman and Karels, "Design and Implementation of the
4.3 BSD Operating System"
mkfs and newfs always look at the # bytes per inode parameter (fixed). To
change the inode density, you need to change the number of cylinders in a
group by adjusting the number of sectors/track:
Filesystems for large files (like CAD parts files) usually have more bytes per
inode; filesystems for newsgroups should have fewer bytes per inode (with the
exception of the filesystem for alt.binaries.*)
Fragmentation & Seeks
· Fragments occur in last block of file
- Frequently less than 1% internal fragmentation
- 10% free space reduces external fragmentation
· Block placement strategy breaks down
- Avoid filling disk to > 90-95% of capacity
- Introduces rotational delays
· File ordering affects performance
- Seeking across large disk for related files
Large Files
· Reading
- Read inode, indirect block, double indirect block, data block
- Sequential access should do read-ahead
· Writing
- Update inode, (double) indirect, data blocks
- Can be up to 4 read-modify-write operations
· Large buffer sizes are more efficient
- Single access for "window" of metadata
Turbocharging Tricks
· Striping
· Journaling (logging)
- Write meta data updates to log, like DBMS
- Eliminate fsck, simply replay log
- Ideal for synchronous writes, large files
- logging option (Solaris >= 7)
· Extents
- Cluster blocks together and do read-ahead
- Eliminate more rotational delays
- Can add 2-3x performance improvement
McVoy and Kleiman, "Extent-like Performance From The UNIX Filesystem",
Winter USENIX Proceedings, 1991.
Linux also has the ext2 filesystem, which has different block allocation and
placement policies.
Journaling and logging are often used interchangeably. Logging filesystems and
log-based filesystems, however, are not the same thing. A logging filesystem
bolts a log device onto the UNIX filesystem to accelerate writes and recovery. A
log-based filesystem is a new (non-BSD FFS) structure, based on a log of write
records. There is a long and exacting description of the differences in Margo
Seltzer's PhD thesis from UC-Berkeley.
Access Patterns
· Watch actual processes at work
- What are they doing?
  · nfswatch: NFS operations on the wire
  · truss (Solaris, SysV.4), strace (Linux, HPUX), ktrace (*BSD)
  · dtrace (Solaris >= 10)
· Application write size should match filesystem
block size.
· Use a Filesystem benchmark
- Are the disks well balanced, is the filesystem well tuned?
  · filebench, bonnie
More details on using these tools: SunWorld Online, System Administration,
September 1995
http://www.sun.com/sunworldonline
Don't use process tracing for performance-sensitive issues, because turning on
system call trapping (used by the strace/truss facility) slows the process down
to a snail's pace.
Solaris DTrace (Solaris >= 10) is more lightweight.
Bonnie (http://www.textuality.com/bonnie) is a good all-round Unix
filesystem benchmark tool.
Filebench is an extensible system that simulates many different types of workloads:
http://sourceforge.net/projects/filebench/
http://www.opensolaris.org/os/community/performance/filebench/
Resource Optimization
· Optimize disk volumes by type of work
- Sequential versus random access filesystems
- Read-only versus read-write data
· Eliminate synchronous writes
- File locking or semaphores more efficient
- Journaling filesystem faster
· Watch use of symbolic links
- Often causes disk read to get link target
· Don't update the file access time for read-only
volumes
Don't update the file access time (for news and mail spools, etc.)
-o noatime
Delay updating file access time (Solaris >= 9)
-o dfratime
Non-Volatile Memory
· Battery backed memory
- RAM in disk array controller (array cache)
- disk cache
· Synchronous writes at memory write speed
Inode Cache
· Inode cache for metadata only
- Data blocks cached in VM or buffer pool
· Buffer pool for inode transit
  · vmstat -b
  · sar -b 5 10
- Watch %rcache (read cache) hit rate
- Lower rate means more disk I/O for inodes
· Set high water mark
  · set bufhwm=8000 in Solaris /etc/system
Directory Name Lookup Cache
· Name to inode mapping cache
· Must be large for file/user server
- Low hit rate causes disk I/O to read directories
· vmstat -S to observe
- Aim for > 90% hit rate
· Causes of low hit rates:
- File creation automatically misses
- Names > {14,32} characters not inserted
  · Long names not efficient
Solaris >= 2.6
- uses the ncsize parameter to set the DNLC size.
- handles long filenames in the DNLC
Solaris >= 8
- can use the kstat -n dnlcstats command to determine how well the DNLC is doing
Filesystem Replication
· Replicate popular read-only data
- Automounter or "workgroups" to segregate access
- Define update and distribution policies
· 200 coders chasing 4 class libraries
- Replicate libraries to increase bandwidth
· Hard to synchronize writeable data
- Similar to DBMS 2-phase commit problem
- Andrew filesystem (AFS)
- Stratus/ISIS rNFS, Uniq UPFS from Veritas
The ISIS Reliable NFS product is now owned by Stratus Computer,
Marlborough MA
Uniq Consulting Corp has a similar product that does N-way mirroring of NFS
volumes. Contact Kevin Sheehan at [email protected], or your local Veritas
sales rep, since Veritas is now reselling (and supporting) this product
Depth vs. Breadth
· Avoid large files if possible
- Break large files into smaller chunks
- Don't backup a 200M file for a 3-byte change
- Files > 100M require multiple I/Os per operation
· Directory search is linear
- Avoid broad directories
· Name lookup is per-component
- Avoid deep directories
- Use hash tables if required
Tuning Flush Rates
· Dirty buffers flushed every 30 seconds
- Causes large disk I/O burst
- May overload single disk
· Balance load if requests < 30s apart
- Generic update daemon:
  while :; do sync; sync; sleep 15; done
· Solaris tunables
- autoup: time to cover all memory
- tune_t_fsflushr: rate to flush
autoup is the oldest a dirty buffer can get before it is flushed. tune_t_fsflushr is
the rate at which the sync daemon is run; it defaults to 30 seconds.
All of memory will be covered in autoup seconds.
tune_t_fsflushr/autoup is the fraction of memory covered by each pass of the update daemon.
Increase autoup, or cut the flush rate, to space out the bursts
Extremely large disk service times (in excess of 100 msec) can be caused by large
bursts from the flush daemon causing a long disk queue. If the filesystem flush
sends 20 requests to a single disk, it's likely there will be some seeking between
writes, so the 20 requests will average 20 msec each to complete. Since all disk
requests are scheduled in a single pass by fsflush, the service time for the last
one will be nearly 400 msec, while the first few will finish in around 20 msec,
yielding an average service time of 200 msec!
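The 200 msec figure follows from a simple model of the burst (20 queued writes at roughly 20 msec each, served in order):

```python
# Average service time for a burst of flush writes queued on one disk.

per_op_ms = 20   # per-write cost, including some seeking
burst = 20       # requests issued in one fsflush pass

# The i-th request completes after i operations' worth of time.
completion_ms = [per_op_ms * i for i in range(1, burst + 1)]

print(completion_ms[-1])           # 400 -- the last request's wait
print(sum(completion_ms) / burst)  # 210.0 -- the ~200 msec average
```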
Backups & Redundancy
Section 2.5
Questions of Integrity
· Backups are total loss insurance
- Lose a disk
- Lose a brain: egregious rm *
· Disk integrity is inter-backup insurance
- Preserve data from high-update environment
- Time to restore backup is unacceptable
- Doesn't help with intra-day deletes
· Disaster recovery is a separate field
Disk Redundancy
· Snapshots
- Copy data to another disk or machine
- tar, dump, rdist, rsync
- Prone to failure, network load problems
· Disk mirroring (RAID 1)
- Highest level of reliability and cost
- Some small performance gains
· RAID arrays (RAID 5 and others)
- Cost/performance issues
- VLDB byte capacity
RAID = Redundant Array of Inexpensive Disks.
When the RAID levels were created (at UC-Berkeley), the popular disk format
was SMD (as in Storage Modular Device, not Surface Mounted Device).
10" platters weighed nearly 100 pounds and held 500 MB, while SCSI disks
topped out at 70 MB but cost significantly less (and were easier to lift and install)
RAID 1: Mirrored Disks
· 100% data redundancy
- Safest, most reliable
- Historically rejected due to disk count, cost
· Best performance (of all RAID types)
- Round-robin or geometric reads: like striping
- Writes at 5-10% hit
· Combine mirroring and striping
- Stripe mirrors (1+0) to survive interleave failures
- Mirror stripes (0+1) for safety with minimal overhead
RAID 0 = striping
Few systems can do 1+0
1+0 allows multi-disk failures as long as at least one mirror disk per stripe
survives.
RAID 5: Parity Disks
· Stripe parity and data over disks
· No single "parity hot spot"
· Performance degrades with more writes
- R-M-W on parity disk cuts 60%
- Similar to single-disk for reads
· Ideal for large DSS/DW databases
- If size >> performance, RAID 5 wins
- Best path to large, safe disk farm
· 20-40% cost savings
RAID 5 Tuning
· Tunables
- Array width (interleave) - sometimes
- Block size - required
· Count parity operations in I/O demand
- Read = 1 I/O
- Write = 4 I/O
· Ensure parity data is not a bottleneck
- Parity disk reads and writes count against the same (total) 50-60 I/Os/second per-spindle limit
RAID 5 write:
- read original block
- read parity block
- xor original block with parity block
- xor new block with parity block
- write new block
- write parity block
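Counting parity operations as the slide prescribes (read = 1 physical I/O, write = 4), the back-end demand on the array is easy to estimate; the 60/20 mix below is illustrative:

```python
# RAID 5 back-end demand: each logical write becomes 4 physical I/Os
# (read data, read parity, write data, write parity).

def raid5_physical_iops(reads_per_sec, writes_per_sec):
    return reads_per_sec * 1 + writes_per_sec * 4

# An illustrative 60 reads/sec + 20 writes/sec of logical load:
print(raid5_physical_iops(60, 20))  # 140 physical I/Os/sec on the array
```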
Backup Performance
· Derive rough transfer rate needs
- 100 GB/hour = 30 MB/second
- 5MB/s for DLT, 10MB/s for SDLT
- 15MB/s for LTO, 35MB/s for LTO-2, 60MB/s for LTO-3
- 6MB/s for AIT, 24MB/s for AIT-4
- 80MB/sec over quiet Ethernet (GigE)
· Multiple devices increase transfer rate
- Stackers grow volume
- Drives increase bulk transfer rate
- Careful of “shoe shining”
When designing the backup system, also take into consideration the length of
time you must keep the data around. Some industries, such as financial
services, require at least a 7 year history of data for SEC or arbitration hearings.
Drug companies and health-care firms must keep patient data near-line until the
patient dies; if a drug pre-dated a patient by 5 years then you're looking at the
better part of a century.
Media types in vogue today decay. Magnetic media loses its bits; CD-ROMs
may decay after a long storage period. How will you read your backups in the
future? If you've struggled with 1600bpi tapes lately you know the feeling of
having data in your hand that's not convertible into on-line form.
Final warning: dump isn't portable! If you change vendors, make sure you can
dump and reload your data.
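Deriving the rough transfer-rate need is the same arithmetic as the slide's 100 GB/hour example; the drive speed used below is the LTO figure listed above:

```python
import math

# Aggregate transfer rate needed to finish a backup within a window.

def required_mb_s(total_gb, window_hours):
    return total_gb * 1024 / (window_hours * 3600)

need = required_mb_s(100, 1)   # the slide's 100 GB/hour case
print(round(need, 1))          # 28.4 -- i.e. the "30 MB/second" rule of thumb
print(math.ceil(need / 15))    # 2 LTO drives at 15 MB/s to keep up
```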
Backup to Disk
· Rdiff-backup
- incremental backups to disk with easy restore
· BackupPC
- incremental and full backups to disk with a web front end for scheduling and restores
- good for backing up MS Windows clients to a Unix server
· Snapshots
· Offsite replicas
Does this sound familiar?
- Secure the individual systems
- Run aggressive password checkers
- Restrict NFS, or use NFS with Kerberos or DCE/DFS to encrypt file access
- Prevent network logins in the clear (use ssh)
- BUT: do backups over the network! Exposing the data over the network
during the backup un-does much of the effort in the other precautions.
NFS Performance Tuning
Section 3
Topics
· NFS internals
· Diagnosing server problems
· Client improvements
- Client-side caching & tuning
- NFS over WANs
NFS Internals
Section 3.1
NFS Request Execution
stat()
getattr()
nfs_getattr()
kernel RPC
port 2049 hardcoded
nfsd
getattr()
UFS stat()
HSFS stat()
% ls -l
NFS Characterization
· Small and large operations
- Small: getattr, lookup, readdir
- Large: read/write, create, readdir
· Response time matters
- Clamped at 50 msec for "reasonable" server
- Users notice 20 msec to 50 msec dropoff
· Scalability is still a concern
- Usually network limited, hard to reach capacity
- Flat response time is best measure
- Client-side demand management
NFS over TCP
· NFS/TCP is a win for:
- Wide-area networks, with higher bit error rates
- Routed networks
- Data-transfer oriented environments
- Large MTU networks, like GigE with jumbo frames
· Advantages
- Better error recovery, without complete retransmit
- Fewer retransmissions and duplicate requests
· Disadvantage
- Connection setup at mount time
NFS Version 3
· Improved cache consistency
- Attributes returned with most calls
- "access" RPC mimics permission checking of local system open() call
· Improved correctness with NFS/TCP
· Performance enhancements
- Asynchronous write operations, with logging
- Larger buffer sizes, up to 32KBytes
NFS v3 uses a 64-byte (not bit) file handle, with the actual size used per mount
negotiated between the client and server.
Diagnosing Server Problems
Section 3.2
Indicators
· Usual server tuning applies
- Don't worry about CPU utilization
- Client response time is early warning system
- Some NFS specific details
· Server isn't always the limiting factor
- Typical Ethernet supports 300-350 LADDIS ops
- To get 2,000 LADDIS: 7-8 Ethernets
LADDIS stands for Legato, Auspex, Digital, Data General, Interphase and Sun,
the 6 companies that helped produce the SPEC standard for NFS benchmarks.
LADDIS is now formally known as SPEC NFS and is reported as a number of
ops/sec, at 50 msec response time or less.
More info: [email protected]
Keith, Bruce. LADDIS: The Next Generation in NFS Server Benchmarking. spec
newsletter. March 1993. Volume 5, Issue 1.
Watson, Andy, et.al. LADDIS: A Multi-Vendor and Vendor-Neutral SPEC
NFS Benchmark. Proceedings of the LISA VI Conference, October 1992. pp. 17-
32.
Wittle, Mark, and Bruce Keith. LADDIS: The Next Generation in NFS File
Server Benchmarking. Proceedings of the Summer 1993 USENIX Conference.
July 1993. pp. 111-128.
Request Queue Depth
· nfsd daemons/threads
- One request per nfsd daemon
- Lack of nfsds makes server drop requests
- May show up as UDP socket overflows (netstat -s)
· Guidelines
- Daemons: 24-32 per server, more for many disks
- Kernel threads (Solaris): 500-2000
  · No penalty for setting this high
- Add more for browsing environment
Attribute Hammering
· Use nfsstat -s to view server statistics
· getattr > 40%
- Increase client attribute cache lifetime
- Consolidate read-only filesystems
· readlink > 5%
- Replace links with mount points
· writes > 5%
- NVRAM situation
% nfsstat -s
null getattr setattr root lookup readlink read
32 0% 527178 33% 9288 0% 0 0% 449726 28% 189466 12% 188665 15%
wrcache write create remove rename link symlink
0 0% 134797 8% 13799 0% 15826 1% 2725 0% 4388 0% 74 0%
mkdir rmdir readdir fsstat
1575 0% 1532 0% 23898 1% 242 0%
On an NFS V3 client, you'll see entries for cached writes, access calls, and other
extended RPC types.
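The 40% rule can be checked mechanically; a minimal sketch using the counts from the sample output above (total is the sum of all operation counts shown):

```shell
#!/bin/sh
# getattr share of total NFS server calls, from the nfsstat -s sample
getattr=527178
total=1563211          # sum of all operation counts in the sample
pct=$(( getattr * 100 / total ))
if [ "$pct" -gt 40 ]; then
    echo "attribute hammering (${pct}%): raise client actimeo"
else
    echo "getattr at ${pct}% - below the 40% threshold"
fi
```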
Transfer Oriented Environments
· Ensure adequate CPU
- 1 CPU per 3-4 100BaseT Ethernets
- 1 CPU per 1.5 ATM networks at 155 Mb/s
- 1 CPU per 1000BaseT Ethernet (GigE)
· Disk balancing is critical
- Optimize for random I/O workload
· Large memory may not help
- What is working set/file lifecycle?
Client Improvements
Section 3.3
Client Tuning Overview
· Eliminate end to end problems
- Request timeouts are call to action
- 700 msec timeout versus 50 msec "pain" level
· Reduce demand with improved caching
· Adjust for line speed
- Slower-than-Ethernet links
- Uncontrollable congestion
- Routers or multiple hops
· Application tuning rules apply
Client Retransmission (UDP only)
· Unanswered RPC request is retransmitted
- Repeated forever for hard mounts
- Up to 5 times (retrans) for soft mounts
· What can go wrong?
- Severe congestion (storms)
- Server dropping requests/packets
- Network losing requests or sections of them
- One lost packet kills entire request
Measuring Client Performance
· Client-side performance is what user sees
· nfsstat -m OK for NFS over UDP
- Shows average service time for lookup, read and write requests
· iostat -n with extended service times
· NFS over TCP harder to measure
- Stream-oriented, difficult to match requests and replies
- tcpdump, snoop to match XIDs in NFS header; wireshark (formerly ethereal) does this
Client Impatience (NFS over UDP only)
· Use nfsstat -rc
· timers > 0
- Server slower than expected
- nfsstat -m: expected response time
  request sent: calls++
  120 msec: timers++
  700 msec: retrans++
  1400 msec: retrans++
% nfsstat -rc
Client rpc:
calls badcalls retrans badxid timeout wait newcred timers
224978 487 64 263 549 0 0 696
% nfsstat -m
/home/thud from thud:/home/thud (Addr 192.151.245.13)
Lookups: srtt = 7 (17 ms), dev=4 (20ms), cur=2 (40ms)
Reads: srtt=14 (35 ms), dev=3 (15ms), cur=3 (60ms)
Note that the NFS backoff and retransmit scheme is not used for NFS over TCP,
since TCP's congestion control and restart algorithms properly fit the connection
oriented model of TCP traffic. The NFS mechanism is used for UDP mounts,
and the timers used for adjusting the buffer sizes and transmit intervals are
shown with nfsstat -m. On an NFS/TCP client, the timers will be zero.
· badcalls > 0
· Soft NFS mount failures
· Operation interrupted (application failure)
· Data loss or application failures
· Should never see these
Client's Network View (NFS over UDP only)
· retrans > 5%
- Requests not reaching server or not serviced
· badxid close to 0
- Network is dropping requests
- Reduce rsize, wsize on mount
· badxid > 0
- Duplicate request cache isn't consolidating retransmissions
- Tune server, partition network
Using NFS/TCP or NFS Version 3, you'll be hard-pressed to see badxid counts
above zero. Using TCP, the NFS client doesn't have to retransmit the whole
request, only the part that was lost to the server. As a result, there should rarely
be completely retransmitted requests. NFS V3 implementations also tend to be
more "correct" than V2 implementations, since fewer requests that are not
actually idempotent (like rmdir or remove) are retransmitted.
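The retrans/badxid decision rules above can be condensed into a small function; a sketch, fed the counters from the earlier nfsstat -rc sample:

```shell
#!/bin/sh
# Classify client RPC trouble from nfsstat -rc counters.
diagnose() {
    calls=$1; retrans=$2; badxid=$3
    pct=$(( retrans * 100 / calls ))
    if [ "$pct" -le 5 ]; then
        echo "retrans rate ${pct}% - OK"
    elif [ "$badxid" -eq 0 ]; then
        echo "network dropping requests - reduce rsize/wsize"
    else
        echo "server slow - tune server or partition network"
    fi
}
diagnose 224978 64 263   # counters from the sample above
```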
Client Caches
· Caching critical for performance
- If data exists, don't go over the wire
- Dealing with stale data
· Cached items
- Data pages in memory: default
- Data pages on disk: eNFS, CacheFS
- File attributes: in memory
- Directory attributes: in memory
- DNLC: local name lookup cache
Attribute Caching
· getattr requests can be > 40% of total
- May hit server disk
· Read-only filesystems
- Increase actimeo to 600 or more
- "Slow start" when data really changes
· Rapidly changing filesystem (mail)
- Try noac for no caching
· File locking disables attribute and data caching
When a file is locked on the client system, that client begins to read and write
the file without any buffering. If your application calls
read(fd, buf, 128);
you'll read exactly 128 bytes over the wire from the NFS server, bypassing the
attribute cache and the local memory cache to be sure you fetch the latest copy
of the data from the server.
If file locking and strict ordering of writes are an issue, consider using a
database.
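The actimeo and noac suggestions translate into mount options like these (a sketch only — server names and paths are hypothetical, and option syntax varies slightly by OS):

```shell
# read-mostly software tree: cache attributes for 10 minutes
mount -o ro,actimeo=600 server:/export/tools /tools

# rapidly changing mail spool: disable attribute caching entirely
mount -o noac server:/var/mail /var/mail
```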
CacheFS Tips
· Read-mostly, fixed size working set
- Re-use files after loading into cache
- Write-once files are worst case
- Growing or sparse working set causes thrashing
· Watch size of cache using df
· Multi-homed hosts
- CacheFS creates one cache per host name
- Make client bindings persistent, not random
- Penalty for cold cache less than that for no server
Using CacheFS solves the page-over-the-network problem where a process' text
segment is paged from the NFS server, not from a local disk. When using large
executables (some CAD applications, FORTRAN with many common blocks),
CacheFS may improve paging performance by keeping traffic on the local host.
Buffer Sizing
· Default of 8KB good for Ethernet speeds
- At 56 Kb/s requires > 1 second to transmit
- Remarkably anti-social behavior
- Even worse for NFSv3 (32KB packets)
· Reduce read/write sizes on slow links
- In vfstab, automounter: rsize=1024,wsize=2048
- Match to line speed and other uses
- 256 bytes is the lower limit; readdir breaks with a smaller buffer
Line Speed      rsize       Per-Packet Latency   Time to Read 1 kbyte File
56 kbaud        128 bytes   20 msec              430 msec
56 kbaud        256 bytes   40 msec              310 msec
224 kbaud       256 bytes   10 msec              150 msec
T1 (1.5 Mbit)   1024 bytes  1 msec               42 msec
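The "> 1 second at 56 Kb/s" claim is easy to check: wire time is just buffer bits over line rate. A sketch (latency and protocol overhead ignored):

```shell
#!/bin/sh
# Milliseconds to clock a buffer of `bytes` onto a link of `bits_per_sec`.
wire_ms() {
    bytes=$1; bps=$2
    echo $(( bytes * 8 * 1000 / bps ))
}
wire_ms 8192 56000      # default 8 KB buffer at 56 kbit/s: well over 1 sec
wire_ms 1024 56000      # rsize=1024 on the same link
wire_ms 8192 10000000   # 8 KB on 10 Mbit/s Ethernet
```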
Network Design and Capacity
Planning
Section 4
Topics
· Network protocol operation
· Naming services
· Network topology
· Network latency
· Routing architecture
Network reliability colors end to end performance. If your network is delaying
traffic or losing packets, or if you suffer multiple network hops each with long
latency, you will impact what the user sees. The worst possible example is the
Internet: you get variable response time depending upon how many people are
downloading images, what current events have users flocking to the major sites,
and the time of day/day of the week.
Network Protocol
Operation
Section 4.1
Up & Down Protocol Stacks
· Send side -- write() on socket
- TCP: slow start, segmentation
- IP: locate route/interface, MTU fragmentation
- IP: find MAC address (ARP: get IP mapping, update cache)
- Eth: send packet (collision, backoff/re-xmit)
· Receive side -- read() on socket
- Eth: accept frame
- IP: match local IP? IP re-assembly
- TCP/UDP: valid port? copy into kernel
· Also in play: ICMP, RIP updates to the route tables
Solaris exposes nearly every tunable parameter in the TCP, UDP, IP, ICMP and
ARP protocols using the ndd tool.
Find a description of the tunable parameters and their upper/lower bounds on
Richard Stevens's web page containing the Appendix to his latest TCP/IP books:
http://www.kohala.com/start/tcpipiv1.appe.update1.ps
Also on docs.sun.com at http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters
Solaris 2 - Tuning Your TCP/IP Stack and More
http://www.sean.de/Solaris
Naming Services
Section 4.2
Round-Robin DNS
· Use several servers, in parallel, that have unique
IP addresses
- DNS will return all of the IP addresses in response to queries for www.blahblah.com
· Clients resolving the name get the IP addresses
in round-robin fashion
- When DNS cache entry times out, new one is requested
- Clients will wait up to DNS entry lifetime for a retry
Be sure to set the DNS server entry's Time To Live (TTL) to zero or a few
seconds, so that successive requests for the IP address of the named host get
fresh DNS entries.
Name Servers that do Round-Robin:
- BIND 8
- djbdns
- lbnamed (true load balancer written in perl)
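As a toy illustration of what resolvers end up seeing, round-robin selection is just modular rotation over the address list (the addresses below are from the TEST-NET range and purely hypothetical):

```shell
#!/bin/sh
# Rotate through a fixed server list the way round-robin DNS hands
# addresses to successive resolvers.
servers="192.0.2.1 192.0.2.2 192.0.2.3"
pick() {
    i=$1
    set -- $servers        # positional params become the server list
    shift $(( i % $# ))
    echo "$1"
}
pick 0; pick 1; pick 2
pick 3    # wraps around to the first server
```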
Round-Robin DNS, cont'd
· The Good
- No real failure management, it "just works"
- Scales very well; just add hardware and mix
- Only 1/N clients affected, on average, for an N-server farm (when one server fails)
· The Bad
- Clients can see minutes of "downtime" as DNS entries expire, if TTL is too long
- Can cheat with multiple A records per host, but not all clients sort them correctly
· The Ugly
- None, if done correctly
IP Redirection

[Diagram: clients resolve www.blah.com to the IP director's public address,
192.9.230.1. The director (shown duplicated for redundancy) forwards requests
over internal networks 192.9.231.0 and 192.9.232.0 to four web servers at
x.x.x.1 through x.x.x.4.]
IP Redirection Mechanics
· Front-end boxes handle IP address resolution
- Public IP address shows up in DNS maps
- Internal (private) IP networks used to distribute load
- Can have multiple networks, with multiple servers
· Improvement over DNS load balancing
- All round-robin choices made at redirector, so client's DNS configuration (or caches) don't matter
- Redirector can be made redundant
- Hosts could be redundant, too
· Cisco NetDirector, Hydra HydraWeb
Network Topology
Section 4.3
Switch Trunking (802.3ad)
· Multiple connections to host from single switch
· Improves input and output bandwidth
- Eliminates impedance mismatch between switch and network connection
- Spread out input load on server side
· Warnings:
- Trunk can be a SPOF
- Assumes switch can handle mux/demux of traffic at peak loads
Solaris requires the SUNWtrku package (Sun Trunking Software)
Latency and Collisions
· Collisions
- CSMA/CD works "late", node backs off, tries again
- Fills Ethernet with malformed frames
· Defers
- CSMA/CD works "early"
- Not counted, but adds to latency
- Collisions "become" defers as more nodes share load
- Use netstat -k (Solaris >= 2.4) or kstat (Solaris >= 7) to see defers and other errors
· 802.11
Dealing With Collisions
· Rate = collisions/output packets
· Collisions counted on transmit only
- Monitor on several hosts, especially busy ones
- Use netstat -i or LANalyzer to observe
- Collision rate can exceed 100% per host
· Thresholds
- Should decrease with number of nodes on net
- >5% is clear warning sign
- Usually 1% is a problem
- Correlate to network bandwidth utilization
Most Ethernet chip drivers understate the collision rate. In addition to only
counting collisions in which the station was an active participant, the chip may
report 0, 1 or "more than 1" collisions. Most driver implementations take "more
than 1" to mean 2, when in fact it could be up to 16 consecutive collisions.
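The rate formula above in shell form; the counters are made-up stand-ins for the Collis and Opkts columns of netstat -i:

```shell
#!/bin/sh
# Collision rate = collisions / output packets (sample counters assumed).
collis=5120
opkts=98000
rate=$(( collis * 100 / opkts ))
echo "collision rate: ${rate}%"
if [ "$rate" -ge 5 ]; then
    echo "warning: clear sign the network needs partitioning"
fi
```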
Collisions and Switches
· Switched Ethernet cannot have collisions (*)
- Each device talks to switch independently
- No shared media worries
· Still get contention at switch under load
- Ability of switch to forward packets to right interface for output
- Ability to handle input under high loads
· Look for dropped/lost packets on switch
- Results in NFS retransmission, RPC failure, NIS timeouts, dismal TCP throughput
Collisions, Really Now
· Full versus Half Duplex
- Full Duplex: each node has a home run and no contention for either path to/from switch
- Half Duplex: you can still see collisions, in rare cases
· What makes switch-host collide?
- Many small packets, in steady streams
- Large segments probably are OK
Switches and Routers
· Bridges, Switches
- Very low latency, single IP network or VLAN
- One input pipe per server
· Routers
- Higher latency, load dependent
- Multiple pipes per server
Switched Architectures
· Switches offer "home run" wiring
- Each station has dedicated, bidirectional port
- Reduce contention for media (collisions = 0)
- Construct virtual LANs on switch, if needed
- "Smooth out" variations in load
- Only broadcast & multicast normally cross between network segments
· Watch for impedance mismatch at switch
- 80 clients @ 100 Mb/s swamps a 2 Gb/s backplane
Network Partitioning
· Switches & bridges for physical partitioning
- Corral traffic on each side of bridge
- Primary goal: reduce contention
· Routing for protocol isolation
- Non-IP traffic (NetWare)
- Broadcast isolation (NetBIOS, vocal applications)
- Non-trusted traffic (use a firewall, too)
- VLAN capability on switches for creating geographically difficult wiring schemes
Network Latency
Section 4.4
Trickle Of Data?
· Serious fragmentation at router or host
· TCP retransmission interval too short
· Real-live network loading problem
· Handshakes not completing quickly
- Nagle algorithm (slow start)
- PCs often get this wrong
- Set tcp_slow_start_initial=2 to send two segments, not just one: dramatically improves web server performance from PC's view
- tcp_slow_start_after_idle=2 as well
"inhibit the sending of new TCP segments when new outgoing data arrives from the user if any previously transmitted data on the connection remains unacknowledged." - John Nagle (RFC 896)
Longer & Fatter Pipes
· Fat networks (ATM, GigE, 10GigE)
- Benefit versus cost trade-offs
- Backbone or desktop connections?
· Longer networks (WAN)
- Guaranteed capacity, grade of service?
- End to end security and integrity?
· Latency versus throughput
- Still 20 msec coast to coast
- GigE jumbo frames >> Ethernet in latency, loses for small packets
Long Fat Networks
[Diagram: a T1 link with 40 msec one-way latency. The sender clocks 4 KB of
data onto the wire in 3 msec, then waits 70+ msec to send more, producing gaps
in the transmit stream. The receiver sees the first bits after 40 msec and the
last bit after 43 msec; it sees gaps in the data and acks as fast as it can.]
Bad TCP/IP implementations retransmit too much: they see the high latency as an
indication that the packet didn't arrive, and retransmit it. Their retransmit
timers are too small.
Tuning For LFNs
· Set the sender and receiver buffer size high
water marks
- Usually an ndd or kernel option, but resist temptation to make "global fix"
- Set using setsockopt() in application to avoid running out of kernel memory
· Buffer depth = 2 * bandwidth * delay product
- or bandwidth * RTT (ping)
- 1.54 Mbit/sec network (T1) with 25 msec delay = 10 KB buffer
- 155 Mbit/sec network (OC3) with 25 msec delay = 1 MB buffer
Solaris:
# increase max tcp window (maximum socket buffer size)
# max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152
# increase default SO_SNDBUF/SO_RCVBUF size.
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536
Linux (>= 2.4):
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
See http://www-didc.lbl.gov/tcp-wan.html and
http://www.psc.edu/networking/perf_tune.html for a longer explanation.
A list of tools to help determine the bandwidth of a link can be found at
http://www.caida.org/tools/taxonomy/.
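The buffer-depth arithmetic works out as follows (a sketch; the 50 msec figure is the RTT, i.e. twice the 25 msec one-way delay):

```shell
#!/bin/sh
# Socket buffer size (bytes) from bandwidth * RTT.
bdp_bytes() {
    bps=$1; rtt_ms=$2
    echo $(( bps / 8 * rtt_ms / 1000 ))
}
bdp_bytes 1540000 50     # T1: ~10 KB
bdp_bytes 155000000 50   # OC3: ~1 MB
```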
Routing Architecture
Section 4.5
IP Routing
· IP is a "gas station" protocol
- Knows how to find next hop
- Makes best effort to deliver packets
· Kernel maintains routing tables
- route command adds entries
- So does routed
- Dynamic updates: ICMP redirects
What Goes Wrong?
· Unstable route tables (lies)
- Machines have wrong netmask or broadcast addresses
- Servers route by accident (multiple interfaces)
· Incorrect or missing routes
- Lost packets (nfs_server: bad sendreply)
· Asymmetrical routes
- Performance skews for in/outbound traffic
RIP Updates
· Routers send RIP packets every 30 seconds
- Each router increases the cost metric (cap of 15)
- Active/passive gateway notations
- /etc/gateways to seed behavior
· Default routes
- Chosen when no host or network route matches
- May produce ICMP redirects
- /etc/defaultrouter has initial value
Routing Architecture
· Default router or dynamic discovery
- One router or several?
- Dynamic recovery
- RDISC (RFC 1256)
· Multiple default routers
· Recovery time
- Function of network radix
Tips & Tricks
· Watch for IP routing on servers
- netstat -s shows IP statistics
- Consumes server CPU, network input bandwidth
· Name service dependencies
- Broken routing affects name service
- If netstat -r hangs, try netstat -rn
ICMP Redirects
· Packet forwarded over interface on which it
arrived
- ICMP redirect sent to transmitting host
- Sender should update routing tables
· Impact on default routes
- Implies a better choice is available
· Ignore or "fade" on host if incorrect:
  ndd -set /dev/ip ip_ignore_redirect 1
  ndd -set /dev/ip ip_ire_redirect_interval 15000
· Turn off to appease non-listeners:
  ndd -set /dev/ip ip_send_redirects 0
MTU Discovery
· Sending frames larger than a hop's MTU makes routers do fragmentation work
- Increases latency
- Do work on send side if you know MTU
· RFC 1191 - MTU discovery
- Send packet with "don't fragment" bit set
- Router returns ICMP error if too big
- Repeat with smaller frame size
- Disable with: ndd -set /dev/ip ip_path_mtu_discovery 0
This RFC, like all others, may be found in one of the RFC repositories:
www.rfc-editor.org, www.ietf.org, www.faqs.org/rfcs
ARP Cache Management
· ARP cache maintains IP:MAC mappings
· May want to discard quickly
- Mobile IP addresses with multiple hardware addresses, or DHCP with rapid re-attachment
- Network equipment reboots
- HA failover when MAC address doesn't move
· Combined route/ARP entries at IP level:
  ndd -set /dev/ip ip_ire_cleanup_interval 30000
  ndd -set /dev/ip ip_ire_flush_interval 90000
· Local net ARP entries explicitly aged:
  ndd -set /dev/arp arp_cleanup_interval 60000
See SunWorld Online, February and April 1997
Application Architecture
Appendix A
Topics
· System programming
· Network programming & tuning
· Memory management
· Real-time design & data management
· Reliable computing
System Programming
Section A.1
What Can Go Wrong?
· Poor use of system calls
· Polling I/O
· Locking/semaphore operations
· Inefficient memory allocation or leaks
System Call Costs
· System calls are traps: serviced like page faults
· Easily abused with small buffer sizes
· Example
- read() and write() on pipe
Buffer size   sy    cs    us  sy  id
10 KBytes     271   41    4   12  84
1 KBytes      595   319   5   33  62
1 Byte        3733  2178  11  89  0
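The buffer-size effect is mostly arithmetic: halving the buffer doubles the trap count. A sketch of the read() calls needed to move 10 MB:

```shell
#!/bin/sh
# read() calls required to copy 10 MB at various buffer sizes.
total=$(( 10 * 1024 * 1024 ))
for bs in 1 1024 10240; do
    echo "bs=${bs}: $(( total / bs )) calls"
done
```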
Using strace/truss
· Shows system calls and return values
- Locate calls that make process hang
- Debug permission problems
- Determine dynamic system call usage:
  % strace ls /
  lstat("/", 0xf77ffbb8) = 0
  open("/", 0, 0) = 3
  brk(0xf210) = 0
  fcntl(3, 02, 0x1) = 0
  getdents(3, 0x9268, 8192) = 716
Using strace or truss greatly slows a process down. You're effectively putting a
kernel trap on every system call.
Collating Results
tracestat:
#!/bin/sh
awk '{
if ( $1 == "-" )
print $2
else
print $1
}' | sort | uniq -c
% strace xprocess | tracestat
13 close
87 getitimer
2957 ioctl
13 open
228 read
117 setitimer
582 sigblock
Synchronous Writes
· write() system call waits until disk is done
- Often 20 msec or more disk latency
- Reduces buffering/increases disk traffic
· Caused by
- Explicit flag in open()
- sync/update operation, or NFS write
- Closing file with outstanding writes (news spool)
· Typical usage
- Ensuring data delivery to disk, for strict ordering
- Side-effects
close(2) is synchronous:
- waits for pending write(2)'s to complete
- fails if quota exceeded (EDQUOT) or filesystem full (ENOSPC)
Check the return value!
Eliminating Sync Writes
· NFS v3 or async mode in NFS v2
· Use file locking or semaphores
- Application ensures order of operations, not filesystem
- Better solution for multiple writers of same file
· Avoid open()-write()-close() loops
- Use syslog-like process for logging events
- Use database for preferences, history, environment
Network Programming
Section A.2
TCP/IP Buffering
· Segment sizes negotiated at connection
- Receiver advertises buffer up to 64K (48K)
- Sender can buffer more/less data
· Determine ideal buffer size on per-application
basis
- Global changes are harmful, can consume resources
- setsockopt(..SO_RCVBUF..), setsockopt(..SO_SNDBUF..)
The global parameters on Solaris systems are set via ndd(1M):
tcp_xmit_hiwat, udp_xmit_hiwat for sending buffers
tcp_recv_hiwat, udp_recv_hiwat for receiving buffers
Global parameters in /sys/netinet/in_proto.c for BSD systems are:
tcp_sendspace, udp_sendspace
tcp_recvspace, udp_recvspace
TCP Transmit Optimization
· Small packets buffered on output
- Nagle algorithm buffers 2nd write until 1st is acknowledged
- Will delay up to 50 msec; disable with setsockopt(..TCP_NODELAY..)
· Retransmit timer for PPP/dial-up nets
- tcp_rexmit_interval_min: default of 100, up to 1500
- tcp_rexmit_interval_initial: default of 200, up to 2500
Connection Management
· High request volume floods connection queue
- BSD had implied limit of 5 connections
- Now tunable in most implementations
· Connection requires 3 host-host trips
- Client sends request to server
- Server sends packet to client
- Client completes connection
· Longer network latencies (Internet) require
deeper queue
Connections, Part II
· Change listen(5) to listen(20) or more
- 20-32000+ ideal for popular services like httpd
  ndd -set /dev/tcp tcp_conn_req_max 100
· Socket addresses live on for 2 * MSL
- Database process crashes and restarts
- Can't bind to pre-defined address
- setsockopt(..SO_REUSEADDR..) doesn't help
· Decrease management timers
- tcp_keepalive_interval (msec)
- tcp_close_wait_interval (msec) [Solaris < 2.6]
- tcp_time_wait_interval (msec) [Solaris 2.6+]
Determine the average backlog using a simple queuing theory formula (Little's
law): average queue length = service time * arrival rate
With an arrival rate of 150 requests a second, and a round trip handshake time
of 100 msec, you'll need a queue 15 requests deep. Note that 100 msec is just
about the latency of a packet from New York to California and back again.
Once you've fixed the connection and timeout problems, make sure you aren't
running out of file descriptors for key processes like inetd.
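The notes' queueing formula in executable form, using the numbers given there:

```shell
#!/bin/sh
# Average backlog = arrival rate * service time (Little's law).
arrivals=150        # connection requests per second
handshake_ms=100    # round-trip handshake time, msec
backlog=$(( arrivals * handshake_ms / 1000 ))
echo "listen queue depth needed: $backlog"
```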
Memory Management
Section A.3
Address Space Layout
· Static areas: text, data
· Initialized data, including globals
· Uninitialized data (BSS)
· Growth
- Stack: grows down from top
- mmap: grows down from below stack limit
- Heap: grows up from top of BSS
Stack Management
· Local variables, parameters go on stack
· Don't put large data structures on stack
- Use malloc()
- Can thrash register-window overlap (SPARC)
Dynamic Memory Management
· malloc() and friends, free()
- Library calls on top of brk()
- Don't mix brk() and malloc()
- free() never shrinks heap, SZ is high-water mark
· Cell management
- malloc() puts cell size at beginning of block
- Allocates more than size requested
· Time- or space-optimized variants
- Try standard cell sizes
Typical Problems
· Memory leaks: SZ grows monotonically
· Address space fragmentation: MMU thrashing
· Data stride
- Access size matches cache size
- Repeatedly use 1 cache line
- Fix: move globals, resize arrays
· Use mmap() for sparsely accessed files
- More efficient than reading entire file into memory
mmap() or Shared Memory?
· Memory mapped files:
- Process coordination through file name
- Backed by filesystem, including NFS
- No swap space usage
- Writes may cause large page flush, better for reading
· Shared memory
- More set-up and coordination work with keys
- Backed by swap, not filesystem
- Need to explicitly write to disk
Real-Time Design
Section A.4
Why Worry About Real-Time?
· New computing problems
- Customer service with live transfer
- Real-time expectations of customers
- Web-based access
- If a user's in front of it, it's real time
· Predictable response times
- High volume transaction environment
- Threaded/concurrent programming models
· Things to learn from device drivers
System V Real-Time Features
· Real-time scheduling class
- Kernel pre-emption, including system calls
- Process promotion to avoid blocking chains
· No real-time network or filesystem code
· Resource allocation and "nail down"
- mlock(), plock() to lock memory/processes
· Move process into real-time class with priocntl
Real-Time Methodology
· Processes run for short periods
- Same model used by Windows
- Must allow scheduler to run: sleep or wait
- CPU-bound jobs will lock system
· Time quanta inversely proportional to priority
- Minimize latency to schedule key jobs
- Ship low-priority work to low-priority thread
· No filesystem/network dependencies
Summary
Parting Shots, Part 1
· Be greedy
- Solve for biggest gains first
- Don't over-tune or over-analyze
· Don't trust too much
- 3rd party code, libraries, blanket statements
- Verify information given to you by users
· Bottlenecks are converted
- Add network pipes, reduce latency, hurt server
- Fixing one problem creates 3 new ones
- Some speedups are superlinear
Parting Shots, Part 2
· Today's hot technology is tomorrow's capacity
headache
- Web browser caches, PCN, streaming video
- But taxing use leads to insurrection
· Rules change with each new release
- New features, new algorithms
- RFC compliance is creative art
· Nobody thanks you for being pro-active
- But you should be!