TRANSCRIPT
Version 3.10 LISA 2006 1
©1994-2006 Hal Stern, Marc Staveley
System & Network
Performance Tuning
Hal Stern, Sun Microsystems
Marc Staveley, SOMA Networks
This tutorial is copyright 1994-1999 by Hal L. Stern and 1998-2006 by Marc Staveley. It may not be used in whole or part for commercial purposes without the express written permission of Hal L. Stern and Marc Staveley.
Hal Stern is a Distinguished Systems Engineer at Sun Microsystems. He was the System Administration columnist for SunWorld from February 1992 until April 1997, and previous columns and commentary are archived at: http://www.sun.com/sunworldonline.
Hal can be reached at [email protected].
Marc Staveley is the Director of IT for SOMA Networks Inc. He is a frequent speaker on the topics of standards-based development, multi-threaded programming, system administration and performance tuning.
Marc can be reached at [email protected].
Some of the material in the Notes sections has been derived from columns and articles first appearing in SunWorld, Advanced Systems and SunWorld Online. Hal thanks IDG and Michael McCarthy for their flexibility in allowing him to retain the copyrights to these pieces.
Rough agenda:
9:00 - 10:30 AM Section 1
11:00 - 12:30 PM Section 2
1:30 - 3:00 PM Section 3
3:30 - 5:00 PM Section 4
Syllabus
· Tuning Strategies & Expectations
· Server Tuning
· NFS Performance
· Network Design, Capacity Planning &
Performance
· Application Architecture
Some excellent books on the topic:
Raj Jain, Computer System Performance (Wiley)
Mike Loukides, System Performance Tuning (O'Reilly)
Adrian Cockcroft and Richard Pettit, Sun Performance and Tuning, Java and the
Internet (SMP/PH)
Craig Hunt, TCP/IP Network Administration (O'Reilly)
Brian Wong, Configuration and Capacity Planning for Solaris Servers
(SunSoft/PH)
Richard McDougall et al., Sun Blueprints: Resource Management (SMP/PH)
Some Web resources:
Solaris Tunable Parameters Reference Manual
(http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters/)
Solaris 2 - Tuning Your TCP/IP Stack and More (http://www.sean.de/Solaris)
Tutorial Structure
· Background and internals
- Necessary to understand user-visible symptoms
· How to play doctor
- Diagnosing problems
- Rules of thumb, upper and lower bounds
· Kernel tunable parameters
- Formulae for deriving where appropriate
If you take only two things from the whole tutorial, they should be:
- Disk configuration matters
- Memory matters
Tuning Strategies &
Expectations
Section 1
Topics
· Practical goals
- Terms & conditions
- Workload characterization
· Statistics and ratios
- Monitoring intervals
· Understanding diagnostic output
Practical Goals
Section 1.1
Why Is This Hard?
Business
Transaction
Database
Transaction
Transaction Monitor
DBMS Organization
SQL Optimizer
System
CPU
Network
Latency
Disk
I/O
User
CPU
increasing
loss of
correlation
decreasing
ease of
measurement
The problem with un-correlated inputs and measurements is akin to that of
driving a car while blindfolded: the passenger steers while elbowing you to
work the gas and brakes. When your reflexes are quick, you can manage, but if
you misinterpret a signal, you end up rebooting your car.
Correlating user work with system resources is what Sun's DTrace and
FreeBSD's ktrace attempt to do.
Social Contract Of Administration
· Why bother tuning?
- Resource utilization, purchase plans, user outrage
· Users want 10x what they have today
- sound and video today, HDTV tomorrow
- Simulation and decision support capabilities
· Application developers should share
responsibility
- Who owns the educational process?
- Performance and complexity trade-off
- Load, throughput and cost evaluations
System administrators today are playing a difficult game of perception
management. Hardware prices have declined to the point where most
managers believe you can get Tandem-like fault tolerance at PC prices with no
additional software, processes or disciplines. Much of this tutorial is about
acquiring, enforcing and insisting on discipline.
Tuning Potential
· Application architecture: 1,000x
- SQL, query optimizer, caching, system calls
· Server configuration: 100x
- Disk striping, eliminate paging
· Application fine-tuning: 2-10x
- Threads, asynchronous I/O
· Kernel tuning: less than 2x on tuned system
- If kernel bottleneck is present, then 10-100x
- Kernel can be a binary performance gate
Here are some "laws" of the computing realm compared:
Moore's law predicts a doubling of CPU horsepower every 18 months, so that
gives us about a 16x improvement in 6 years.
If you look at reported transaction throughput for Unix database systems,
though, you'll see a 100x improvement in the past 6 years -- there's more than
just compute horsepower at work. What we've measured is the result of
operating systems, disks, parallelism, bus throughput and improved
applications.
An excellent discussion of "rules of thumb" as a consequence of Moore's Law is
found in Gray and Shenoy's Rules of thumb in data engineering, Microsoft
Research technical report MSR-TR-99-100, Feb. 2000.
Practical Tuning Rules
· There is no "ideal" state in a fluid world
· Law of diminishing returns
- Early gains are biggest/best
- More work may not be cost-effective
· Negativism prevails
- Easy to say "This won't work"
- Hard to prove configuration can deliver on goals
· Headroom for well-tuned applications?
- Good tuning job introduces new demands
· Kaizen
Terminology: Bit Rates
· Bandwidth
- Peak of the medium, bus: what's available
- Easy to quote, hard to reach
· Throughput
- What you really get: useful data
- Protocol dependent
· Utilization
- How much you used
- Not just throughput/bandwidth
- 100% utilized with useless data: collisions
Bandwidth => Utilization => Throughput
Each measurement shows a slight (or sometimes great) loss over the previously
ordered metric.
Formal definitions:
Bandwidth: the maximum achievable throughput under ideal workload
conditions (nominal capacity)
Throughput: rate at which the requests can be serviced by the system.
Utilization: the fraction of time the resource is busy servicing requests.
Terminology: Time
· Latency
- How long you wait for something
· Response time
- What user sees: system as a black box
- Standard measures
  TPC-C: transactions per minute
  TPC-D: queries per hour
· Bad Things
- Knee, wall, non-linear
[Figure: throughput vs. load curve, marking the knee capacity and usable capacity]
Example
· Bandwidth to NYC
- 10 lanes x 5 cars/s x 4 people/car = 200 pps
· Throughput
- 1 person/car (bad protocol), 1-2 cars/s (congestion)
- Parking delays (latency)
· How to fix it
- Increase number of lanes (bandwidth)
- More people per vehicle (efficient protocol)
- Eliminate toll (congestion)
- Better parking lots (reduce latency)
Tolls add to latency (since you have to stop and pay them) and also to
congestion when traffic merges back into a few lanes. Congestion from traffic
merges is another form of increased latency.
Now consider this: You wire your office with 100baseT to the desktops, feeding
into 1000baseT switched Ethernet hubs. If you run 16 desktops into each
switch, you're merging 16 * 100 = 1600 Mbits/sec into a 1000 Mbits/sec
"tunnel".
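The arithmetic in the highway example and the switch scenario above is simple enough to sketch; the function names here are invented for illustration:

```python
# Back-of-envelope versions of the two calculations above. Function names
# are invented for illustration.
def highway_bandwidth(lanes, cars_per_sec, people_per_car):
    """Peak capacity under ideal conditions (the 'bandwidth')."""
    return lanes * cars_per_sec * people_per_car

def oversubscription(ports, port_mbits, uplink_mbits):
    """Offered load divided by uplink capacity."""
    return ports * port_mbits / uplink_mbits

print(highway_bandwidth(10, 5, 4))      # 200 people/sec into NYC
print(oversubscription(16, 100, 1000))  # 1.6: a 1600 -> 1000 Mbit/s merge
```

Any oversubscription ratio above 1.0 means the uplink becomes the congestion point whenever all ports transmit at once.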
Unit Of Work Paradox
· Unit of work is the typical "chunk size" for
- Network traffic
- Disk I/O
· Small units optimized for response time
- Network transfer latency, remote processing
· Large units optimized for protocol efficiency
- Compare ftp (~4% waste) & telnet (~90% waste)
- Ideal for large transfers like audio, video
· Where does ATM fit?
ATM uses fixed-size cells (53 bytes, 48 of them payload), making it well suited
to traffic that must be optimized for response time, such as interactive voice
and video. Unfortunately, because the cells are so small, ATM incurs a large
processing overhead on bulk transfers of large files, such as stored audio or
video clips.
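A rough sketch of where waste figures like these come from, assuming about 58 bytes of Ethernet/IP/TCP header per packet and ATM's 5-byte cell header; the exact numbers vary with options and media:

```python
# Back-of-envelope waste per packet, assuming ~58 bytes of Ethernet/IP/TCP
# header per full-size (1460-byte) segment, and ATM's 5-byte header on a
# 48-byte payload. Header sizes vary with options; these are illustrative.
def waste(payload, overhead):
    return overhead / (payload + overhead)

print(f"{waste(1460, 58):.1%}")  # 3.8%: bulk ftp-style transfer
print(f"{waste(48, 5):.1%}")     # 9.4%: ATM cell header alone
```

A one-byte interactive payload (telnet) against the same per-packet overhead wastes well over 90% of each frame; the exact figure depends on which overheads you count.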
Workload Characterization
· What are the users (processes) doing?
- Estimating current & future performance
- Understanding resource utilization
· Fixed workloads
- Easy to characterize & project
· Random workloads
- Take measurements, look at facilities over time
- Tools & measurement intervals
Completeness Counts
· Random or sequential access?
- Koan of this tutorial
· Don't say: 1,000 NFS requests/second
- Read/write and attribute browsing mix?
- Average file size and lifetime?
- Working set of files?
· Don't say: 400 transactions/second
- Lookup, insert, update mix?
- Indexes used?
- Checkpoints, logs, 2-phase commit?
Statistics & Ratios
Section 1.2
Useful Metrics
· Latency over utilization
- Loaded resources may be sufficient
- What does the user see?
· Peak versus average load
- How system reacts under crunch
- What are new failure modes at peaks?
· Time to:
- Recover, repair, rebuild from faults
- Accommodate new workflow
· Managing applications
Recording Intervals
· Instantaneous data rarely useful
- Computer and business transactions long-lived
- Smooth out spikes in small neighbourhoods
· Long-term averages aren't useful either
- Peak demand periods disappear
- Can't tie resources to user functions
· Combine intervals
- 5-10 seconds for fine-grain work (OLTP)
- 10-30 seconds for peak measurement
- 10-30 minutes for coarse-grain activity (NFS)
Nyquist Frequency
· Same total load between B and D
- Peaks are different at C
- Sampling frequency determines accuracy
· Nyquist frequency is >2x "peak cycle"
- Peaks every 5 min, sample every 2.5 min
[Figure: two load curves sampled at points A, B, C, D, E]
The total area under the two curves is about the same from "B" to "D". If you
simply measure at these endpoints and take an average, you'll think the two
loads are the same, and miss the peaks. If you measure at twice the frequency of
the peaks -- "B", "C" and "D", you'll see that peak demand is greater than the
average on the green-lined system.
The Nyquist theorem: to reconstruct a sampled input signal accurately,
sampling rate must be greater than twice the highest frequency in the input
signal.
The Nyquist frequency: the sampling rate / 2
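A toy illustration of the sampling argument, using a synthetic load with a spike every 5 minutes; all the numbers here are invented:

```python
import math

# Synthetic load: a baseline of 10 with a sharp spike to 50 every 300 s.
def load(t):
    return 10 + 40 * max(0.0, math.sin(2 * math.pi * t / 300)) ** 8

def observed_peak(interval_s, duration_s=1800):
    return max(load(t) for t in range(0, duration_s, interval_s))

# Sampling once per peak period aliases the spikes away entirely;
# sampling well above the Nyquist rate (here every 25 s) catches them.
print(observed_peak(300))  # 10.0: every sample lands between spikes
print(observed_peak(25))   # 50.0: the true peak is seen
```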
Normal Values
· Maintain baselines
- "But it was faster on Tuesday!"
- Distinguish normal and threshold-crossing states
- Correlate to type of work being done (user model)
· Scalar proclamations aren't valuable
- CPU load without application knowledge
- Disk I/O traffic without access patterns
- Memory usage without cache hit data
Effective Ratios
· Find relationships between work and resources
- Units of work: NFS operations, DB requests
- Units of management: disks, memory, network
· Use correlated variables
- Or ratios are just randomly scaled samples
· Measure something to be improved
- Bad example: Bugs/lines of code
- Good example: collisions/packet size
· Confidence intervals
- Sensitivity of ratio & error bars (accuracy)
Be sure you can control granularity of the denominator. That is, you shouldn't
be able to cheat by increasing the denominator and lowering a cost-oriented
ratio, showing false improvement. Bugs per line of code is a miserable metric
because the code size can be inflated. Quality is the same but the metric says
you've made progress.
The accuracy of a ratio is multiplied by its sensitivity - a small understatement in
a ratio that grows superlinearly with its denominator turns into a large error.
When you multiply two inexact numbers, you also multiply their errors
together. Looking at 50 I/O operations per second, plus or minus 5 Iops is
reasonable, but 50 Iops plus or minus 45 Iops is the same as taking a guess.
The Arms index, named for Richard Arms, is sometimes called the TRIN
(Trading Index). It's a measure of correlation between the price and volume
movements of the NYSE. Instead of looking at up stocks/down stocks or up
volume/down volume, the Arms index computes
(up stocks/down stocks) / (up vol/down vol)
When the index is at 1.0, the up and down volumes reflect the number of issues
moving in each direction. An index of 0.5 means advancing issues have twice
the share volume of decliners (strong); an index over 2.0 means the decliners are
outpacing the gainers on a volume basis.
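The Arms/TRIN ratio as defined above, with invented sample figures:

```python
# The Arms/TRIN index described above. Inputs are invented sample figures.
def arms_index(up_stocks, down_stocks, up_volume, down_volume):
    return (up_stocks / down_stocks) / (up_volume / down_volume)

# Equal breadth, but advancers carry twice the volume: a "strong" 0.5.
print(arms_index(1500, 1500, 800e6, 400e6))  # 0.5
```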
Understanding
Diagnostic Output
Section 1.3
General Guidelines
· Use whatever works for you
- Make sure you understand output format & scaling
- Know inconsistencies by platform & tool
· Ignore the first line of output
- Average since system was booted
- Interval data is more important
· Accounting system
- Source of accurate fine-grain data
- Must be enabled on most systems
Process accounting gives you detailed breakdowns of resource utilization,
including the number of system calls, the amount of CPU used, and so on. This
adds at most a few percent to system overhead. While accounting can be about
5% in worst case, auditing (used for security and fine-grain access control) adds
between 10-20% overhead. Auditing tracks every operation from a user process
into the kernel.
If your system stays up for a long (100 days or more) period of time, you may
find some of the counters wrap around their 31-bit signed values, producing
negative reported values.
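A hypothetical illustration of that wrap: a counter held as a signed 32-bit integer goes negative once the running total passes 2**31 - 1.

```python
# Hypothetical illustration of the wrap: a counter held in a signed 32-bit
# int appears negative once the running total exceeds 2**31 - 1.
def as_signed32(total):
    v = total & 0xFFFFFFFF
    return v - 2**32 if v >= 2**31 else v

print(as_signed32(2**31 - 1))  # 2147483647: the last positive value
print(as_signed32(2**31))      # -2147483648: the counter appears negative
```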
Standard UNIX System Tools
· vmstat, sar
- Memory, CPU and system (trap) activity
- sar has more detail, histories
- vmstat uses KB, sar uses pages
· iostat
- Disk I/O service time and operation workhorse
· nfsstat
- Client and server side data
· netstat
- TCP/IP stack internals
· pflags, pcred, pmap, pldd, psig, pstack, pfiles, pwdx, pstop, prun, pwait,
ptree, ptime: (Solaris) display various pieces of information about process(es)
in the system.
· mpstat (Solaris, Linux): per-processor statistics, e.g. faults, inter-processor
cross-calls, interrupts, context switches, thread migrations etc.
· top (all), prstat (Solaris): show an updated view of the processes in the system.
· memtool (Solaris <=8): everything you ever wanted to know about the
memory usage in a Solaris box [http://playground.sun.com/pub/memtool]
· mdb ::memstat (Solaris >=9): same info as memtool
· lockstat, DTrace (Solaris >=10): what are the processes and kernel really
doing?
· setoolkit (Solaris, and soon others): virtual performance experts
[http://www.setoolkit.com]
· kstat (Solaris): display kernel statistics
· RRDB/ORCA/Cricket/MRTG/NRG/Smokeping/HotSaNIC/OpenNMS:
performance graphing tools
· HP Perfview: part of OpenView
Accounting
· 7 processes running on a loaded system
- top or ps show "cycling" of processes on CPUs
- Which one is the pig in terms of user CPU, system calls, disk I/O initiated?
· Accounting data shows per-process info
- Memory
- CPU
- System calls
Turn on accounting - Mike Shapiro (Distinguished Engineer at Sun, and all-round kernel guru) claims that the overhead of accounting is low. The kernel always collects the data; you just pay the I/O overhead to write it to disk.
Output Interpretation: vmstat
% vmstat 5
procs memory page disk faults cpu
r b w free re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18
- procs - running, blocked, swapped
- fre - free memory (not process, kernel, cache)
- re - reclaims, page freed but referenced
- at - attaches, page already in use (i.e., shared library)
- pi/po - page in/out rates
- fr - page free rate
- sr - paging scan rate
Always, always drop the first line of output from system tools like vmstat. It
reflects totals/averages since the system was booted, and isn't really meaningful
data (certainly not for debugging).
You'll see the fre column start high - close to the total memory in the system -
and then sink to about 5% of the total memory over time, in systems like Solaris
(<= 2.6), Irix and other V.4 variants. This is due to file and process page caching,
and is perfectly normal.
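When post-processing vmstat output with a script, the same rule applies. This sketch uses a simplified, hypothetical column set (real vmstat output repeats field names like "sy", which a dict would silently clobber):

```python
# Sketch of the "drop the first line" rule when scripting around vmstat.
# The column set here is simplified and hypothetical.
SAMPLE = """\
r b w free re at pi po fr de sr us sy id
1 0 0 1788  0  1 36  0  0 16  0 45 14 41
3 0 0 2000  0  1 60  0  0  0  0 38 45 18
"""

def parse_vmstat(text):
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = [dict(zip(header, map(int, line.split()))) for line in lines[1:]]
    return rows[1:]   # discard row one: averages since boot, not interval data

for row in parse_vmstat(SAMPLE):
    print(row["us"], row["sy"], row["id"])  # 38 45 18
```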
Interpretation, Part 2
% vmstat 5
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
1 0 0 1788 0 1 36 0 0 16 0 16 0 0 0 42 105 297 45 14 41
3 0 0 2000 0 1 60 0 0 0 0 20 0 0 0 83 197 226 38 45 18
- disk - disk operations/sec, use iostat -D
- in - interrupts/sec, use vmstat -i
- sy - system calls/sec
- cs - context switches/sec
- us - % CPU in user mode
- sy - % CPU in system mode
- id - % CPU idle time
- swap (Solaris) - amount of swap space used
- mf (Solaris) - minor fault, did not require page in (zero fill on demand, copy on write, segmentation or bus errors)
Zero fill on demand (ZFOD) pages are paged in from /dev/zero, and produce
(as you would expect) a page filled with zeros, quite useful for the uninitialized
data (bss) segment of a process.
Example #1
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
2 0 0 1788 0 1 36 0 0 0 0 6 0 0 0 42 45 297 97 2 1
3 0 0 2000 0 1 60 0 0 0 0 2 0 0 0 83 97 226 94 4 2
·High user time, little/no idle time
·Some page-in activity due to filesystem reads
·Application is CPU bound
Example #2
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 11 0 1788 0 0 34 0 0 0 0 24 10 0 0 34 272 310 25 58 17
3 10 0 2000 0 0 30 0 0 0 0 14 12 0 0 35 312 340 26 55 19
·Heavy disk activity resulting from system calls
·Heavy system CPU utilization, but still some idle
time
·Database or web server with badly tuned disks
·Lower system call rate implies NFS server, same
problems
System calls can "cause" interrupts (when I/O operations complete), network
activity, and disk activity. A high volume of network inputs (such as NFS traffic
or http requests) can cause the same effects, so it's important to dig down
another level to find the source of the load.
Example #3
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 0 0 1788 0 0 4 0 0 0 0 1 0 0 0 534 10 25 15 80 5
2 0 0 2000 0 0 3 0 0 0 0 1 0 0 0 515 12 30 15 83 2
· High interrupt rate without disk or system call
activity
· Implies network, serial port or PIO device
generating load
· Host acting as router, unsecured tty port or a
nasty token ring card
Example #4
procs memory page disk faults cpu
r b w fre re at pi po fr de sr s0 s1 s2 d3 in sy cs us sy id
3 3 0 1788 0 12 54 30 60 0 100 53 0 0 0 64 110 105 15 10 75
2 4 0 2000 0 10 43 28 58 0 110 41 0 0 0 60 112 130 12 10 78
· Page-in/page-out and free rates indicate VM
system is busy
· High idle time from waiting on disk
· Paging/swapping to root disk (primary swap
area)
· Machine is memory starved
Server Tuning
A single machine (works for desktops too)
Section 2
Topics
· CPU utilization
· Memory consumption & paging space
· Disk I/O
· Filesystem optimizations
· Backups & redundancy
Tuning Roadmap
· Eliminate or identify CPU shortfall
· Reduce paging and fix memory problems
· Balance disk load
- Volume management
- Filesystem tuning
- Backups & integrity planning
Do steps in this order.
CPU Utilization
Section 2.1
Where Do The Cycles Go?
· > 90% user time
- Tune application code, parallelize
· > 30% system time
- User-level processes: system programming
· Kernel-level work consumes system time
- NFS, DBMS calls, httpd calls, IP routing/filtering
· NIS, DNS (named), httpd are user-level
- High system-level CPU without corresponding user-level CPU is unusual in these configurations
Perhaps the best tool for quickly identifying CPU consumers is top.
top is a continuously updating, full-screen version of ps that runs on nearly every Unix variant.
A high system CPU % on an NIS or DNS server could indicate that the machine
is also acting as a router, or handling other network traffic.
Idle Time
· > 10% idle
- I/O bound, tune disks
- Input bound, tune network
· %wait, %wio are for disk I/O only
- Network I/O shows up as idle time
- RPC, NIS, NFS are not I/O waits
One possibility for high idle time is that the system is really doing nothing. This
is fine if you aren't running any jobs, but if you are expecting input and aren't
getting it, it's time to look away from the client/server and at the network. The
client trying to send on the network will show a variety of network contention &
latency problems, but the server will appear to be idle.
Multiprocessor Systems
· vmstat, sar show averages
· Example: 25% user time on 4-way host
- 4 CPUs at 25% each
- 2 CPUs at 50% each, 2 idle
- 1 CPU at 100%, 3 idle
· Apply rules on per-CPU basis
· System-specific tools for breakdown
- mpstat, psrinfo (Solaris 2.x)
A Puzzle
· Server system with framebuffer behaves well
(mostly)
· Periodically experiences major slowdown
- File service slows to crawl
- User and system CPU total near 100%
· Can never find problem on console; problem
disappears when monitoring begins
Controlling CPU Utilization
· Process "pinning"
- Maintain CPU cache warmth
- Cut down on MP bus/backplane traffic
- Unclear effects for multi-threaded processes
· Resource segregation
- Scheduler tables
- Process serialization
- Memory allocation
- E10K domains
· OS may do a better job than you do!
"pinning" in Solaris may be done with the "psr" commands: psrset, psrinfo.
Process Serialization
· Multiple user processes: good MP fit
- Memory, disk must be sufficient
· Resource problems
- # jobs > # CPUs
- sum(memory) > available memory
- Cache thrashing (VM or CPU)
· Resource management to the rescue
The key win of using a batch scheduler is that it controls usage of memory and
disk resources as well. Even if you're not CPU bound, a job scheduler can
eliminate contention for memory (discussed later) by controlling the total
memory footprint of jobs that are runnable at any one time. When you're short
on memory, two jobs each given half the resources don't add up to one job's
worth of throughput; it's more like half.
Resource Management
· Job Scheduler: serialization
- Batch queue system
- Line up jobs for CPUs like bank tellers
- Manage total memory footprint
· Batch Scheduler: prioritization
- Modifies scheduler to only let some jobs run when system is idle
· Fair Share Scheduler: parallelization
- Gives groups of processes "shares" of memory and CPU
Your goal with the job scheduler is to reduce the average wait time for a job. If
the typical time to complete is 5 minutes for a job when, say, 5 jobs run in
parallel, then you should try getting the average completion time down into the
1 1/2 to 3 minute range by freeing up resources for each job to run as fast as
possible. Even though the jobs run serially, the expected time to completion is
lower when each job finishes more quickly.
A batch scheduler for Solaris is available from Sun PS's Customer Engineering
group. An example of one produced using the System V dispatch table is
described in SunWorld, July 1993.
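A toy model of the serialization argument above: five jobs, each needing one minute of CPU, on a single processor; the numbers are invented for illustration.

```python
# Toy model: five jobs, each needing 1 minute of CPU, on one processor.
def parallel_completion(n_jobs, cpu_minutes):
    # Ideal round-robin: every job gets 1/n of the CPU, all finish together.
    return [n_jobs * cpu_minutes] * n_jobs

def serial_completion(n_jobs, cpu_minutes):
    # Batch queue: jobs run one at a time, finishing at 1, 2, ... n minutes.
    return [(i + 1) * cpu_minutes for i in range(n_jobs)]

def avg(xs):
    return sum(xs) / len(xs)

print(avg(parallel_completion(5, 1)))  # 5.0: everyone waits 5 minutes
print(avg(serial_completion(5, 1)))    # 3.0: average completion drops
```

The total work is identical in both cases; only the average time-to-completion changes, which is exactly the metric users perceive.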
Context Switches
· What is a context switch? (cs or ctx)
- New runnable thread (kernel or user) gets CPU
- Rates vary on MP systems
· Causes
- Single running process yields to scheduler
- Interrupt makes another process runnable
- Process waits on I/O or event (signal)
· A symptom, not a problem
- With high interrupt rates: I/O activity
- With high system call rates: bad coding
Traps and System Calls
· What is a trap?
- User process requests operating system help
· Causes
- System call, page fault (common)
- Floating Point Exception
- Unimplemented instructions
- Real memory errors
· Less common traps are cause for alarm
- Wrong version of an executable
- Hardware troubles
Version mismatches:
SPARC V7 has no integer multiply/divide
SPARC V8 has imul/idiv, and optimized code uses it. When run on a SPARC
V7 machine, each imul generates an unimplemented instruction trap, which the
kernel handles through simulation, using the same user-level code the compiler
would have inserted for a V7 chip.
Symptoms of this problem: very high trap rate (on the order of thousands per
second, or about one per arithmetic operation) but no system calls. Normally, a
high trap rate is coupled with a high system call rate -- the system calls generate
traps to get the kernel's attention.
Memory Consumption
& Paging (Swap) Space
Section 2.2
Page Lifecycle
· Page creation: at boot time
· Page fills
- From file: executable, mmap()
- From process: exec()
- Zero Fill On Demand (zfod): /dev/zero
· Page uses
- Kernel and its data structures
- Process text, data, heap, stack
- File cache
· Pages backed by filesystem or paging (swap)
space
/dev/zero is the "backing store" for zero-filled pages. It produces an endless
stream of zeros -- you can map it, read it, or cat it, and you get zeros.
/dev/null is a bottomless sink – you write to it and the data disappears.
Reading from /dev/null produces an immediate end of file, not pages of
zeros.
Filesystem Cache
· System V.4 uses main memory
- Systems run with little free memory
- Available frames used for files
· Side effects
- Some page freeing normal
- All filesystem I/O is page in/out
· Solaris (>= 8)
- Uses the cyclic page cache for filesystem pages
  (filesystem cache lives on the free list)
Paging (Swap) Space & VM
· BSD world
- Total VM = swap plus shared text segments
- Must have swap at least as large as memory
- Can run out of swap before memory
· Solaris world
- Total VM = swap + physical memory - N pages
- Can run swap-less
- Swap = physical memory "overage"
· Running out of swap
- EAGAIN, no more memory, core dumps
If you run swapless, you cannot catch a core dump after a reboot (since there is
no space for the core dump to be written).
Estimating VM Requirements
· Look at output of ps command
- RSS: resident set size, how much is in memory
  Total RSS field for lower bound
- SZ: stack and data, backed by swap space
  Total SZ field for upper bound, good first cut
· Memory leaks
- Processes grow: SZ increases
- Examine use of malloc()/free()
- Will exhaust paging space; may hang system
Under Solaris, you can use the memtool package to estimate VM requirements
(http://playground.sun.com/pub/memtool).
Memory leaks are covered in more detail in Section 5, as an application problem.
Your first indication that you have an issue is when you notice VM problems,
which should point back to an application problem, so we mention it here first.
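A hypothetical sketch of totalling those two ps fields; the column layout here is invented, so check your platform's ps(1) flags and units (often pages, not KB):

```python
# Hypothetical sketch: totalling RSS (lower bound) and SZ (upper bound)
# from ps output. The column layout is invented for illustration.
PS_OUTPUT = """\
  PID    SZ   RSS COMMAND
  101  2048  1024 nfsd
  102  8192  3072 oracle
  103  1024   512 sendmail
"""

def vm_bounds(text):
    rows = text.strip().splitlines()[1:]      # skip the header line
    sz = sum(int(r.split()[1]) for r in rows)
    rss = sum(int(r.split()[2]) for r in rows)
    return rss, sz                            # (lower bound, upper bound)

print(vm_bounds(PS_OUTPUT))  # (4608, 11264)
```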
Paging Principles
· Reclaim pages when memory runs low
- Start running pagedaemon (pid 2)
· Crisis avoidance
- Guiding principle of VM system
· Page small groups on demand
- Keep small pool free
- Swap to make large pools available
- Compare 200M swapped out in one step to 64M paged out in 16,000 steps
The hands of the "clock" sweep through the page structures at the same rate,
at a fixed distance apart (handspreadpages).
If the fronthand encounters a page whose reference bit is on, it turns the bit
off. When the backhand looks at the page later, it checks the bit. If the bit is
still off, nothing referenced this page since the fronthand looked at it. The
page may move onto the page freelist (or be written to swap).
The rate at which the hands sweep through the page structures varies
linearly with the amount of free memory. If the amount of free memory is
lotsfree, the hands move at a minimum scan rate, slowscan. As the
amount of free memory approaches 0, the scan rate approaches fastscan.
Handspreadpages – determines the amount of time an application has to
touch a page before it will be stolen for the free list.
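A minimal sketch of the two-handed clock described above. Pages are modelled as a circular list of reference bits; the layout and sizes are invented.

```python
# Minimal sketch of the two-handed clock: the fronthand clears each page's
# reference bit; the backhand, handspread positions behind, frees any page
# whose bit is still clear when it arrives.
def sweep(ref_bits, start, npages, handspread):
    """Advance both hands npages positions; return indices of pages freed."""
    size = len(ref_bits)
    freed = []
    for i in range(npages):
        front = (start + handspread + i) % size
        back = (start + i) % size
        ref_bits[front] = 0          # fronthand clears the reference bit
        if ref_bits[back] == 0:      # still untouched when the backhand arrives
            freed.append(back)       # candidate for the free list (or swap)
    return freed

pages = [1, 0, 1, 1, 0, 1, 0, 1]     # 1 = referenced since the last sweep
print(sweep(pages, start=0, npages=4, handspread=3))  # [1, 3]
```

A larger handspread gives applications a longer window to re-reference a page before the backhand reaches it.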
VM Pseudo-LRU Analysis
· pagedaemon runs every 1/4 second
- Runs for 10 msec to "sweep a bit"
· Clock algorithm
- Pages arranged in logical circle
[Figure: clock algorithm - pages in a circle, with the fronthand and backhand separated by "handspread"]
VM Thresholds (Solaris >= 2.6)
· lotsfree: defaults to 1/64 of memory
- Point at which paging starts
- Up to 30% of memory (not enforced)
· desfree: panic button for swapper
- ½ lotsfree
· minfree: unconditional swapping
- ½ desfree
- low water mark for free memory
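The default threshold ladder above can be sketched numerically, in pages:

```python
# The Solaris >= 2.6 default threshold ladder described above, in pages.
def vm_thresholds(physmem_pages):
    lotsfree = physmem_pages // 64   # paging starts below this
    desfree = lotsfree // 2          # swapper's panic button
    minfree = desfree // 2           # unconditional swapping
    return lotsfree, desfree, minfree

# e.g. 512 MB of 8 KB pages = 65536 pages:
print(vm_thresholds(65536))  # (1024, 512, 256)
```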
VM Thresholds (cont.)
· cachefree
- Solaris 2.6 (patch 105181-10) and Solaris 7
  (not Solaris >= 8)
- lotsfree * 2
- page scanner looks for unused pages that are not
claimed by executables (file system buffer cache pages)
· cachefree > lotsfree > desfree > minfree
- Strict ordering
- lotsfree-desfree gap should be big enough for a typical process creation or malloc(3) request.
If priority_paging=1 is set in /etc/system then cachefree is set to twice
lotsfree (otherwise cachefree == lotsfree), and slowscan moves to
cachefree (see next slide).
Desktop performance increases of 10% to 300% have been reported.
It is not clear whether it helps servers; that depends on the type, and it is typically not good for file servers.
VM Thresholds in action
[Figure: scan rate vs. free memory - the rate ramps linearly from slowscan (100) at lotsfree/cachefree up to fastscan (8192) as free memory falls toward minfree; thresholds minfree, desfree, lotsfree and cachefree marked at 4MB, 8MB, 16MB and 32MB]
minfree is needed to allow "emergency" allocation of kernel data structures such
as socket descriptors, stacks for new threads, or new memory/VM system
structures. If you dip below minfree, you may find you can't open up new
sockets (and you'll see EAGAIN errors at user level).
The speed at which you crash through lotsfree toward minfree is driven by the
demand for memory. The faster you consume memory, the more headroom you
need above minfree to allow the system to absorb the new demand.
Solaris >= 2.6
fastscan = min( ½ mem, 64 MB)
slowscan = min( 1/20 mem, 100 pages)
handspreadpages = fastscan
Therefore all of memory is scanned in 2 (20) seconds at fastscan (slowscan) and
an application has 1 (10) seconds to reference a page before it will be put on the
free list [for a 128MB machine, like they still exist...]
Sweep Times
· Time required to scan all of memory
  · physmem/fastscan lower bound
  · physmem/slowscan upper bound
· Shortest window for pages to be touched
  · handspreadpages/fastscan
· Application-dependent tuning
- Increase handspread, especially on large memory machines
- Match LRU window (coarsely) to transaction duration
As an example of an upper bound on the scanning time: consider slowscan at
100 pages/second, and a 640M machine with a 4K pagesize. That's 160K pages,
meaning a full memory scan will take 1600 seconds. Crank up the value of
fastscan to reduce the round-trip scanning time if required.
The output of vmstat -S shows you how many "revolutions" the clock hands
have made. If you find the system spinning the clock hands you may be
working too hard to free too few pages of memory.
Some tuning may help for large, scientific applications that have peculiar or
very well-understood memory traffic patterns. Sequential access, for example,
benefits from faster "free behind"
Servers (systems doing lots of filesystem I/O) should set fastscan large (131072
[8KB] pages = 1GB/second)
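The sweep-time bounds above are simple arithmetic; a sketch using the 640 MB / 4 KB-page example from the notes:

```python
# Sweep-time bounds from the slide: physmem/fastscan is the lower bound,
# physmem/slowscan the upper bound (both in seconds).

def sweep_bounds(physmem_pages, fastscan, slowscan):
    return physmem_pages / fastscan, physmem_pages / slowscan

# 640 MB machine with 4 KB pages -> 163,840 pages (the "160K pages" above)
pages = 640 * 1024 // 4
lower, upper = sweep_bounds(pages, fastscan=8192, slowscan=100)
print(lower, round(upper))  # 20.0 1638 -- i.e. the ~1600 seconds in the notes
```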
Activity Symptoms
· Scan rate (sr), free rate (fr)
- Progress made by pagedaemon
· Pageouts (po)
- Page kicked out of memory pool, file write
· Pagein (pi)
- Page fault, filled from text/swap, file read
· Reclaim (re)
- Waiting to go to disk, brought back
· Attach (at)
- Found page already in cache (shared libraries)
If you see the scan rate (sr) and the free rate (fr) about equal, this means the
virtual memory system is releasing pages as fast as it's scanning them. Most
probably, the least-recently used algorithm has degenerated into "last scanned",
meaning that tuning the handspread or the scan rates may improve the page
selection process.
VM Problems
· Basic indicator: scanning and freeing
- Page in/out could be filesystem activity
· Swapping
- Large memory processes thrashing?
· Attaches/reclaims
- open/read/close loops on same file
· Kernel memory exhaustion
- sar -k 1 to observe
- lotsfree too close to minfree
- Will drop packets or cause malloc() failures
Chris Drake and Kimberley Brown's "Panic!" is a great reference, including a
host of kernel monitoring and sampling scripts.
Other Tunables
· maxpgio
- # swap disks * 40 (Solaris <= 2.6)
- # swap disks * 60 (Solaris == 9)
- # swap disks * 90 (Solaris >= 10)
· maxslp
- Solaris < 2.6
  · Deadwood timer: 20 seconds
  · Set to 0xff to disable pre-emptive swapping
- Solaris >= 2.6
  · swap out processes sleeping for more than maxslp seconds (20) if avefree < desfree
Tuning these values produces the best returns for your effort.
maxpgio (assumes one operation per revolution * 2/3)
# swap disks * 40 for 3,600 RPM disks
# swap disks * 80 for 7,200 RPM disks
# swap disks * 110 for 10,000 RPM disks
# swap disks * 167 for 15,000 RPM disks
[ 2/3 of the revolutions/second]
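The rule of thumb above (2/3 of one operation per revolution, per swap disk) can be written out directly; the small differences from the slide's 110/167 figures are just rounding:

```python
# maxpgio rule of thumb: swap disks * revolutions/sec * 2/3

def maxpgio(swap_disks, rpm):
    return int(swap_disks * (rpm / 60) * 2 / 3)

for rpm in (3600, 7200, 10000, 15000):
    print(rpm, maxpgio(1, rpm))
# 3600 -> 40, 7200 -> 80, 10000 -> 111 (~110), 15000 -> 166 (~167)
```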
maxslp added meaning between Solaris 2.5.1 and 2.6, it is also used as the
amount of time that a process must be swapped out before being considered a
candidate to be swapped back in, in low memory conditions.
VM Diagnostics
· Add memory for fileservers
- Improve file cache hit rate
- Calculate expected/physical I/O rates
· Add memory for DBMS servers
- Configure DBMS to use it in cache
- Watch DBMS statistics for use/thrashing
- 100-300M is typical high water mark
· Add memory to eliminate scanning
Memory Mapped Files
· mmap() maps open file into address space
- Replaces open(), malloc(), read() cycles
- Improves memory profile for read-only data
- Used for text segments and shared data segments
· Mapped files pages to underlying filesystem
- Text segments paged from NFS server?
- Data files paged over network from server?
· When network performance matters...
- Use shared memory segments, paged locally
NFS-mounted executables produce sometimes unwanted effects due to the way
mmap() works over the network. When you start a Unix process (in SunOS 4.x,
or any SystemV.4/Solaris system), the executable is mapped into memory using
mmap() -- not copied into memory as in earlier BSD days. Once the executable
pages are loaded, you won't notice much difference, but if you free the pages
containing the text segment (due to paging/swapping), you're going to re-read
the data over the wire, not from the local swap device.
New VM System (Solaris >= 8)
· Page scanner is a bottleneck for the future
- new hardware supports > 512GB of memory (16M-64M pages to scan!)
· File system pressure on the VM
- high filesystem load depletes free memory list
- resulting high scan rates makes applications suffer from excessive page steals
- A server with heavy I/O pages against itself!
· Priority paging (new scanner) is not enough
· Cyclic Page Cache is the current answer
- separate pool for regular file pages
- fs flush daemon becomes fs cache daemon
Disk I/O
Section 2.3
Disk Activity
· Paging and swapping
- Memory shortfalls
· Database requests
- Lookups, log writes, index operations
· Fileserver activity
- Read, read-ahead, write requests
Disk Problems
· Unbalanced activity
- "Hot spot" contention
· Unnecessary activity
- Hit disk instead of memory
· Disks and networks are sources of greatest
gains in tuning
Diagnosing Disk Problems
· iostat -D: disk ops/second
  % iostat -D 5
          sd0              sd1
  rps wps util     rps wps util
    8   0 22.0      40   0 90.0
- Look for excessive number of ops/disk
- Unbalanced across disks?
· iostat -x: service time (svc_t)
- Long service times (>100 msec) imply queues
- Similar to disk imbalance
- Could be disk overload (20-40 msec)
The typical seek/rotational delays on a disk are 8-15 msec. A typical transfer
takes about 20 msec. If the disk service times are consistently around 20 msec,
the disk is almost always busy. When the service times go over 20 msec, it
means that requests are piling up on the spindle: an average service time of 50
msec means that the queue is about 2.5 requests (50/20) long.
Note that for low I/O volumes, the service times are likely to be inaccurate and
on the high side. Use the service times as a thermometer for disk activity when
you're seeing a steady 10 I/O operations (iops) per second or more.
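Given the ~20 msec per-operation figure from these notes, an observed service time translates into a rough queue-length estimate:

```python
# Rough queue length behind a spindle, assuming each physical operation
# takes about 20 msec (seek + rotation + transfer).

def queue_length(svc_t_ms, per_op_ms=20):
    return svc_t_ms / per_op_ms

print(queue_length(50))   # 2.5 requests, as in the 50-msec example above
print(queue_length(200))  # 10.0 -- a badly overloaded spindle
```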
Disk Basics
· Physical things
· Disk performance
- sequential transfer rate
  · 5-40 MBytes/s
  · Theoretical max: nsect * 512 * rpm / 60
- 50-100 operations/s random access
- 6-12 msec seek, 3-6 msec rotational delay
- Track-to-track versus long seeks
· Seek/rotational delays
- Access inefficiencies
While nsect * 512 * rpm tells you how fast the spinning disk platter can deliver
data, it's not completely accurate for the zone-bit recorded (ZBR) disks that are
common today. ZBR SCSI disks only fudge the nsect value in the disk
description, providing an average number of sectors per cylinder. In reality, the
first 70% of the disk is quite fast and the last 30% has a lower transfer rate.
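The slide's theoretical maximum (nsect * 512 * rpm / 60) is easy to evaluate; the 400 sectors/track used below is a made-up average for illustration, since ZBR disks only report an averaged nsect:

```python
# Theoretical peak media rate: sectors/track * 512 bytes * revolutions/sec.

def peak_transfer_mb_s(nsect, rpm):
    return nsect * 512 * (rpm / 60) / 1e6   # decimal MB/s

# Hypothetical 10,000 RPM disk averaging 400 sectors/track:
print(round(peak_transfer_mb_s(400, 10000), 1))  # 34.1
```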
SCSI Bus Basics
· SCSI 1 (5MHz clock rate)
- 8-bit, 16-bit (wide), or 32-bit (fat)
- Synchronous operation yields 5 Mbyte/sec
· SCSI 2 - Fast (10MHz clock rate)
- 10 Mbytes/s with 8-bit bus
- 20 Mbytes/s with 16-bit (wide) bus
· Ultra (20MHz clock rate)
- Ultra/wide = 40MB/sec
· Ultra 2 (40MHz clock rate)
· Ultra 3 (80MHz clock rate)
If devices from different standards exist on the same SCSI bus then the clock rate
of all devices is the clock rate of the slowest device.
Ultra 3 is sometimes called Ultra 160.
SCSI Cabling Basics
· Single Ended
- 6m for SCSI 1
- 3m for SCSI 2
- 3m for Ultra up to 4 devices
- 1.5m for Ultra > 4 devices
· Differential
- 25m cabling
· Low Voltage Differential (LVD)
- 12m cabling
- used by Ultra 2 and 3
Differential signaling is used to suppress noise over long distances. If you ask a
friend to signal you with a lantern, it's easy to distinguish high (1) from low (0).
If the friend is now standing on a boat, which introduces noise (waves), it's
much harder to differentiate high and low. Instead, give your friend two
lanterns, and define "high" as "lanterns apart" and "low" as "lanterns together".
The noise affects both lanterns, but measuring the difference between them
removes the noise from the resulting signal.
If Single Ended and LVD exist on the same bus then the cabling lengths are the
minimum of the two.
Fibre Channel and iSCSI
· Industry standard at the frame level
- FC-AL: fiber channel arbitrated loop
- 100 Mbytes/sec typical
- Use switches and daisy chains to build storage networks
· Vendors layer SCSI protocol on top
- SCSI disk physics still apply
- But you can pack a lot of disks on the fiber
· Ditto iSCSI over GigE
The I/O Bottleneck
· When can't a 72GB disk hold a 500MB DB?
- When you need more than 100 I/Os per second
· How do you get > 40MByte/s file access?
- Gang disks together to "add" transfer rates
· Key info nugget #1: Access pattern
- Sequential or random, read-only or read-write
· Key info nugget #2: Access size
- 2K-8K DBMS, but varies widely
- 8K NFS v2, 32K NFS v3
- 4K-64K filesystem
Realize that when you're bound by random I/O rates, you're not moving that
much data -- the bottleneck is the physical disk arm moving across the disk
surface.
At 100 I/O operations/sec, and 8 KBytes/operations, a SCSI disk moves only
800 KBytes/sec at maximum random I/O load.
The same disk will source 40 MBytes/sec in sequential access mode, where the
disk speed and interface are the limiting factors.
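The contrast in these notes is worth making concrete (using the 100 ops/sec and 40 MB/s figures above):

```python
# Random vs. sequential throughput for the same disk.

ops_per_sec = 100    # random I/O limit for one spindle
io_size_kb = 8       # 8 KB per operation

random_kb_s = ops_per_sec * io_size_kb   # 800 KB/s at full random load
sequential_kb_s = 40 * 1024              # 40 MB/s when streaming

print(random_kb_s)                       # 800
print(sequential_kb_s // random_kb_s)    # 51 -- sequential is ~50x faster
```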
Disk Striping
· Combine multiple disks into single logical disk
with new properties
- Better transfer rate
- Better average seek time
- Large capacity
· Terminology
- Block size: chunk of data on each disk in stripe
- Interleave: number of disks in stripe
- Stripe size: block size * interleave
Volume Management
· Striping done at physical (raw) level
- Run raw access processes on stripe (DBMS)
- Can build filesystem on volume, after striping
- Host (SW) or disk array (HW) solutions
· Some DBMSs do striping internally
· Bottleneck: multiple writes
- Stripe over multiple controllers, SCSI busses
Striping For Sequential I/O
· Each request hits all disks in parallel
· Add transfer rates to "lock heads"
· Block size = access size/interleave
· Examples:
- 64K filesystem access, 4 disks, 16K/disk
- 8K filesystem access, 8 disks, 1K/disk
· Can get 3-3.5x single disk
- On a 4-6 way stripe
Striping For Random I/O
· Each request should hit a different disk
· Random requests use all disks
- Force scattering of I/O
· Reduce average seek time with "independent
heads"
· Block size = access size
· Examples:
- 8K NFS access on 6 disks, 48K stripe size
- 2K DBMS access on 4 disks, 8K stripe size
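The two sizing rules (sequential on the previous slide, random on this one) can be sketched side by side; the examples reproduce the ones on the slides:

```python
# Stripe sizing rules of thumb from the slides.

def sequential_block_size(access_kb, ndisks):
    # Sequential: every request spans all disks, so
    # block size = access size / interleave.
    return access_kb / ndisks

def random_stripe_size(access_kb, ndisks):
    # Random: each request should land on one disk, so block size =
    # access size and stripe size = block size * interleave.
    return access_kb * ndisks

print(sequential_block_size(64, 4))  # 16.0 KB per disk
print(random_stripe_size(8, 6))      # 48 KB stripe for 8K NFS on 6 disks
```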
Transaction Modeling
· Types: read, write, modify, insert
· Meta data structure impact
- Filesystem structures: inodes, cylinder groups, indirect blocks
- Logs and indexes for DBMS
Insert operation is R-M-W on index, W on data, W on log
Insert/update on DBMS touches data, index, log
Cache Effects
· Not every logical write I/O hits disk
- DB write clustering
- NFS, UFS dirty page clustering
- Hardware arrays may cache operations
· Reads can be cached
- DB page/block cache (Oracle SGA, e.g.)
- File/data caching in memory
· Locality of reference
- Cache can help or hurt performance
Simple DBMS Example
· Medium sized database on a busy day
- 200 users, 8 Gbyte database, 1 request/10 sec
- 50% updates, 20% inserts, 30% lookups, 4 tables, 1 index on each
· Disk I/O rate calculation
- .5 * 4/U + .2 * 3/I + .3 * 2/L = 3.2 I/O per table
- 12.8 I/O per transaction, ~10 with caching?
· Arrival rate
- 200 users * 1 / 10 secs = 20/sec
- Demand: 200 I/Os/sec, peak to 220 or more
The sample disk I/O rates are derived as follows:
Updates have to do a read, an update to an index and an update to a data block,
as well as a log write (4 transactions)
Inserts do an index and data block write, and a log write (3 transactions)
Lookups read from the index and data blocks (2 transactions)
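The slide's arithmetic, reproduced step by step:

```python
# Disk I/O demand for the sample database (mix and figures from the slide).

per_table = 0.5 * 4 + 0.2 * 3 + 0.3 * 2   # updates, inserts, lookups
per_txn = per_table * 4                    # 4 tables touched
arrival = 200 * (1 / 10)                   # 200 users, 1 request/10 sec

print(round(per_table, 1))        # 3.2 I/Os per table
print(round(per_txn, 1))          # 12.8 I/Os per transaction
print(round(per_txn * arrival))   # 256 raw; ~200/sec if caching trims
                                  # each transaction to ~10 I/Os
```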
Haste Needs Waste
· Using a single disk is a disaster
- Disk can only do 50-60 ops/s, response time ~10 sec
· 4 disks barely do the job
- Provides 200-240 I/Os/sec
- DBMS uses 90% of I/O rate capacity
· 6 disks would be better
- Waste most of the available space
Filesystem Optimization
Section 2.4
UNIX Filesystem
· Filesystem construction
- Each file identified by inode
- Inode holds permissions, modification/access times
- Points to 12 direct (data) blocks and indirect blocks
· Indirect block contains block pointers to data
blocks
· Double indirect blocks contain pointers to
blocks that contain pointers to data blocks
UFS Inode
[Diagram: inode (mode, time, owners, etc.) pointing to 12 direct data blocks, to an indirect block of 2048 slots of data-block pointers, and to a double indirect block whose 2048 slots each point to an indirect block of 2048 data-block slots]
- Direct blocks up to 100 KBytes
- Indirect blocks up to 100 MBytes
- Double indirect blocks up to 1 TByte
Filesystem Mechanics
· Inodes are small and of fixed size
- Fast access, easy to find
· File writes flushed every 30 seconds
- Sync or update daemon
- UNIX writes are asynchronous to process
- Watch for large bursts of disk activity
· Filesystem metadata
- Create redundancy for repair after crash
- Cylinder groups, free lists, inode pointers
- fsck: scan every block for "rollback"
The fact that write() doesn't complete synchronously can cause bizarre failures.
Most code doesn't check the value of errno after a close(), but it should. Any
outstanding writes are completed synchronously when close() is called.
If any errors occurred during those writes, the error is reported back through
close(). This can cause a variety of problems when quotas are exceeded or disks
fill up (over NFS, where the server notices the disk full condition).
More details: SunWorld Online, System Administration, October 1995
http://www.sun.com/sunworldonline
The BSD Fast Filesystem
· Original UNIX filesystem
- All inodes at the beginning of the disk
- open() followed by read() always seeks
· BSD FFS improvements
- Cylinder groups keep inodes and data together
- Block placement strategy minimizes rotational delays
- Inode/cylinder group ratio governs file density
· Minfree: default 10%, safe to use 1% on 1+G
disks
McKusick, Leffler, Quaterman and Karels, "Design and Implementation of the
4.3 BSD Operating System"
mkfs and newfs always look at the # bytes per inode parameter (fixed). To
change the inode density, you need to change the number of cylinders in a
group by adjusting the number of sectors/track:
Filesystems for large files (like CAD parts files) usually have more bytes per
inode; filesystems for newsgroups should have fewer bytes per inode (with the
exception of the filesystem for alt.binaries.*)
Fragmentation & Seeks
· Fragments occur in last block of file
- Frequently less than 1% internal fragmentation
- 10% free space reduces external fragmentation
· Block placement strategy breaks down
- Avoid filling disk to > 90-95% of capacity
- Introduces rotational delays
· File ordering affects performance
- Seeking across large disk for related files
Large Files
· Reading
- Read inode, indirect block, double indirect block, data block
- Sequential access should do read-ahead
· Writing
- Update inode, (double) indirect, data blocks
- Can be up to 4 read-modify-write operations
· Large buffer sizes are more efficient
- Single access for "window" of metadata
Turbocharging Tricks
· Striping
· Journaling (logging)
- Write meta data updates to log, like DBMS
- Eliminate fsck, simply replay log
- Ideal for synchronous writes, large files
- logging option (Solaris >= 7)
· Extents
- Cluster blocks together and do read-ahead
- Eliminate more rotational delays
- Can add 2-3x performance improvement
McVoy and Kleiman, "Extent-like Performance From The UNIX Filesystem",
Winter USENIX Proceedings, 1991.
Linux also has the ext2 filesystem, which has different block allocation and
placement policies.
Journaling and logging are often used interchangeably. Logging filesystems and
log-based filesystems, however, are not the same thing. A logging filesystem
bolts a log device onto the UNIX filesystem to accelerate writes and recovery. A
log-based filesystem is a new (non-BSD FFS) structure, based on a log of write
records. There is a long and exacting description of the differences in Margo
Seltzer's PhD thesis from UC-Berkeley.
Access Patterns
· Watch actual processes at work
- What are they doing?
  · nfswatch: NFS operations on the wire
  · truss (Solaris, SysV.4), strace (Linux, HPUX), ktrace (*BSD)
  · dtrace (Solaris >= 10)
· Application write size should match filesystem
block size.
· Use a Filesystem benchmark
- Are the disks well balanced, is the filesystem well tuned?
  · filebench, bonnie
More details on using these tools: SunWorld Online, System Administration,
September 1995
http://www.sun.com/sunworldonline
Don't use process tracing for performance-sensitive issues, because turning on
system call trapping (used by the strace/truss facility) slows the process down
to a snail's pace.
Solaris DTrace (Solaris >= 10) is more lightweight.
Bonnie (http://www.textuality.com/bonnie) is a good all-round Unix
filesystem benchmark tool.
Filebench is an extensible system that simulates many different types of workloads:
http://sourceforge.net/projects/filebench/
http://www.opensolaris.org/os/community/performance/filebench/
Resource Optimization
· Optimize disk volumes by type of work
- Sequential versus random access filesystems
- Read-only versus read-write data
· Eliminate synchronous writes
- File locking or semaphores more efficient
- Journaling filesystem faster
· Watch use of symbolic links
- Often causes disk read to get link target
· Don't update the file access time for read-only
volumes
Don't update the file access time (for news and mail spools, etc.)
-o noatime
Delay updating file access time (Solaris >= 9)
-o dfratime
Non-Volatile Memory
· Battery backed memory
- RAM in disk array controller (array cache)
- disk cache
· Synchronous writes at memory write speed
Inode Cache
· Inode cache for metadata only
- Data blocks cached in VM or buffer pool
· Buffer pool for inode transit
  · vmstat -b
  · sar -b 5 10
- Watch %rcache (read cache) hit rate
- Lower rate means more disk I/O for inodes
· Set high water mark
  · set bufhwm=8000 in Solaris /etc/system
Directory Name Lookup Cache
· Name to inode mapping cache
· Must be large for file/user server
- Low hit rate causes disk I/O to read directories
· vmstat -S to observe
- Aim for > 90% hit rate
· Causes of low hit rates:
- File creation automatically misses
- Names > {14,32} characters not inserted
  · Long names not efficient
Solaris >= 2.6
- uses the ncsize parameter to set the DNLC size.
- handles long filenames in the DNLC
Solaris >= 8
- can use the kstat -n dnlcstats command to determine how well the DNLC is doing
Filesystem Replication
· Replicate popular read-only data
- Automounter or "workgroups" to segregate access
- Define update and distribution policies
· 200 coders chasing 4 class libraries
- Replicate libraries to increase bandwidth
· Hard to synchronize writeable data
- Similar to DBMS 2-phase commit problem
- Andrew filesystem (AFS)
- Stratus/ISIS rNFS, Uniq UPFS from Veritas
The ISIS Reliable NFS product is now owned by Stratus Computer,
Marlborough MA
Uniq Consulting Corp has a similar product that does N-way mirroring of NFS
volumes. Contact Kevin Sheehan at [email protected], or your local Veritas
sales rep, since Veritas is now reselling (and supporting) this product
Depth vs. Breadth
· Avoid large files if possible
- Break large files into smaller chunks
- Don't backup a 200M file for a 3-byte change
- Files > 100M require multiple I/Os per operation
· Directory search is linear
- Avoid broad directories
· Name lookup is per-component
- Avoid deep directories
- Use hash tables if required
Tuning Flush Rates
· Dirty buffers flushed every 30 seconds
- Causes large disk I/O burst
- May overload single disk
· Balance load if requests < 30s apart
- Generic update daemon:
  while :; do sync; sync; sleep 15; done
· Solaris tunables
- autoup: time to cover all memory
- tune_t_fsflushr: rate to flush
autoup is the oldest a dirty buffer can get before it is flushed. tune_t_fsflushr is
the rate at which the sync daemon is run; it defaults to 30 seconds.
All of memory will be covered in autoup seconds.
tune_t_fsflushr/autoup is the fraction of memory covered by each pass of the update daemon.
Increase autoup, or cut the flush rate, to space out the bursts
Extremely large disk service times (in excess of 100 msec) can be caused by large
bursts from the flush daemon causing a long disk queue. If the filesystem flush
sends 20 requests to a single disk, it's likely there will be some seeking between
writes, so the 20 requests will average 20 msec each to complete. Since all disk
requests are scheduled in a single pass by fsflush, the service time for the last
one will be nearly 400 msec, while the first few will finish in around 20 msec,
yielding an average service time of 200 msec!
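The 200 msec figure follows from a simple model of the burst (20 queued writes at roughly 20 msec each, served in order):

```python
# Average service time for a burst of flush writes queued on one disk.

per_op_ms = 20   # per-write cost, including some seeking
burst = 20       # requests issued in one fsflush pass

# The i-th request completes after i operations' worth of time.
completion_ms = [per_op_ms * i for i in range(1, burst + 1)]

print(completion_ms[-1])           # 400 -- the last request's wait
print(sum(completion_ms) / burst)  # 210.0 -- the ~200 msec average
```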
Backups & Redundancy
Section 2.5
Questions of Integrity
· Backups are total loss insurance
- Lose a disk
- Lose a brain: egregious rm *
· Disk integrity is inter-backup insurance
- Preserve data from high-update environment
- Time to restore backup is unacceptable
- Doesn't help with intra-day deletes
· Disaster recovery is a separate field
Disk Redundancy
· Snapshots
- Copy data to another disk or machine
- tar, dump, rdist, rsync
- Prone to failure, network load problems
· Disk mirroring (RAID 1)
- Highest level of reliability and cost
- Some small performance gains
· RAID arrays (RAID 5 and others)
- Cost/performance issues
- VLDB byte capacity
RAID = Redundant Array of Inexpensive Disks.
When the RAID levels were created (at UC-Berkeley), the popular disk format
was SMD (as in Storage Modular Device, not Surface Mounted Device).
10" platters weighed nearly 100 pounds and held 500 MB, while SCSI disks
topped out at 70 MB but cost significantly less (and were easier to lift and install)
RAID 1: Mirrored Disks
· 100% data redundancy
- Safest, most reliable
- Historically rejected due to disk count, cost
· Best performance (of all RAID types)
- Round-robin or geometric reads: like striping
- Writes at 5-10% hit
· Combine mirroring and striping
- Stripe mirrors (1+0) to survive interleave failures
- Mirror stripes (0+1) for safety with minimal overhead
RAID 0 = striping
Few systems can do 1+0
1+0 allows multi-disk failures as long as at least one mirror disk per stripe
survives.
RAID 5: Parity Disks
· Stripe parity and data over disks
· No single "parity hot spot"
· Performance degrades with more writes
- R-M-W on parity disk cuts 60%
- Similar to single-disk for reads
· Ideal for large DSS/DW databases
- If size >> performance, RAID 5 wins
- Best path to large, safe disk farm
· 20-40% cost savings
RAID 5 Tuning
· Tunables
- Array width (interleave) - sometimes
- Block size - required
· Count parity operations in I/O demand
- Read = 1 I/O
- Write = 4 I/O
· Ensure parity data is not a bottleneck
- Parity disk reads and writes count against the same (total) 50-60 I/Os/second per-spindle limit
RAID 5 write:
- read original block
- read parity block
- xor original block with parity block
- xor new block with parity block
- write new block
- write parity block
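Counting parity operations as the slide prescribes (read = 1 physical I/O, write = 4), the back-end demand on the array is easy to estimate; the 60/20 mix below is illustrative:

```python
# RAID 5 back-end demand: each logical write becomes 4 physical I/Os
# (read data, read parity, write data, write parity).

def raid5_physical_iops(reads_per_sec, writes_per_sec):
    return reads_per_sec * 1 + writes_per_sec * 4

# An illustrative 60 reads/sec + 20 writes/sec of logical load:
print(raid5_physical_iops(60, 20))  # 140 physical I/Os/sec on the array
```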
Backup Performance
· Derive rough transfer rate needs
- 100 GB/hour = 30 MB/second
- 5MB/s for DLT, 10MB/s for SDLT
- 15MB/s for LTO, 35MB/s for LTO-2, 60MB/s for LTO-3
- 6MB/s for AIT, 24MB/s for AIT-4
- 80MB/sec over quiet Ethernet (GigE)
· Multiple devices increase transfer rate
- Stackers grow volume
- Drives increase bulk transfer rate
- Careful of “shoe shining”
When designing the backup system, also take into consideration the length of
time you must keep the data around. Some industries, such as financial
services, require at least a 7 year history of data for SEC or arbitration hearings.
Drug companies and health-care firms must keep patient data near-line until the
patient dies; if a drug pre-dated a patient by 5 years then you're looking at the
better part of a century.
Media types in vogue today decay. Magnetic media loses its bits; CD-ROMs
may decay after a long storage period. How will you read your backups in the
future? If you've struggled with 1600bpi tapes lately you know the feeling of
having data in your hand that's not convertible into on-line form.
Final warning: dump isn't portable! If you change vendors, make sure you can
dump and reload your data.
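Deriving the rough transfer-rate need is the same arithmetic as the slide's 100 GB/hour example; the drive speed used below is the LTO figure listed above:

```python
import math

# Aggregate transfer rate needed to finish a backup within a window.

def required_mb_s(total_gb, window_hours):
    return total_gb * 1024 / (window_hours * 3600)

need = required_mb_s(100, 1)   # the slide's 100 GB/hour case
print(round(need, 1))          # 28.4 -- i.e. the "30 MB/second" rule of thumb
print(math.ceil(need / 15))    # 2 LTO drives at 15 MB/s to keep up
```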
Backup to Disk
· Rdiff-backup
- incremental backups to disk with easy restore
· BackupPC
- incremental and full backups to disk with a web front end for scheduling and restores
- good for backing up MS Windows clients to a Unix server
· Snapshots
· Offsite replicas
Does this sound familiar?
- Secure the individual systems
- Run aggressive password checkers
- Restrict NFS, or use NFS with Kerberos or DCE/DFS to encrypt file access
- Prevent network logins in the clear (use ssh)
- BUT: do backups over the network! Exposing the data over the network
during the backup un-does much of the effort in the other precautions.
NFS Performance Tuning
Section 3
Topics
· NFS internals
· Diagnosing server problems
· Client improvements
- Client-side caching & tuning
- NFS over WANs
NFS Internals
Section 3.1
NFS Request Execution
stat()
getattr()
nfs_getattr()
kernel RPC
port 2049 hardcoded
nfsd
getattr()
UFS stat()
HSFS stat()
% ls -l
NFS Characterization
· Small and large operations
- Small: getattr, lookup, readdir
- Large: read/write, create, readdir
· Response time matters
- Clamped at 50 msec for "reasonable" server
- Users notice 20 msec to 50 msec dropoff
· Scalability is still a concern
- Usually network limited, hard to reach capacity
- Flat response time is best measure
- Client-side demand management
NFS over TCP
· NFS/TCP is a win for:
- Wide-area networks, with higher bit error rates
- Routed networks
- Data-transfer oriented environments
- Large MTU networks, like GigE with jumbo frames
· Advantages
- Better error recovery, without complete retransmit
- Fewer retransmissions and duplicate requests
· Disadvantage
- Connection setup at mount time
NFS Version 3
· Improved cache consistency
- Attributes returned with most calls
- "access" RPC mimics permission checking of local system open() call
· Improved correctness with NFS/TCP
· Performance enhancements
- Asynchronous write operations, with logging
- Larger buffer sizes, up to 32KBytes
NFS v3 uses a 64-byte (not bit) file handle, with the actual size used per mount
negotiated between the client and server.
Diagnosing Server Problems
Section 3.2
Indicators
· Usual server tuning applies
- Don't worry about CPU utilization
- Client response time is early warning system
- Some NFS specific details
· Server isn't always the limiting factor
- Typical Ethernet supports 300-350 LADDIS ops
- To get 2,000 LADDIS: 7-8 Ethernets
LADDIS stands for Legato, Auspex, Digital, Data General, Interphase and Sun,
the 6 companies that helped produce the SPEC standard for NFS benchmarks.
LADDIS is now formally known as SPEC NFS and is reported as a number of
ops/sec, at 50 msec response time or less.
More info: [email protected]
Keith, Bruce. LADDIS: The Next Generation in NFS Server Benchmarking. spec
newsletter. March 1993. Volume 5, Issue 1.
Watson, Andy, et.al. LADDIS: A Multi-Vendor and Vendor-Neutral SPEC
NFS Benchmark. Proceedings of the LISA VI Conference, October 1992. pp. 17-
32.
Wittle, Mark, and Bruce Keith. LADDIS: The Next Generation in NFS File
Server Benchmarking. Proceedings of the Summer 1993 USENIX Conference.
July 1993. pp. 111-128.
Request Queue Depth
· nfsd daemons/threads
- One request per nfsd daemon
- Lack of nfsds makes server drop requests
- May show up as UDP socket overflows (netstat -s)
· Guidelines
- Daemons: 24-32 per server, more for many disks
- Kernel threads (Solaris): 500-2000
  · No penalty for setting this high
- Add more for browsing environment
Attribute Hammering
· Use nfsstat -s to view server statistics
· getattr > 40%
- Increase client attribute cache lifetime
- Consolidate read-only filesystems
· readlink > 5%
- Replace links with mount points
· writes > 5%
- NVRAM situation
% nfsstat -s
null getattr setattr root lookup readlink read
32 0% 527178 33% 9288 0% 0 0% 449726 28% 189466 12% 188665 15%
wrcache write create remove rename link symlink
0 0% 134797 8% 13799 0% 15826 1% 2725 0% 4388 0% 74 0%
mkdir rmdir readdir fsstat
1575 0% 1532 0% 23898 1% 242 0%
On an NFS V3 client, you'll see entries for cached writes, access calls, and other
extended RPC types.
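The 40% rule can be checked mechanically; a minimal sketch using the counts from the sample output above (total is the sum of all operation counts shown):

```shell
#!/bin/sh
# getattr share of total NFS server calls, from the nfsstat -s sample
getattr=527178
total=1563211          # sum of all operation counts in the sample
pct=$(( getattr * 100 / total ))
if [ "$pct" -gt 40 ]; then
    echo "attribute hammering (${pct}%): raise client actimeo"
else
    echo "getattr at ${pct}% - below the 40% threshold"
fi
```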
Transfer Oriented Environments
· Ensure adequate CPU
- 1 CPU per 3-4 100BaseT Ethernets
- 1 CPU per 1.5 ATM networks at 155 Mb/s
- 1 CPU per 1000BaseT Ethernet (GigE)
· Disk balancing is critical
- Optimize for random I/O workload
· Large memory may not help
- What is working set/file lifecycle?
Client Improvements
Section 3.3
Client Tuning Overview
· Eliminate end to end problems
- Request timeouts are call to action
- 700 msec timeout versus 50 msec "pain" level
· Reduce demand with improved caching
· Adjust for line speed
- Slower-than-Ethernet links
- Uncontrollable congestion
- Routers or multiple hops
· Application tuning rules apply
Client Retransmission (UDP only)
· Unanswered RPC request is retransmitted
- Repeated forever for hard mounts
- Up to 5 times (retrans) for soft mounts
· What can go wrong?
- Severe congestion (storms)
- Server dropping requests/packets
- Network losing requests or sections of them
- One lost packet kills entire request
Measuring Client Performance
· Client-side performance is what user sees
· nfsstat -m OK for NFS over UDP
- Shows average service time for lookup, read and write requests
· iostat -n with extended service times
· NFS over TCP harder to measure
- Stream-oriented, difficult to match requests and replies
- tcpdump, snoop to match XIDs in NFS header; wireshark (formerly ethereal) does this
Client Impatience (NFS over UDP only)
· Use nfsstat -rc
· timers > 0
- Server slower than expected
- nfsstat -m: expected response time
  request sent: calls++
  120 msec: timers++
  700 msec: retrans++
  1400 msec: retrans++
% nfsstat -rc
Client rpc:
calls badcalls retrans badxid timeout wait newcred timers
224978 487 64 263 549 0 0 696
% nfsstat -m
/home/thud from thud:/home/thud (Addr 192.151.245.13)
Lookups: srtt = 7 (17 ms), dev=4 (20ms), cur=2 (40ms)
Reads: srtt=14 (35 ms), dev=3 (15ms), cur=3 (60ms)
Note that the NFS backoff and retransmit scheme is not used for NFS over TCP,
since TCP's congestion control and restart algorithms properly fit the connection
oriented model of TCP traffic. The NFS mechanism is used for UDP mounts,
and the timers used for adjusting the buffer sizes and transmit intervals are
shown with nfsstat -m. On an NFS/TCP client, the timers will be zero.
· badcalls > 0
· Soft NFS mount failures
· Operation interrupted (application failure)
· Data loss or application failures
· Should never see these
Client's Network View (NFS over UDP only)
· retrans > 5%
- Requests not reaching server or not serviced
· badxid close to 0
- Network is dropping requests
- Reduce rsize, wsize on mount
· badxid > 0
- Duplicate request cache isn't consolidating retransmissions
- Tune server, partition network
Using NFS/TCP or NFS Version 3, you'll be hard-pressed to see badxid counts
above zero. Using TCP, the NFS client doesn't have to retransmit the whole
request, only the part that was lost to the server. As a result, there should rarely
be completely retransmitted requests. NFS V3 implementations also tend to be
more "correct" than V2 implementations, since fewer requests that are not
actually idempotent (like rmdir or remove) are retransmitted.
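The retrans/badxid decision rules above can be condensed into a small function; a sketch, fed the counters from the earlier nfsstat -rc sample:

```shell
#!/bin/sh
# Classify client RPC trouble from nfsstat -rc counters.
diagnose() {
    calls=$1; retrans=$2; badxid=$3
    pct=$(( retrans * 100 / calls ))
    if [ "$pct" -le 5 ]; then
        echo "retrans rate ${pct}% - OK"
    elif [ "$badxid" -eq 0 ]; then
        echo "network dropping requests - reduce rsize/wsize"
    else
        echo "server slow - tune server or partition network"
    fi
}
diagnose 224978 64 263   # counters from the sample above
```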
Client Caches
· Caching critical for performance
- If data exists, don't go over the wire
- Dealing with stale data
· Cached items
- Data pages in memory: default
- Data pages on disk: eNFS, CacheFS
- File attributes: in memory
- Directory attributes: in memory
- DNLC: local name lookup cache
Attribute Caching
· getattr requests can be > 40% of total
- May hit server disk
· Read-only filesystems
- Increase actimeo to 600 or more
- "Slow start" when data really changes
· Rapidly changing filesystem (mail)
- Try noac for no caching
· File locking disables attribute and data caching
When a file is locked on the client system, that client begins to read and write
the file without any buffering. If your application calls
read(fd, buf, 128);
you'll read exactly 128 bytes over the wire from the NFS server, bypassing the
attribute cache and the local memory cache to be sure you fetch the latest copy
of the data from the server.
If file locking and strict ordering of writes are an issue, consider using a
database.
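The actimeo and noac suggestions translate into mount options like these (a sketch only — server names and paths are hypothetical, and option syntax varies slightly by OS):

```shell
# read-mostly software tree: cache attributes for 10 minutes
mount -o ro,actimeo=600 server:/export/tools /tools

# rapidly changing mail spool: disable attribute caching entirely
mount -o noac server:/var/mail /var/mail
```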
CacheFS Tips
· Read-mostly, fixed size working set
- Re-use files after loading into cache
- Write-once files are worst case
- Growing or sparse working set causes thrashing
· Watch size of cache using df
· Multi-homed hosts
- CacheFS creates one cache per host name
- Make client bindings persistent, not random
- Penalty for cold cache less than that for no server
Using CacheFS solves the page-over-the-network problem where a process' text
segment is paged from the NFS server, not from a local disk. When using large
executables (some CAD applications, FORTRAN with many common blocks),
CacheFS may improve paging performance by keeping traffic on the local host.
Buffer Sizing
· Default of 8KB good for Ethernet speeds
- At 56 Kb/s requires > 1 second to transmit
- Remarkably anti-social behavior
- Even worse for NFSv3 (32KB packets)
· Reduce read/write sizes on slow links
- In vfstab, automounter: rsize=1024,wsize=2048
- Match to line speed and other uses
- 256 bytes is the lower limit; readdir breaks with a smaller buffer
Line Speed      rsize       Per-Packet Latency   Time to Read 1 kbyte File
56 kbaud        128 bytes   20 msec              430 msec
56 kbaud        256 bytes   40 msec              310 msec
224 kbaud       256 bytes   10 msec              150 msec
T1 (1.5 Mbit)   1024 bytes  1 msec               42 msec
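The "> 1 second at 56 Kb/s" claim is easy to check: wire time is just buffer bits over line rate. A sketch (latency and protocol overhead ignored):

```shell
#!/bin/sh
# Milliseconds to clock a buffer of `bytes` onto a link of `bits_per_sec`.
wire_ms() {
    bytes=$1; bps=$2
    echo $(( bytes * 8 * 1000 / bps ))
}
wire_ms 8192 56000      # default 8 KB buffer at 56 kbit/s: well over 1 sec
wire_ms 1024 56000      # rsize=1024 on the same link
wire_ms 8192 10000000   # 8 KB on 10 Mbit/s Ethernet
```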
Network Design and Capacity
Planning
Section 4
Topics
· Network protocol operation
· Naming services
· Network topology
· Network latency
· Routing architecture
Network reliability colors end to end performance. If your network is delaying
traffic or losing packets, or if you suffer multiple network hops each with long
latency, you will impact what the user sees. The worst possible example is the
Internet: you get variable response time depending upon how many people are
downloading images, what current events have users flocking to the major sites,
and the time of day/day of the week.
Network Protocol
Operation
Section 4.1
Up & Down Protocol Stacks
· Send side -- write() on socket
- TCP: slow start, segmentation
- IP: locate route/interface, MTU fragmentation
- IP: find MAC address (ARP: get IP mapping, update cache)
- Eth: send packet (collision, backoff/re-xmit)
· Receive side -- read() on socket
- Eth: accept frame
- IP: match local IP? IP re-assembly
- TCP/UDP: valid port? copy into kernel
· Also in play: ICMP, RIP updates to the route tables
Solaris exposes nearly every tunable parameter in the TCP, UDP, IP, ICMP and
ARP protocols using the ndd tool.
Find a description of the tunable parameters and their upper/lower bounds on
Richard Stevens's web page containing the Appendix to his latest TCP/IP books:
http://www.kohala.com/start/tcpipiv1.appe.update1.ps
Also on docs.sun.com at http://docs.sun.com/app/docs/doc/806-4015?q=tunable+parameters
Solaris 2 - Tuning Your TCP/IP Stack and More
http://www.sean.de/Solaris
Naming Services
Section 4.2
Round-Robin DNS
· Use several servers, in parallel, that have unique
IP addresses
- DNS will return all of the IP addresses in response to queries for www.blahblah.com
· Clients resolving the name get the IP addresses
in round-robin fashion
- When DNS cache entry times out, new one is requested
- Clients will wait up to DNS entry lifetime for a retry
Be sure to set the DNS server entry's Time To Live (TTL) to zero or a few
seconds, so that successive requests for the IP address of the named host get
fresh DNS entries.
Name Servers that do Round-Robin:
- BIND 8
- djbdns
- lbnamed (true load balancer written in perl)
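As a toy illustration of what resolvers end up seeing, round-robin selection is just modular rotation over the address list (the addresses below are from the TEST-NET range and purely hypothetical):

```shell
#!/bin/sh
# Rotate through a fixed server list the way round-robin DNS hands
# addresses to successive resolvers.
servers="192.0.2.1 192.0.2.2 192.0.2.3"
pick() {
    i=$1
    set -- $servers        # positional params become the server list
    shift $(( i % $# ))
    echo "$1"
}
pick 0; pick 1; pick 2
pick 3    # wraps around to the first server
```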
Round-Robin DNS, cont'd
· The Good
- No real failure management, it "just works"
- Scales very well; just add hardware and mix
- Only 1/N clients affected, on average, for an N-server farm (when one server fails)
· The Bad
- Clients can see minutes of "downtime" as DNS entries expire, if TTL is too long
- Can cheat with multiple A records per host, but not all clients sort them correctly
· The Ugly
- None, if done correctly
IP Redirection

[Diagram: clients resolve www.blah.com to the IP director's public address,
192.9.230.1. The director (shown duplicated for redundancy) forwards requests
over internal networks 192.9.231.0 and 192.9.232.0 to four web servers at
x.x.x.1 through x.x.x.4.]
IP Redirection Mechanics
· Front-end boxes handle IP address resolution
- Public IP address shows up in DNS maps
- Internal (private) IP networks used to distribute load
- Can have multiple networks, with multiple servers
· Improvement over DNS load balancing
- All round-robin choices made at redirector, so client's DNS configuration (or caches) don't matter
- Redirector can be made redundant
- Hosts could be redundant, too
· Cisco NetDirector, Hydra HydraWeb
Network Topology
Section 4.3
Switch Trunking (802.3ad)
· Multiple connections to host from single switch
· Improves input and output bandwidth
- Eliminates impedance mismatch between switch and network connection
- Spread out input load on server side
· Warnings:
- Trunk can be a SPOF
- Assumes switch can handle mux/demux of traffic at peak loads
Solaris requires the SUNWtrku package (Sun Trunking Software)
Latency and Collisions
· Collisions
- CSMA/CD works "late", node backs off, tries again
- Fills Ethernet with malformed frames
· Defers
- CSMA/CD works "early"
- Not counted, but adds to latency
- Collisions "become" defers as more nodes share load
- Use netstat -k (Solaris >= 2.4) or kstat (Solaris >= 7) to see defers and other errors
· 802.11
Dealing With Collisions
· Rate = collisions/output packets
· Collisions counted on transmit only
- Monitor on several hosts, especially busy ones
- Use netstat -i or LANalyzer to observe
- Collision rate can exceed 100% per host
· Thresholds
- Should decrease with number of nodes on net
- >5% is clear warning sign
- Usually 1% is a problem
- Correlate to network bandwidth utilization
Most Ethernet chip drivers understate the collision rate. In addition to only
counting collisions in which the station was an active participant, the chip may
report 0, 1 or "more than 1" collisions. Most driver implementations take "more
than 1" to mean 2, when in fact it could be up to 16 consecutive collisions.
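The rate formula above in shell form; the counters are made-up stand-ins for the Collis and Opkts columns of netstat -i:

```shell
#!/bin/sh
# Collision rate = collisions / output packets (sample counters assumed).
collis=5120
opkts=98000
rate=$(( collis * 100 / opkts ))
echo "collision rate: ${rate}%"
if [ "$rate" -ge 5 ]; then
    echo "warning: clear sign the network needs partitioning"
fi
```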
Collisions and Switches
· Switched Ethernet cannot have collisions (*)
- Each device talks to switch independently
- No shared media worries
· Still get contention at switch under load
- Ability of switch to forward packets to right interface for output
- Ability to handle input under high loads
· Look for dropped/lost packets on switch
- Results in NFS retransmission, RPC failure, NIS timeouts, dismal TCP throughput
Collisions, Really Now
· Full versus Half Duplex
- Full Duplex: each node has a home run and no contention for either path to/from switch
- Half Duplex: you can still see collisions, in rare cases
· What makes switch-host collide?
- Many small packets, in steady streams
- Large segments probably are OK
Switches and Routers
· Bridges, Switches
- Very low latency, single IP network or VLAN
- One input pipe per server
· Routers
- Higher latency, load dependent
- Multiple pipes per server
Switched Architectures
· Switches offer "home run" wiring
- Each station has dedicated, bidirectional port
- Reduce contention for media (collisions = 0)
- Construct virtual LANs on switch, if needed
- "Smooth out" variations in load
- Only broadcast & multicast normally cross between network segments
· Watch for impedance mismatch at switch
- 80 clients @ 100 Mb/s swamps a 2 Gb/s backplane
Network Partitioning
· Switches & bridges for physical partitioning
- Corral traffic on each side of bridge
- Primary goal: reduce contention
· Routing for protocol isolation
- Non-IP traffic (NetWare)
- Broadcast isolation (NetBIOS, vocal applications)
- Non-trusted traffic (use a firewall, too)
- VLAN capability on switches for creating geographically difficult wiring schemes
Network Latency
Section 4.4
Trickle Of Data?
· Serious fragmentation at router or host
· TCP retransmission interval too short
· Real-live network loading problem
· Handshakes not completing quickly
- Nagle algorithm (slow start)
- PCs often get this wrong
- Set tcp_slow_start_initial=2 to send two segments, not just one: dramatically improves web server performance from PC's view
- tcp_slow_start_after_idle=2 as well
"inhibit the sending of new TCP segments when new outgoing data arrives from the user if any previously transmitted data on the connection remains unacknowledged." - John Nagle (RFC 896)
Longer & Fatter Pipes
· Fat networks (ATM, GigE, 10GigE)
- Benefit versus cost trade-offs
- Backbone or desktop connections?
· Longer networks (WAN)
- Guaranteed capacity, grade of service?
- End to end security and integrity?
· Latency versus throughput
- Still 20 msec coast to coast
- GigE jumbo frames >> Ethernet in latency, loses for small packets
Long Fat Networks
[Diagram: a T1 link with 40 msec one-way latency. The sender clocks 4 KB of
data onto the wire in 3 msec, then waits 70+ msec to send more, producing gaps
in the transmit stream. The receiver sees the first bits after 40 msec and the
last bit after 43 msec; it sees gaps in the data and acks as fast as it can.]
Bad TCP/IP implementations retransmit too much: they see the high latency as an
indication that the packet didn't arrive, and retransmit it. Their retransmit
timers are too small.
Tuning For LFNs
· Set the sender and receiver buffer size high
water marks
- Usually an ndd or kernel option, but resist temptation to make "global fix"
- Set using setsockopt() in application to avoid running out of kernel memory
· Buffer depth = 2 * bandwidth * delay product
- or bandwidth * RTT (ping)
- 1.54 Mbit/sec network (T1) with 25 msec delay = 10 KB buffer
- 155 Mbit/sec network (OC3) with 25 msec delay = 1 MB buffer
Solaris:
# increase max tcp window (maximum socket buffer size)
# max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152
# increase default SO_SNDBUF/SO_RCVBUF size.
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536
Linux (>= 2.4):
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem
See http://www-didc.lbl.gov/tcp-wan.html and
http://www.psc.edu/networking/perf_tune.html for a longer explanation.
A list of tools to help determine the bandwidth of a link can be found at
http://www.caida.org/tools/taxonomy/.
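The buffer-depth arithmetic works out as follows (a sketch; the 50 msec figure is the RTT, i.e. twice the 25 msec one-way delay):

```shell
#!/bin/sh
# Socket buffer size (bytes) from bandwidth * RTT.
bdp_bytes() {
    bps=$1; rtt_ms=$2
    echo $(( bps / 8 * rtt_ms / 1000 ))
}
bdp_bytes 1540000 50     # T1: ~10 KB
bdp_bytes 155000000 50   # OC3: ~1 MB
```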
Routing Architecture
Section 4.5
IP Routing
· IP is a "gas station" protocol
- Knows how to find next hop
- Makes best effort to deliver packets
· Kernel maintains routing tables
- route command adds entries
- So does routed
- Dynamic updates: ICMP redirects
What Goes Wrong?
· Unstable route tables (lies)
- Machines have wrong netmask or broadcast addresses
- Servers route by accident (multiple interfaces)
· Incorrect or missing routes
- Lost packets (nfs_server: bad sendreply)
· Asymmetrical routes
- Performance skews for in/outbound traffic
RIP Updates
· Routers send RIP packets every 30 seconds
- Each router increases the cost metric (cap of 15)
- Active/passive gateway notations
- /etc/gateways to seed behavior
· Default routes
- Chosen when no host or network route matches
- May produce ICMP redirects
- /etc/defaultrouter has initial value
Routing Architecture
· Default router or dynamic discovery
- One router or several?
- Dynamic recovery
- RDISC (RFC 1256)
· Multiple default routers
· Recovery time
- Function of network radix
Tips & Tricks
· Watch for IP routing on servers
- netstat -s shows IP statistics
- Consumes server CPU, network input bandwidth
· Name service dependencies
- Broken routing affects name service
- If netstat -r hangs, try netstat -rn
ICMP Redirects
· Packet forwarded over interface on which it
arrived
- ICMP redirect sent to transmitting host
- Sender should update routing tables
· Impact on default routes
- Implies a better choice is available
· Ignore or "fade" on host if incorrect:
  ndd -set /dev/ip ip_ignore_redirect 1
  ndd -set /dev/ip ip_ire_redirect_interval 15000
· Turn off to appease non-listeners:
  ndd -set /dev/ip ip_send_redirects 0
MTU Discovery
· Sending frames larger than a hop's MTU makes routers do fragmentation work
- Increases latency
- Do work on send side if you know MTU
· RFC 1191 - MTU discovery
- Send packet with "don't fragment" bit set
- Router returns ICMP error if too big
- Repeat with smaller frame size
- Disable with: ndd -set /dev/ip ip_path_mtu_discovery 0
This RFC, like all others, may be found in one of the RFC repositories:
www.rfc-editor.org, www.ietf.org, www.faqs.org/rfcs
ARP Cache Management
· ARP cache maintains IP:MAC mappings
· May want to discard quickly
- Mobile IP addresses with multiple hardware addresses, or DHCP with rapid re-attachment
- Network equipment reboots
- HA failover when MAC address doesn't move
· Combined route/ARP entries at IP level:
  ndd -set /dev/ip ip_ire_cleanup_interval 30000
  ndd -set /dev/ip ip_ire_flush_interval 90000
· Local net ARP entries explicitly aged:
  ndd -set /dev/arp arp_cleanup_interval 60000
See SunWorld Online, February and April 1997
Application Architecture
Appendix A
Topics
· System programming
· Network programming & tuning
· Memory management
· Real-time design & data management
· Reliable computing
System Programming
Section A.1
What Can Go Wrong?
· Poor use of system calls
· Polling I/O
· Locking/semaphore operations
· Inefficient memory allocation or leaks
System Call Costs
· System calls are traps: serviced like page faults
· Easily abused with small buffer sizes
· Example
- read() and write() on pipe
Buffer size   sy    cs    us  sy  id
10 KBytes     271   41    4   12  84
1 KBytes      595   319   5   33  62
1 Byte        3733  2178  11  89  0
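The buffer-size effect is mostly arithmetic: halving the buffer doubles the trap count. A sketch of the read() calls needed to move 10 MB:

```shell
#!/bin/sh
# read() calls required to copy 10 MB at various buffer sizes.
total=$(( 10 * 1024 * 1024 ))
for bs in 1 1024 10240; do
    echo "bs=${bs}: $(( total / bs )) calls"
done
```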
Using strace/truss
· Shows system calls and return values
- Locate calls that make process hang
- Debug permission problems
- Determine dynamic system call usage:
  % strace ls /
  lstat("/", 0xf77ffbb8) = 0
  open("/", 0, 0) = 3
  brk(0xf210) = 0
  fcntl(3, 02, 0x1) = 0
  getdents(3, 0x9268, 8192) = 716
Using strace or truss greatly slows a process down. You're effectively putting a
kernel trap on every system call.
Collating Results
tracestat:
#!/bin/sh
awk '{
if ( $1 == "-" )
print $2
else
print $1
}' | sort | uniq -c
% strace xprocess | tracestat
13 close
87 getitimer
2957 ioctl
13 open
228 read
117 setitimer
582 sigblock
Synchronous Writes
· write() system call waits until disk is done
- Often 20 msec or more disk latency
- Reduces buffering/increases disk traffic
· Caused by
- Explicit flag in open()
- sync/update operation, or NFS write
- Closing file with outstanding writes (news spool)
· Typical usage
- Ensuring data delivery to disk, for strict ordering
- Side-effects
close(2) is synchronous:
- waits for pending write(2)'s to complete
- fails if quota exceeded (EDQUOT) or filesystem full (ENOSPC)
Check the return value!
Eliminating Sync Writes
· NFS v3 or async mode in NFS v2
· Use file locking or semaphores
- Application ensures order of operations, not filesystem
- Better solution for multiple writers of same file
· Avoid open()-write()-close() loops
- Use syslog-like process for logging events
- Use database for preferences, history, environment
Network Programming
Section A.2
TCP/IP Buffering
· Segment sizes negotiated at connection
- Receiver advertises buffer up to 64K (48K)
- Sender can buffer more/less data
· Determine ideal buffer size on per-application
basis
- Global changes are harmful, can consume resources
- setsockopt(..SO_RCVBUF..), setsockopt(..SO_SNDBUF..)
The global parameters on Solaris systems are set via ndd(1M):
tcp_xmit_hiwat, udp_xmit_hiwat for sending buffers
tcp_recv_hiwat, udp_recv_hiwat for receiving buffers
Global parameters in /sys/netinet/in_proto.c for BSD systems are:
tcp_sendspace, udp_sendspace
tcp_recvspace, udp_recvspace
TCP Transmit Optimization
· Small packets buffered on output
- Nagle algorithm buffers 2nd write until 1st is acknowledged
- Will delay up to 50 msec; disable with setsockopt(..TCP_NODELAY..)
· Retransmit timer for PPP/dial-up nets
- tcp_rexmit_interval_min: default of 100, up to 1500
- tcp_rexmit_interval_initial: default of 200, up to 2500
Connection Management
· High request volume floods connection queue
- BSD had implied limit of 5 connections
- Now tunable in most implementations
· Connection requires 3 host-host trips
- Client sends request to server
- Server sends packet to client
- Client completes connection
· Longer network latencies (Internet) require
deeper queue
Connections, Part II
· Change listen(5) to listen(20) or more
- 20-32000+ ideal for popular services like httpd
  ndd -set /dev/tcp tcp_conn_req_max 100
· Socket addresses live on for 2 * MSL
- Database process crashes and restarts
- Can't bind to pre-defined address
- setsockopt(..SO_REUSEADDR..) doesn't help
· Decrease management timers
- tcp_keepalive_interval (msec)
- tcp_close_wait_interval (msec) [Solaris < 2.6]
- tcp_time_wait_interval (msec) [Solaris 2.6+]
Determine the average backlog using a simple queuing theory formula (Little's
law): average queue length = service time * arrival rate
With an arrival rate of 150 requests a second, and a round trip handshake time
of 100 msec, you'll need a queue 15 requests deep. Note that 100 msec is just
about the latency of a packet from New York to California and back again.
Once you've fixed the connection and timeout problems, make sure you aren't
running out of file descriptors for key processes like inetd.
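The notes' queueing formula in executable form, using the numbers given there:

```shell
#!/bin/sh
# Average backlog = arrival rate * service time (Little's law).
arrivals=150        # connection requests per second
handshake_ms=100    # round-trip handshake time, msec
backlog=$(( arrivals * handshake_ms / 1000 ))
echo "listen queue depth needed: $backlog"
```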
Memory Management
Section A.3
Address Space Layout
· Static areas: text, data
· Initialized data, including globals
· Uninitialized data (BSS)
· Growth
- Stack: grows down from top
- mmap: grows down from below stack limit
- Heap: grows up from top of BSS
Stack Management
· Local variables, parameters go on stack
· Don't put large data structures on stack
- Use malloc()
- Can thrash register-window overlap (SPARC)
Dynamic Memory Management
· malloc() and friends, free()
- Library calls on top of brk()
- Don't mix brk() and malloc()
- free() never shrinks heap, SZ is high-water mark
· Cell management
- malloc() puts cell size at beginning of block
- Allocates more than size requested
· Time- or space-optimized variants
- Try standard cell sizes
Typical Problems
· Memory leaks: SZ grows monotonically
· Address space fragmentation: MMU thrashing
· Data stride
- Access size matches cache size
- Repeatedly use 1 cache line
- Fix: move globals, resize arrays
· Use mmap() for sparsely accessed files
- More efficient than reading entire file into memory
mmap() or Shared Memory?
· Memory mapped files:
- Process coordination through file name
- Backed by filesystem, including NFS
- No swap space usage
- Writes may cause large page flush, better for reading
· Shared memory
- More set-up and coordination work with keys
- Backed by swap, not filesystem
- Need to explicitly write to disk
Real-Time Design
Section A.4
Why Worry About Real-Time?
· New computing problems
- Customer service with live transfer
- Real-time expectations of customers
- Web-based access
- If a user's in front of it, it's real time
· Predictable response times
- High volume transaction environment
- Threaded/concurrent programming models
· Things to learn from device drivers
System V Real-Time Features
· Real-time scheduling class
- Kernel pre-emption, including system calls
- Process promotion to avoid blocking chains
· No real-time network or filesystem code
· Resource allocation and "nail down"
- mlock(), plock() to lock memory/processes
· Move process into real-time class with priocntl
Real-Time Methodology
· Processes run for short periods
- Same model used by Windows
- Must allow scheduler to run: sleep or wait
- CPU-bound jobs will lock system
· Time quanta inversely proportional to priority
- Minimize latency to schedule key jobs
- Ship low-priority work to low-priority thread
· No filesystem/network dependencies
Summary
Parting Shots, Part 1
· Be greedy
- Solve for biggest gains first
- Don't over-tune or over-analyze
· Don't trust too much
- 3rd party code, libraries, blanket statements
- Verify information given to you by users
· Bottlenecks are converted
- Add network pipes, reduce latency, hurt server
- Fixing one problem creates 3 new ones
- Some speedups are superlinear
Parting Shots, Part 2
· Today's hot technology is tomorrow's capacity
headache
- Web browser caches, PCN, streaming video
- But taxing use leads to insurrection
· Rules change with each new release
- New features, new algorithms
- RFC compliance is creative art
· Nobody thanks you for being pro-active
- But you should be!