


HP Training

Student guide

HP-UX Performance and Tuning
H4262S C.00


© Copyright 2004 Hewlett-Packard Development Company, L.P.

The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

This is an HP copyrighted work that may not be reproduced without the written permission of HP. You may not use these materials to deliver training to any person outside of your organization without the written permission of HP.

UNIX® is a registered trademark of the Open Group.

Printed in the USA

HP-UX Performance and Tuning
Student guide
May 2004


Contents

http://education.hp.com H4262S C.00 2004 Hewlett-Packard Development Company, L.P.


Contents Overview    1

Module 1 Introduction

1–1. SLIDE: Welcome to HP-UX Performance and Tuning    1-2
1–2. SLIDE: Course Outline    1-3
1–3. SLIDE: System Performance    1-4
1–4. SLIDE: Areas of Performance Problems    1-6
1–5. SLIDE: Performance Bottlenecks    1-8
1–6. SLIDE: Baseline    1-10
1–7. SLIDE: Queuing Theory of Performance    1-12
1–8. SLIDE: How Long Is the Line?    1-13
1–9. SLIDE: Example of Queuing Theory    1-14
1–10. SLIDE: Summary    1-16
1–11. LAB: Establishing a Baseline    1-17
1–12. LAB: Verifying the Performance Queuing Theory    1-19

Module 2 Performance Tools

2–1. SLIDE: HP-UX Performance Tools    2-2
2–2. SLIDE: HP-UX Performance Tools (Continued)    2-3
2–3. SLIDE: Sources of Tools    2-4
2–4. SLIDE: Types of Tools    2-6
2–5. SLIDE: Criteria for Comparing the Tools    2-8
2–6. SLIDE: Data Sources    2-10
2–7. SLIDE: Performance Monitoring Tools (Standard UNIX)    2-11
2–8. TEXT PAGE: iostat    2-12
2–9. TEXT PAGE: ps    2-14
2–10. TEXT PAGE: sar    2-16
2–11. TEXT PAGE: time, timex    2-18
2–12. TEXT PAGE: top    2-19
2–13. TEXT PAGE: uptime, w    2-21
2–14. TEXT PAGE: vmstat    2-22
2–15. SLIDE: Performance Monitoring Tools (HP-Specific)    2-25
2–16. TEXT PAGE: glance    2-26
2–17. TEXT PAGE: gpm    2-28
2–18. TEXT PAGE: xload    2-30
2–19. SLIDE: Data Collection Performance Tools (Standard UNIX)    2-31
2–20. TEXT PAGE: acct Programs    2-32
2–21. TEXT PAGE: sar    2-34
2–22. SLIDE: Data Collection Performance Tools (HP-Specific)    2-36
2–23. TEXT PAGE: MeasureWare/OVPA and DSI Software    2-37
2–24. TEXT PAGE: PerfView/OVPM    2-39
2–25. SLIDE: Network Performance Tools (Standard UNIX)    2-41
2–26. TEXT PAGE: netstat    2-42
2–27. TEXT PAGE: nfsstat    2-44
2–28. TEXT PAGE: ping    2-46
2–29. SLIDE: Network Performance Tools (HP-Specific)    2-48
2–30. TEXT PAGE: lanadmin    2-49


2–31. TEXT PAGE: lanscan    2-51
2–32. TEXT PAGE: nettune (HP-UX 10.x Only)    2-53
2–33. TEXT PAGE: ndd (HP-UX 11.x Only)    2-55
2–34. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only)    2-57
2–35. SLIDE: Performance Administrative Tools (Standard UNIX)    2-58
2–36. TEXT PAGE: ipcs, ipcrm    2-59
2–37. TEXT PAGE: nice, renice    2-61
2–38. SLIDE: Performance Administrative Tools (HP-Specific)    2-63
2–39. TEXT PAGE: getprivgrp, setprivgrp    2-64
2–40. TEXT PAGE: rtprio    2-66
2–41. TEXT PAGE: rtsched    2-67
2–42. TEXT PAGE: scsictl    2-69
2–43. TEXT PAGE: serialize    2-71
2–44. TEXT PAGE: fsadm    2-72
2–45. TEXT PAGE: getext, setext    2-74
2–46. TEXT PAGE: newfs, tunefs, vxtunefs    2-75
2–47. TEXT PAGE: Process Resource Manager (PRM)    2-77
2–48. TEXT PAGE: Work Load Manager (WLM)    2-78
2–49. TEXT PAGE: Web Quality of Service — WebQoS    2-79
2–50. SLIDE: System Configuration and Utilization Information (Standard UNIX)    2-80
2–51. TEXT PAGE: bdf, df    2-81
2–52. TEXT PAGE: mount    2-83
2–53. SLIDE: System Configuration and Utilization Information (HP-Specific)    2-84
2–54. TEXT PAGE: diskinfo    2-85
2–55. TEXT PAGE: dmesg    2-86
2–56. TEXT PAGE: ioscan    2-88
2–57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay    2-90
2–58. TEXT PAGE: swapinfo    2-92
2–59. TEXT PAGE: sysdef    2-93
2–60. TEXT PAGE: kmtune, kcweb    2-95
2–61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX)    2-96
2–62. TEXT PAGE: prof, gprof    2-97
2–63. TEXT PAGE: Application Response Measurement (ARM) Library Routines    2-98
2–64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific)    2-99
2–65. TEXT PAGE: Transaction Tracker    2-100
2–66. TEXT PAGE: caliper — HP Performance Analyzer    2-101
2–67. SLIDE: Summary    2-103
2–68. LAB: Performance Tools Lab    2-104

Module 3 GlancePlus

3–1. SLIDE: This Is GlancePlus    3-2
3–2. SLIDE: GlancePlus Pak Overview    3-4
3–3. SLIDE: gpm and glance    3-6
3–4. SLIDE: glance — The Character Mode Interface    3-8
3–5. SLIDE: Looking at a glance Screen    3-11
3–6. SLIDE: gpm — The Graphical User Interface    3-13
3–7. SLIDE: Process Information    3-15
3–8. SLIDE: Adviser Components    3-17
3–9. SLIDE: adviser Bottleneck Syntax Example    3-18
3–10. SLIDE: The parm File    3-19


3–11. SLIDE: GlancePlus Data Flow    3-21
3–12. SLIDE: Key GlancePlus Usage Tips    3-23
3–13. SLIDE: Global, Application, and Process Data    3-24
3–14. SLIDE: Can't Solve What's Not a Problem    3-25
3–15. SLIDE: Metrics: "No Answers without Data"    3-26
3–16. SLIDE: Summary    3-27
3–17. SLIDE: HP GlancePlus Guided Tour    3-28
3–18. LAB: gpm and glance Walk-Through    3-29

Module 4 Process Management

4–1. SLIDE: The HP-UX Operating System    4-2
4–2. SLIDE: Virtual Address Process Space (PA-RISC)    4-4
4–3. SLIDE: Virtual Address Process Space (IA-64)    4-6
4–4. SLIDE: Physical Process Components    4-7
4–5. SLIDE: The Life Cycle of a Process    4-9
4–6. SLIDE: Process States    4-11
4–7. SLIDE: CPU Scheduler    4-14
4–8. SLIDE: Context Switching    4-16
4–9. SLIDE: Priority Queues    4-17
4–10. SLIDE: Nice Values    4-19
4–11. SLIDE: Parent-Child Process Relationship    4-20
4–12. SLIDE: glance — Process List    4-21
4–13. SLIDE: glance — Individual Process    4-23
4–14. SLIDE: glance — Process Memory Regions    4-24
4–15. SLIDE: glance — Process Wait States    4-25
4–16. LAB: Process Management    4-26

Module 5 CPU Management

5–1. SLIDE: Processor Module    5-2
5–2. SLIDE: Symmetric Multiprocessing    5-4
5–3. SLIDE: Cell Module    5-5
5–4. SLIDE: Multi-Cell Processing    5-6
5–5. SLIDE: CPU Processor    5-8
5–6. SLIDE: CPU Cache    5-11
5–7. SLIDE: TLB Cache    5-12
5–8. SLIDE: TLB, Cache, and Memory    5-14
5–9. SLIDE: HP-UX — Performance Optimized Page Sizes    5-16
5–10. SLIDE: CPU — Metrics to Monitor Systemwide    5-19
5–11. SLIDE: CPU — Metrics to Monitor per Process    5-21
5–12. SLIDE: Activities that Utilize the CPU    5-23
5–13. SLIDE: glance — CPU Report    5-25
5–14. SLIDE: glance — CPU by Processor    5-26
5–15. SLIDE: glance — Individual Process    5-27
5–16. SLIDE: glance — Global System Calls    5-28
5–17. SLIDE: glance — System Calls by Process    5-29
5–18. SLIDE: sar Command    5-30
5–19. SLIDE: timex Command    5-32
5–20. SLIDE: Tuning a CPU-Bound System — Hardware Solutions    5-33
5–21. SLIDE: Tuning a CPU-Bound System — Software Solutions    5-35
5–22. SLIDE: CPU Utilization and MP Systems    5-36


5–23. SLIDE: Processor Affinity    5-37
5–24. LAB: CPU Utilization, System Calls, and Context Switches    5-38
5–25. LAB: Identifying CPU Bottlenecks    5-40

Module 6 Memory Management

6–1. SLIDE: Memory Management    6-2
6–2. SLIDE: Memory Management — Paging    6-4
6–3. SLIDE: Paging and Process Deactivation    6-5
6–4. SLIDE: The Buffer Cache    6-7
6–5. SLIDE: The syncer Daemon    6-9
6–6. SLIDE: IPC Memory Allocation    6-10
6–7. SLIDE: Memory Metrics to Monitor — Systemwide    6-12
6–8. SLIDE: Memory Metrics to Monitor — per Process    6-14
6–9. SLIDE: Memory Monitoring vmstat Output    6-16
6–10. SLIDE: Memory Monitoring glance — Memory Report    6-18
6–11. SLIDE: Memory Monitoring glance — Process List    6-19
6–12. SLIDE: Memory Monitoring glance — Individual Process    6-20
6–13. SLIDE: Memory Monitoring glance — System Tables    6-21
6–14. SLIDE: Tuning a Memory-Bound System — Hardware Solutions    6-23
6–15. SLIDE: Tuning a Memory-Bound System — Software Solutions    6-24
6–16. SLIDE: PA-RISC Access Control    6-26
6–17. SLIDE: The serialize Command    6-28
6–18. LAB: Memory Leaks    6-30

Module 7 Swap Space Performance

7–1. SLIDE: Swap Space Management — Simple View    7-2
7–2. SLIDE: Swap Space — After a New Process Executes    7-4
7–3. SLIDE: The swapinfo Command    7-5
7–4. SLIDE: Swap Space Management — Realistic View    7-7
7–5. SLIDE: Swap Space — After a New Process Executes    7-8
7–6. SLIDE: Swap Space — When Memory Equals Data Swapped    7-10
7–7. SLIDE: Swap Space — When Swap Space Fills Up    7-11
7–8. SLIDE: Pseudo Swap    7-12
7–9. SLIDE: Total Swap Space Calculation — with Pseudo Swap    7-14
7–10. SLIDE: Example Situation Using Pseudo Swap    7-16
7–11. SLIDE: Swap Priorities    7-17
7–12. SLIDE: Swap Chunks    7-18
7–13. SLIDE: Swap Space Parameters    7-19
7–14. SLIDE: Summary    7-21
7–15. LAB: Monitoring Swap Space    7-22

Module 8 Disk Performance Issues

8–1. SLIDE: Disk Overview    8-2
8–2. SLIDE: Disk I/O — Read Data Flow    8-4
8–3. SLIDE: Disk I/O — Write Data Flow (Synchronous)    8-6
8–4. SLIDE: Disk Metrics to Monitor — Systemwide    8-8
8–5. SLIDE: Disk Metrics to Monitor — Per Process    8-10
8–6. SLIDE: Activities that Create a Large Amount of Disk I/O    8-12
8–7. SLIDE: Disk I/O Monitoring sar -d Output    8-14
8–8. SLIDE: Disk I/O Monitoring sar -b Output    8-16
8–9. SLIDE: Disk I/O Monitoring glance — Disk Report    8-18


8–10. SLIDE: Disk I/O Monitoring glance — Disk Device I/O    8-19
8–11. SLIDE: Disk I/O Monitoring glance — Logical Volume I/O    8-20
8–12. SLIDE: Disk I/O Monitoring glance — System Calls per Process    8-21
8–13. SLIDE: Tuning a Disk I/O-Bound System — Hardware Solutions    8-22
8–14. SLIDE: Tuning a Disk I/O-Bound System — Perform Asynchronous Meta-data I/O    8-24
8–15. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Controllers    8-26
8–16. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Drives    8-28
8–17. SLIDE: Tuning a Disk I/O-Bound System — Tune Buffer Cache    8-30
8–18. LAB: Disk Performance Issues    8-33

Module 9 HFS File System Performance

9–1. SLIDE: HFS File System Overview    9-2
9–2. SLIDE: Inode Structure    9-5
9–3. SLIDE: Inode Data Block Pointers    9-6
9–4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd?    9-8
9–5. SLIDE: File System Blocks and Fragments    9-10
9–6. SLIDE: Creating a New File on a Full File System    9-13
9–7. SLIDE: HFS Metrics to Monitor — Systemwide    9-15
9–8. SLIDE: Activities that Create a Large Amount of File System I/O    9-17
9–9. SLIDE: HFS I/O Monitoring bdf Output    9-18
9–10. SLIDE: HFS I/O Monitoring glance — File System I/O    9-19
9–11. SLIDE: HFS I/O Monitoring glance — File Opens per Process    9-20
9–12. SLIDE: Tuning an HFS I/O-Bound System — Tune Configuration for Workload    9-22
9–13. SLIDE: Tuning an HFS I/O-Bound System — Use Fast Links    9-25
9–14. LAB: HFS Performance Issues    9-27

Module 10 VxFS Performance Issues

10–1. SLIDE: Objectives    10-2
10–2. SLIDE: JFS History and Version Review    10-5
10–3. SLIDE: JFS Extents    10-9
10–4. SLIDE: Extent Allocation Policies    10-11
10–5. SLIDE: JFS Intent Log    10-13
10–6. SLIDE: Intent Log Data Flow    10-16
10–7. SLIDE: Understand Your I/O Workload    10-18
10–8. SLIDE: Performance Parameters    10-20
10–9. SLIDE: Choosing a Block Size    10-21
10–10. SLIDE: Choosing an Intent Log Size    10-23
10–11. SLIDE: Intent Log Mount Options    10-25
10–12. SLIDE: Other JFS Mount Options    10-27
10–13. SLIDE: JFS Mount Option: mincache=direct    10-31
10–14. SLIDE: JFS Mount Option: mincache=tmpcache    10-33
10–15. SLIDE: Kernel Tunables    10-35
10–16. SLIDE: Fragmentation    10-37
10–17. TEXT PAGE: Monitoring and Repairing File Fragmentation    10-40
10–18. SLIDE: Using setext    10-50
10–19. SLIDE: I/O Tunable Parameters    10-52


10–20. SLIDE: vxtunefs Command for Tuning VxFS    10-54
10–21. SLIDE: /etc/vx/tunefstab Configuration    10-56
10–22. SLIDE: Taking Snapshots and Performance    10-58
10–23. LAB: JFS File System Tuning    10-60

Module 11 Network Performance

11–1. SLIDE: The OSI Model    11-2
11–2. SLIDE: NFS Read/Write Data Flow    11-4
11–3. SLIDE: NFS on HP-UX with UDP    11-6
11–4. SLIDE: NFS on HP-UX with TCP    11-7
11–5. SLIDE: biod on Client    11-9
11–6. SLIDE: TELNET    11-11
11–7. SLIDE: FTP    11-13
11–8. SLIDE: Metrics to Monitor — NFS    11-15
11–9. SLIDE: Metrics to Monitor — Network    11-18
11–10. SLIDE: Determining the NFS Workload    11-20
11–11. SLIDE: NFS Monitoring — nfsstat Output    11-23
11–12. SLIDE: Network Monitoring — lanadmin Output    11-28
11–13. SLIDE: Network Monitoring — netstat -i Output    11-31
11–14. SLIDE: glance — NFS Report    11-32
11–15. SLIDE: glance — NFS System Report    11-33
11–16. SLIDE: glance — Network by Interface Report    11-34
11–17. SLIDE: Tuning NFS    11-35
11–18. SLIDE: Tuning the Network    11-37
11–19. SLIDE: Tuning the Network (Continued)    11-39
11–20. LAB: Network Performance    11-41

Module 12 Tunable Kernel Parameters

12–1. SLIDE: Kernel Parameter Classes .................................................................12-2
12–2. SLIDE: Tuning the Kernel...............................................................................12-5
12–3. SLIDE: Kernel Parameter Categories............................................................12-8
12–4. SLIDE: File System Kernel Parameters ........................................................12-9
12–5. SLIDE: Message Queue Kernel Parameters ...............................................12-11
12–6. SLIDE: Semaphore Kernel Parameters.......................................................12-13
12–7. SLIDE: Shared Memory Kernel Parameters ...............................................12-15
12–8. SLIDE: Process-Related Kernel Parameters ..............................................12-17
12–9. SLIDE: Memory-Related Kernel Parameters..............................................12-19
12–10. SLIDE: LVM-Related Kernel Parameters ....................................................12-21
12–11. SLIDE: Networking-Related Kernel Parameters........................................12-22
12–12. SLIDE: Miscellaneous Kernel Parameters..................................................12-23

Module 13 Putting It All Together

13–1. SLIDE: Review of Bottleneck Characteristics .............................................13-2
13–2. SLIDE: Performance Monitoring Flowchart ................................................13-4
13–3. SLIDE: Review — Memory Bottlenecks.......................................................13-6
13–4. SLIDE: Correcting Memory Bottlenecks ......................................................13-7
13–5. SLIDE: Review — Disk Bottlenecks .............................................................13-8
13–6. SLIDE: Correcting Disk Bottlenecks.............................................................13-9
13–7. SLIDE: Review — CPU Bottlenecks ...........................................................13-11
13–8. SLIDE: Correcting CPU Bottlenecks...........................................................13-12
13–9. SLIDE: Final Review — Major Symptoms..................................................13-13


Appendix A — Applying GlancePlus Data

A–1. TEXT PAGE: Case Studies — Using GlancePlus ............................................. A-2

Solutions



Overview

Course Description

This course is intended to introduce students to the various aspects of monitoring and tuning their systems. Students will be taught how to monitor their systems: which tools to use, what symptoms to look for, and what remedial actions to take. The course also covers HP GlancePlus/Gpm and HP PerfRx. The course is designed to:

• Introduce the subject of performance and tuning.
• Describe how the system works.
• Identify what tools we can use to look at performance.
• Identify the symptoms we may encounter and what they indicate.

Course Goals

• To educate the students on HP-UX performance monitoring
• To enable them to identify bottlenecks and potential problems
• To learn the appropriate remedial actions to take

Student Performance Objectives

Module 1 — Introduction

• List characteristics of a system yielding good user response time.

• List characteristics of a system yielding high data throughput.

• List three generic areas most often analyzed for performance.

• List the four most common bottlenecks on a system.

Module 2 — Performance Tools

• Identify various performance tools available on HP-UX.

• Categorize each tool as either real time or data collection.

• List the major features of the performance tools.

• Compare and contrast the differences between the tools.

Module 3 — GlancePlus

• Compare GlancePlus with other performance monitoring/management tools.

• Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).


Module 4 — Process Management

• Describe the components of a process.

• Describe how a process executes, and identify its process states.

• Describe the CPU scheduler.

• Describe a context switch and the circumstances under which context switching occurs.

• Describe, in general, the HP-UX priority queues.

Module 5 — CPU Management

• Describe the components of the processor module.

• Describe how the TLB and CPU cache are used.

• List four CPU-related metrics.

• Identify how to monitor CPU activity.

• Discuss how best to use the performance tools to diagnose CPU problems.

• Specify appropriate corrections for CPU bottlenecks.

Module 6 — Memory Management

• Describe how the HP-UX operating system performs memory management.

• Describe the main performance issues that involve memory management.

• Describe the UNIX buffer cache.

• Describe the sync process.

• Identify the symptoms of a memory bottleneck.

• Identify global and process memory metrics.

• Use performance tools to diagnose memory problems.

• Specify appropriate corrections for memory bottlenecks.

• Describe the function of the serialize command.


Module 7 — Swap Space Performance

• Describe the difference between swap usage and swap reservation.

• Interpret the output of the swapinfo command.

• Define and configure pseudo swap.

• Define and configure swap space priorities.

• Define and configure swchunk and maxswapchunks.

Module 8 — Disk Performance Issues

• List three ways disk space can be used.

• List disk device files.

• Identify disk bottlenecks.

• Identify kernel system parameters.

Module 9 — File System Performance

• List three ways file systems are used.

• List basic file system data structures.

• Identify file system bottlenecks.

• Identify kernel system parameters.

Module 10 — VxFS Performance

• Understand JFS structure and version differences

• Explain how to enhance JFS performance

• Set block sizes to improve performance

• Set Intent-Log size and rules to improve performance

• Understand and manipulate synchronous and asynchronous IO

• Identify JFS tuning parameters

• Understand and control fragmentation issues

• Evaluate the overhead of online backup snapshots


Module 11 — NFS Performance

• List factors directly related to network performance.

• Describe how to determine network workloads (server and client).

• Evaluate UDP and TCP transport options.

• Identify a network bottleneck.

• List possible solutions for a network performance problem.

Module 12 — Tunable Kernel Parameters

• Identify which tunable parameters belong to which category

• Identify tunable kernel parameters that could impact performance

• Tune both static and dynamic tunable parameters

Module 13 — Putting It All Together

• Identify and characterize some network performance problems.

• List some useful tools for measuring network performance problems and state how they might be applied.

• Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.


Student Profile and Prerequisites

The student should be well versed in UNIX and able to perform the usual duties of a system administrator. Students should have completed HP-UX System and Network Administration I and HP-UX System and Network Administration II prior to attending this course, or have equivalent experience on another manufacturer's equipment.

NOTE: The course Inside HP-UX (H5081S) is not a formal prerequisite for attending HP-UX Performance and Tuning, but it should be considered a co-requisite training course for the serious HP-UX performance specialist. (The suggested order is Inside HP-UX, then HP-UX Performance and Tuning, but as the two courses have a synergistic relationship, the order is not absolute.)

Curriculum Path

Fundamentals of UNIX (H51434S)
            |
    +-------+------------------------------------+
    |                                            |
HP-UX System and Network            HP-UX Administration for the
Administration I (H3064S)           Experienced UNIX Administrator
    |                               (H5875S)
HP-UX System and Network                         |
Administration II (H3065S)                       |
    +-------+------------------------------------+
            |
Inside HP-UX (H5081S) — Recommended
            |
HP-UX Performance and Tuning (H4262S)


Agenda

The following agenda is only a guideline; the instructor may vary it if desired. The course will run until the afternoon of the fifth day. The last hour or so can be used to demonstrate more fully the performance offerings, such as HP PRM and HP PerfView.

Day 1    1 — Introduction
         2 — Performance Tools
Day 2    3 — GlancePlus
         4 — Process Management
         5 — CPU Management
Day 3    6 — Memory Management
         7 — Swap Space Performance
         8 — Disk Performance Issues
Day 4    9 — File System Performance
         10 — VxFS Performance
         11 — NFS Performance
Day 5    12 — Tunable Kernel Parameters
         13 — Putting It All Together


Module 1 Introduction

Objectives

Upon completion of this module, you will be able to do the following:

• List characteristics of a system yielding good user response time.

• List characteristics of a system yielding high data throughput.

• List three generic areas most often analyzed for performance.

• List the four most common bottlenecks on a system.


1–1. SLIDE: Welcome to HP-UX Performance and Tuning

Student Notes

Welcome to the HP-UX Performance and Tuning course. This course is designed to provide a high-level understanding of common performance problems and common bottlenecks found on an HP-UX system. The course uses HP performance tools to view current activity on the system. While many tools can be used to analyze this activity, the course primarily utilizes the glance tool, which is specifically tailored for HP-UX systems.

Welcome to HP-UX Performance and Tuning


1–2. SLIDE: Course Outline

Student Notes

Topics covered in this course include:

• System Internals — This module includes information on how the system components (CPU, memory, file systems, and network) function and interact with each other. Just as a mechanic cannot tune a car's engine without understanding how it works, a system administrator cannot tune system resources properly without a good understanding of how those resources work.

• Performance Tools — Many performance tools are available with HP-UX. Some tools come as standard equipment; other tools are additional add-on products. Some tools provide real-time monitoring; other tools perform data collection. We will review all of the tools.

• Specialty Areas — These modules cover areas of special interest to customers in particular types of environments. Three specialty areas are covered at a high level: NFS and networking, databases, and application profiling.

Course Outline

• Introduction to Performance
• Performance Tools – Overview
• GlancePlus
• Process Management
• CPU Management
• Memory Management
• Swap Space Performance Issues
• Disk and File System Performance Issues
• HFS Performance Issues
• VxFS Performance Issues
• Network Performance Issues
• Tuning the Kernel
• Putting It All Together – Performance Recap


1–3. SLIDE: System Performance

Student Notes

Different computer systems have different requirements. Some systems may need to provide quick response time; other systems may need to provide a high level of data throughput.

Response Time — User's Perspective

Response time is the time between the instant the user presses the return key or the mouse button and the receipt of a response from the computer. Users often use response time as a criterion of system performance. A system that yields good response time is typically not 100% utilized: there are often free CPU cycles, low utilization of the disk drives, and no swapping or paging. Because the system resources are not constantly in use, the resources are often available immediately when a user executes a task, yielding quick response time. Users want low utilization of resources in order to experience optimal response time.

System Performance

(Slide graphic: users on one side and management on the other, looking at the computer system; users are concerned with response time, management with system throughput.)


Throughput — IT or MIS Management Perspective

Throughput is the number of transactions accomplished in a fixed-time period. Management is often interested in how many compilations or how many reports they can generate in a specific amount of time. Many systems use benchmarks (like SPECmarks or TPC), which measure, in general, how many operations or transactions a system can perform per minute. A system that yields high throughput is typically 100% utilized. There are no free CPU cycles; there are always jobs in the CPU run queue; the disk drives are constantly being utilized; and there is often pressure on memory. Because the system resources are constantly in use, the amount of work produced typically yields good system throughput. Management wants high utilization of resources to maximize system performance.

Question

Is it possible to get both good response time and high system throughput?


1–4. SLIDE: Areas of Performance Problems

Student Notes

This slide shows a hierarchical view of a computer system. The base of a computer is its hardware. Built on top of the hardware is the operating system (that is, the operating system depends on the hardware in order to run). The application programs are built on top of the operating system (OS). All three of these areas can have performance problems.

Hardware

The hardware moves data within the computer system. If the hardware is slow, then, no matter how finely tuned the OS and applications are, the system will still be slow. Ultimately, the system is only as fast as the hardware can move the data. Items affecting the speed of the hardware include CPU clock speed, amount and speed of memory, type of disk controller (Fast/Wide SCSI or Single-Ended SCSI), and type of network card (FDDI or Ethernet).

Areas of Performance Problems

(Slide graphic: the application layer sits on top of the operating system, which sits on top of the hardware.)


Operating System

The operating system runs on top of the hardware. It controls how the hardware is utilized. The operating system decides which process runs on the CPU, how much memory to allocate for the buffer cache, whether I/O to the disks is performed synchronously or asynchronously, and so on. If the operating system is not configured properly, then the performance of the system will be poor. Items affecting how the operating system performs include process priorities and their nice values, the tunable OS parameters, the mount options used for file systems, and the configurations of network and swap devices.

Applications

The applications run on top of the operating system. The application programs include software such as database management systems, electronic design applications, and accounting-based applications. The performance of an application depends on the operating system and hardware, but it also depends on how the application is coded and how the application itself is configured. Items affecting the performance of the application include how the application data is laid out on the disk, how many users are currently using the application, and how efficiently the application uses the system's resources.

Question

In which of these three areas are most performance problems located?


1–5. SLIDE: Performance Bottlenecks

Student Notes

Poor performance often results because a given resource cannot handle the demand being placed upon it. When the demand for a resource exceeds the availability of the resource, a bottleneck exists for that resource. Common resource bottlenecks are:

CPU      A CPU bottleneck occurs when the number of processes wanting to execute is consistently more than the CPU can handle. Basic symptoms of a CPU bottleneck are high CPU utilization and, consistently, multiple jobs in the CPU run queue.

Memory   A memory bottleneck occurs when the total number of processes on the system will not all fit into memory at one time (that is, there are more processes than memory can hold). When this happens, pages of memory need to be copied out to the swap partition on disk to free space in memory. Basic symptoms of a memory bottleneck are high memory utilization and consistent I/O activity to the swap partition on disk.

Performance Bottlenecks

(Slide graphic: processes feed into the CPU run queue ahead of the CPU and into the disk I/O queue ahead of the disks, with memory and the network shown as the other contended resources.)

System Bottleneck Areas:
• CPU
• Memory
• Disk
• Network


Disk     A disk bottleneck occurs when the amount of I/O to a specific disk is more than the disk can handle. Basic symptoms of a disk bottleneck include high utilization of a disk drive and multiple I/O requests consistently in the disk I/O queue.

Network  A network bottleneck occurs when the amount of time needed to perform network-based transactions is consistently greater than expected. Basic symptoms of a network bottleneck include network collisions, network request timeouts, and packet retransmissions.
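The symptom patterns above lend themselves to a simple rule-of-thumb checker. The sketch below is illustrative only: the metric names and threshold values are assumptions made for this example, not documented HP-UX figures.

```python
# Toy rule-of-thumb classifier for the bottleneck symptoms described above.
# Metric names and thresholds are illustrative assumptions, not HP values.

def classify_bottlenecks(metrics):
    """Return the list of resource areas whose symptom pattern is present."""
    suspects = []
    if metrics["cpu_util_pct"] > 90 and metrics["run_queue"] > 3:
        suspects.append("CPU")      # high utilization + jobs queued for CPU
    if metrics["mem_util_pct"] > 95 and metrics["swap_io_per_sec"] > 0:
        suspects.append("Memory")   # memory pressure + swap disk activity
    if metrics["disk_util_pct"] > 80 and metrics["disk_queue"] > 2:
        suspects.append("Disk")     # busy disk + requests queued for it
    if metrics["retrans_per_sec"] > 10 or metrics["collisions_per_sec"] > 10:
        suspects.append("Network")  # retransmissions or collisions
    return suspects

snapshot = {
    "cpu_util_pct": 95, "run_queue": 6,
    "mem_util_pct": 60, "swap_io_per_sec": 0,
    "disk_util_pct": 85, "disk_queue": 3,
    "retrans_per_sec": 0, "collisions_per_sec": 0,
}
print(classify_bottlenecks(snapshot))   # CPU and disk symptom patterns present
```

The point of the sketch is that each bottleneck is identified by a pair of symptoms (high utilization plus a persistent queue), not by a single metric in isolation.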


1–6. SLIDE: Baseline

Student Notes

In order to quantify good versus poor performance, a customer needs to know the best possible response time for a given workload. The procedure for determining the best possible response time for a given workload is known as baselining. To establish the baseline (that is, the best possible response time) for a particular workload, the workload must be run when there is no other activity on the system. The intent is that when all resources are free, the workload will execute as quickly as possible, thereby yielding the best possible response time.

Once the baseline value is known, a relative measure is available for determining how poorly the workload is performing. For example, assume a baseline value of 5 seconds for the workload shown on the slide. When five users are on the system, the response time for the workload increases to 7 seconds. The relative comparison shows the workload taking 40% (or 2 seconds) more time when five users are on the system. We have just quantified the relative effect of having five users on the system for this particular workload.
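The relative comparison in the example above is simple arithmetic; as a one-function sketch (Python, purely illustrative):

```python
def slowdown_pct(baseline_secs, observed_secs):
    """Percentage increase in response time relative to the baseline."""
    return (observed_secs - baseline_secs) / baseline_secs * 100.0

# Baseline of 5 seconds; 7 seconds observed with five users on the system.
print(f"{slowdown_pct(5.0, 7.0):.0f}% slower than the baseline")
# 40% more time (2 extra seconds) for the same workload.
```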

Baseline

(Slide graphic: a bar chart of response time, rising from the best possible response time through the response times with five, ten, and fifteen users.)


The slide illustrates the typical behavior for a given workload: as more users concurrently utilize the system, the response time for the workload gets worse.

NOTE: In this class we will run baseline metrics using simplified "workload" simulation programs. Results will vary greatly with your applications.


1–7. SLIDE: Queuing Theory of Performance

Student Notes

The queuing theory of performance states that the average response time of a given resource is directly linked to the average utilization of that resource. The slide shows a baseline value of X seconds for a given resource. According to queuing theory, users experience this response time while the resource averages 0 to 25% utilization. When the average utilization of the resource reaches 75%, the average response time doubles. As the average utilization approaches 100%, the average response time quadruples. The bottom line: as the average utilization of the resource increases, the average response time gets worse and worse.

Why does the average response time become poor as the average utilization of a resource increases?
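One standard way to model this curve is the M/M/1 queuing formula, R = S / (1 - U), where S is the unloaded service time and U the average utilization. This is the textbook model, offered here as an illustration; the exact multipliers on the slide were drawn for teaching purposes and need not match it.

```python
def mm1_response_time(service_time, utilization):
    """Average response time in an M/M/1 queue: R = S / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

S = 1.0   # baseline service time: X = 1 unit
for u in (0.00, 0.50, 0.75, 0.90):
    print(f"U = {u:4.0%}   R = {mm1_response_time(S, u):.1f}X")
# In this model, response time doubles at 50% utilization, quadruples
# at 75%, and grows without bound as utilization approaches 100%.
```

Whatever the exact multipliers, the shape is the same: response time climbs slowly at low utilization and explodes as utilization nears 100%.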

Queuing Theory of Performance

(Slide graphic: response time plotted against percent utilization; the curve starts at the baseline X, reaches roughly 2X at 75% utilization, and climbs through 3X toward 4X as utilization approaches 100%.)


1–8. SLIDE: How Long Is the Line?

Student Notes

The reason the average response time gets so poor as average resource utilization increases is that the line of jobs waiting for the resource gets longer. As resource utilization increases, the number of jobs waiting on the resource also increases. When poor performance is experienced, it is most often because the queue has become long. A long queue causes jobs to spend most of their time waiting in line for the resource (CPU, memory, network, or disk), as opposed to being serviced by the resource.

The slide shows four people waiting in line for a resource (think of a line in a bank with one bank teller). If it takes 5 minutes to service one customer, the fourth person in line will wait 15 minutes before reaching the resource. Adding another 5 minutes to service the request brings the total response time to 20 minutes for the last person in line, as opposed to 5 minutes if the line had been empty. There is also some overhead from "switching" between customers, though it is minimal in this example because the customers are handled serially.
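The bank-teller arithmetic above can be sketched in a few lines (illustrative Python):

```python
def fifo_response_times(service_minutes, people_in_line):
    """Response time for each position in a single-server FIFO line.

    The person at position i (0-based) waits for the i customers ahead
    to be serviced, then is serviced, so their total response time is
    (i + 1) * service_minutes.
    """
    return [(i + 1) * service_minutes for i in range(people_in_line)]

# Four people in line at a bank with one teller; 5 minutes per customer.
times = fifo_response_times(5, 4)
print(times)   # [5, 10, 15, 20]: the last person waits 15 min, then 5 more
```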

How Long Is the Line?

(Slide graphic: a line of jobs waiting to reach a system resource; the line starts behind the job currently being serviced.)


1–9. SLIDE: Example of Queuing Theory

Student Notes

The above slide provides an example of the queuing theory for disk drives, as reported with the sar tool. The four fields to focus on are:

%busy    The percentage utilization of each disk
avque    The average number of I/O requests in the queue for that disk
avwait   The average amount of time a request spends waiting in that disk's queue
avserv   The average amount of time taken to service an I/O request (not including the wait time)

Example of Queuing Theory

sar -d 5 5

15:31:55   device   %busy  avque  r+w/s  blks/s  avwait  avserv
15:32:00   c0t6d0      81    3.4     31     248   59.31   21.20
           c0t5d0       5     .5      1      32    0.65   23.58
15:32:05   c0t6d0      84    3.5     34     245   71.64   24.04
           c0t5d0       3     .5      2       8    0.25   17.93
15:32:10   c0t6d0      68    2.9     31     248   51.36   18.55
           c0t5d0       1     .5      0       6    0.48   19.18
15:32:15   c0t6d0      71    2.7     30      30   62.88   24.16
           c0t5d0       0     .5      1       3    0.65   29.25
15:32:20   c0t6d0      69    2.7     29      29   61.70   24.14
           c0t5d0       0     .5      1       3    0.65   29.25

Analyzing the data shows a baseline of around 20 milliseconds to service an I/O request (the approximate average of the avserv column). The first line item shows a disk that is 81% utilized; its total response time is the average wait plus the average service time, or approximately 80 milliseconds. That is four times longer than the baseline of 20 milliseconds. In fact, each snapshot shows the busy disk waiting in the queue for longer than it takes to service the I/O request itself. To see why the wait time is so high, look at the avque column: notice that the queue size is highest when the device is most busy. This is the basic concept of the performance queuing theory.
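The arithmetic behind this analysis can be reproduced directly from the sar figures. The snippet below (Python, purely illustrative) uses the numbers from the first c0t6d0 snapshot:

```python
# Response time per I/O request = queue wait time + service time.
# Figures taken from the first c0t6d0 line of the sar -d output above.
avwait = 59.31    # ms spent waiting in the disk's I/O queue
avserv = 21.20    # ms spent actually being serviced by the disk
baseline = 20.0   # ms, the approximate unloaded (baseline) service time

response = avwait + avserv
print(f"response time = {response:.2f} ms "
      f"({response / baseline:.1f}x the 20 ms baseline)")
# Roughly 80 ms in total: about four times the unloaded baseline, with
# most of the time spent waiting in the queue rather than on the disk.
```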


1–10. SLIDE: Summary

Student Notes

To summarize this module, systems are tuned for response time or for throughput. This class focuses on tuning for the best possible response time. Areas that affect response time are the speed of the hardware, the configuration of the operating system, and the configuration of the application; this class focuses on the configuration of the operating system. Common bottlenecks on computer systems include CPU, memory, disk, and network; this class discusses all four. Baselines are an important measurement tool for quantifying performance. In the lab for this module, the student will establish CPU and disk I/O baselines. Finally, the queuing theory of performance states that average response time increases as the average utilization of a resource increases. This is an important concept, and it will be revisited throughout the course.

Summary

• Objective for the system:
  – Provide fast response time to users, or
  – Maximize throughput of the system
• Three performance problem areas:
  – Hardware
  – Operating System
  – Application
• Performance bottlenecks:
  – CPU
  – Disk
  – Memory
  – Network
• Need for baselines
• Performance queuing theory


1–11. LAB: Establishing a Baseline

Directions

The following lab exercise establishes baselines for three CPU-bound applications and one disk-bound application. The objective is to time how long these applications take when there is no activity on the system. These same applications will be executed later on in the course when other bottleneck activity is present. The impact of these bottlenecks on user response time will be measured through these applications.

1. Change directory to /home/h4262/baseline.

   # cd /home/h4262/baseline

2. Compile the three C programs long, med, and short by running the BUILD script.

   # ./BUILD

3. Time the execution of the long program. Make sure there is no activity on the system.

   # timex ./long

   Record Execution Time:   real: _______   user: _______   sys: _______

4. Time the execution of the med program. Make sure there is no activity on the system.

   # timex ./med

   Record Execution Time:   real: _______   user: _______   sys: _______

5. Time the execution of the short program. Make sure there is no activity on the system.

   # timex ./short

   Record Execution Time:   real: _______   user: _______   sys: _______

6. Time the execution of the diskread program.

   # timex ./diskread

   Record Execution Time:   real: _______   user: _______   sys: _______


7. In the case of the long, med, and short programs, the real time is (approximately) the sum of the user and sys times. This is not the case with diskread. Explain why.
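As background for this question: wall-clock (real) time keeps advancing while a process is blocked, but CPU (user + sys) time does not. The same effect can be demonstrated with a short, purely illustrative Python sketch (not part of the lab):

```python
import time

def burn_cpu(seconds):
    """Spin until this process has consumed `seconds` of CPU time."""
    deadline = time.process_time() + seconds
    while time.process_time() < deadline:
        pass

wall_start, cpu_start = time.perf_counter(), time.process_time()
burn_cpu(0.2)      # CPU-bound work, like long/med/short: real ~ user + sys
time.sleep(0.2)    # blocked waiting, like diskread waiting on disk I/O
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"real = {wall:.2f}s   cpu (user+sys) = {cpu:.2f}s")
# Time spent blocked (sleeping, or waiting for a disk) adds to real
# time without consuming CPU time, which is why an I/O-bound program's
# real time exceeds its user + sys total.
```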


1–12. LAB: Verifying the Performance Queuing Theory

Directions

The performance queuing theory states that as the number of jobs in a queue increases, so will the response time of the jobs waiting to use that resource. This lab uses the short program compiled from /home/h4262/baseline/prime_short.c.

1. In terminal window 1, monitor the CPU queue with the sar command.

   # sar -q 5 200

2. In a second terminal window, time how long it takes for the short program to execute.

   # timex ./short &

   How long did the program take to execute? _________________
   How does this compare to the baseline measurement from earlier? _________________
   What is the CPU queue size? _________________

3. Time how long it takes for three short programs to execute.

   # timex ./short & timex ./short & timex ./short &

   How long did the slowest program take to execute? _________________
   How did the CPU queue size change from step 2? _________________

4. Time how long it takes for five short programs to execute.

   # timex ./short & timex ./short & timex ./short & \
     timex ./short & timex ./short &

   How long did the slowest program take to execute? _________________
   How did the CPU queue size change from step 3? _________________

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

6. Comment on the overhead of switching from one process to another.
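Step 5 above asks whether the relationship is linear. A rough model (an illustrative Python sketch, with a hypothetical baseline value) suggests it should be close to linear on a one-CPU system, since N identical CPU-bound jobs time-sliced fairly each receive 1/N of the CPU:

```python
def expected_real_time(baseline_secs, n_jobs, n_cpus=1):
    """Rough model of n identical CPU-bound jobs sharing n_cpus fairly.

    Each job receives at most a full CPU, and otherwise n_cpus/n_jobs
    of one, so its elapsed time stretches by the inverse of that share.
    Context-switch overhead is deliberately ignored.
    """
    share = min(1.0, n_cpus / n_jobs)
    return baseline_secs / share

base = 10.0   # hypothetical baseline for one copy of ./short on an idle CPU
for n in (1, 3, 5):
    print(f"{n} concurrent copies -> about {expected_real_time(base, n):.0f}s each")
# The model predicts 10s, 30s, and 50s: essentially linear. Real systems
# come out slightly worse because of the context-switch overhead the
# model ignores (see step 6).
```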


Module 2 Performance Tools

Objectives

Upon completion of this module, you will be able to do the following:

• Identify various performance tools available on HP-UX.

• Categorize each tool as either real time or data collection.

• List the major features of the performance tools.

• Compare and contrast the differences between the tools.


2–1. SLIDE: HP-UX Performance Tools

Student Notes

Many performance tools are available, for many different purposes. The HP-UX operating system includes over 50 different performance-related tools. Some tools provide real-time performance information, such as how busy the CPU is right now. Other tools collect data in the background and maintain a history of performance information. This module addresses these tools and the different functions they perform.

HP-UX Performance Tools


2–2. SLIDE: HP-UX Performance Tools (Continued)

Student Notes

The objective of this module is to highlight the performance tools available with HP-UX, to categorize them by function, and to describe how each tool is used. The module is intended as a quick reference of performance tools that the student can consult when selecting a tool for a specific task.

NOTE: This module does not discuss how to interpret the output of the tools.

Interpretation of the metrics is provided in later modules.

HP-UX Performance Tools

Objective:
• Identify the various performance tools available on HP-UX
• Demonstrate their mechanics
• Discuss their features
• Compare and contrast the differences between the tools


2–3. SLIDE: Sources of Tools

Student Notes Three types of tools are presented in this module:

• Standard Tools  Standard tools are those frequently found on many UNIX systems, including HP-UX. The advantage of the standard tools is that their results can be compared with those collected on other UNIX platforms. This provides an "apples-to-apples" comparison, which is desirable when comparing systems. The output from these standard tools (and some of their options) may vary slightly among UNIX systems. In addition, differences between the various UNIX implementations can affect the reliability of the metrics output by the tools. Therefore, be careful to check the results against other tools, or seek help, before basing important tuning decisions on the value of a single metric.

• HP-Specific Tools  HP-specific tools are those found only on the HP-UX operating system. These tools are often tailored specifically to HP-UX implementations, and are generally not found on other UNIX variants, as those implementations differ from HP's. Some of the HP-specific tools come with the base OS; others are purchased as optional tools.

• Optional Tools  Optional tools are added to the operating system in addition to the standard tools. Some, such as HP-PAK (Programmers Analysis Kit), may be included with add-on software, such as compilers for HP-UX. Others, like GlancePlus, PerfView, MeasureWare, NetMetrix, PRM (Process Resource Manager), and WLM (Work Load Manager), are purchased individually or in small bundles (GlancePlus Pak also includes a MeasureWare agent). Optional tools are typically licensed from HP. They offer many advantages over the standard tools, including:

− ease of use
− accuracy
− granularity
− low overhead
− additional metrics

Sources of Tools

• Standard tools - tools found on UNIX systems, including HP-UX; frequently found on other UNIX systems
• HP-UX-specific tools - tools found only on HP-UX
• Optional tools - tools licensed and sold separately (generally available only on HP-UX)


2–4. SLIDE: Types of Tools

Student Notes The tools covered in this section fall into six main categories:

• Run-Time Monitoring Tools These tools provide information as to the performance of the system now. The information is current and provides a real-time perspective as to the state of the system at the current moment.

• Data Collection Performance Tools These tools collect performance data in the background, summarize or average the data into a summary record, and log the summary record to a file or files on disk. They do not typically provide real-time data.

• Network Monitoring Tools These tools monitor performance, status, and packet errors on the network. They include both monitoring and configuration tools related to network management.

• Performance Administrative Tools  A system administrator can use these tools to manage the performance of the system. They typically do not report any data, but allow the current configuration of the system (and its components) to be changed to help improve performance.

• System Configuration and Utilization Information Tools  These tools report current system configurations (such as LVM and file systems). They also report resource utilization statistics, such as disk and file system space and the number of processes.

• Application Profiling and Monitoring Tools  These tools provide in-depth analysis of the behavior of a program. They monitor and trace the execution of a process, and report the resources used and the calls made during its execution.

Types of Tools

• Run-Time Monitoring
• Data Collection Performance
• Network Monitoring
• Performance Administration
• System Configuration and Utilization
• Application Profiling and Monitoring


2–5. SLIDE: Criteria for Comparing the Tools

Student Notes

Each tool has strengths and weaknesses, advantages and disadvantages, and unique features. Some items to consider when selecting a tool are:

Source of Data   The collected data can come from a variety of sources, including the kernel, an application, or a specific daemon (like the midaemon).

Scope            The scope determines the level of detail provided by the tool. Most of the standard tools do not show process-level metrics. For example, they display global disk I/O rates, but do not show which process is generating the I/O or the disk on which the I/O is concentrated.

Cost             The cost sometimes determines whether the tool is an option. Many of the HP-specific tools have additional costs associated with them. (Many of these tools have evaluation copies available for a trial period.)

Intrusiveness    The intrusiveness relates to the overhead associated with running the tool. A large user community running top, for example, may generate large amounts of "monitoring" overhead on the system. Another example is the ps command: it has little impact on most systems due to the low frequency at which it is executed, but it places fairly high overhead on the system while it runs.

Accuracy         The accuracy of the tool relates to the reliability of the data being reported. Many standard UNIX tools, like vmstat and sar, have been ported from other UNIX systems. The registers that they monitor may not always correspond to the registers that the kernel updates.

Others           Other factors can have a significant impact on the tool you decide to use. These factors include familiarity, metrics available, permissions required, and portability.

As the tools are presented in the upcoming pages, many of these items will be addressed.

Criteria for Comparing the Tools

• Source of data
• Scope
• Additional cost versus no cost
• Intrusiveness
• Accuracy
• Ease of use
• Portability
• Metrics available
• Data collection and storage
• Permissions required


2–6. SLIDE: Data Sources

Student Notes

The standard tools read information from the UNIX counters and registers maintained in kernel memory (accessible via the /dev/kmem device file and the pstat() system call). These counters and registers are updated 10 times a second as a standard part of most UNIX system implementations. The data in the counters and registers is generally adequate for most performance jobs, but does not provide enough detail when in-depth tuning is needed.

The optional tools for HP-UX use an additional source called kernel instrumentation (KI). The KI interface provides additional information beyond the UNIX kernel counters and registers: it gathers performance information on a system call basis, with every system call generated by every process being traced. The KI interface uses a proprietary measurement interface library to derive the additional metrics. These tools are frequently revised and updated to provide the highest levels of accuracy with the lowest possible overhead.

The optional tools, such as Glance and MeasureWare, are KI-based tools when running on HP-UX systems, although they are available for other vendors' systems as well. Additional information about KI-based tools (also known as resource and performance management (RPM) tools) can be obtained from the RPM web site at: www.hp.com/go/rpm

[Slide diagram: Data Sources. sar, vmstat, iostat, and ps read kernel memory via /dev/kmem or pstat(). The midaemon reads the kernel instrumentation trace buffers through the measurement interface library into a shared memory segment, which feeds glance directly and scopeux, whose logfiles are read by the extract utility; pv connects over a socket.]


2–7. SLIDE: Performance Monitoring Tools (Standard UNIX)

Student Notes

The slide shows the run-time performance monitoring tools included with HP-UX. These tools provide current information about the performance of the system. They are standard UNIX performance tools, found on most other UNIX implementations.

The Global Metrics column indicates whether the tool shows aggregate resource utilization without differentiating between specific resources. The Process Detail column indicates whether the tool shows resources being used by a single PID. The Alarming Capability column indicates whether the tool is capable of sending an alarm when one of the metrics exceeds a user-defined threshold.

Performance Monitoring Tools (Standard UNIX)

Tool        Global Metrics   Process Details   Alarming Capability
iostat      Yes              No                No
ps          No               Yes               No
sar         Yes              No                No
time        No               Some              No
timex       Some             Some              No
top         Yes              Some              No
uptime, w   Some             Some              No
vmstat      Yes              No                No


2–8. TEXT PAGE: iostat

The iostat command reports I/O statistics for each active disk on the system.

Tool Source: Standard UNIX (BSD 4.x)

Documentation: man page

Interval: >= 1 second

Data Source: Kernel registers/counters

Type of Data: Global

Metrics: Physical Disk I/O

Logging: Standard output device

Overhead: Varies, depending on the output interval

Unique Features: Terminal I/O

Full Pathname: /usr/bin/iostat

Pros and Cons: + statistics by physical disk drive
               - limited statistics
               - poorly documented and cryptic headings

Syntax

iostat [-t] [interval [count]]

-t        Report terminal statistics as well as disk statistics
interval  Display successive summaries at this frequency (in seconds)
count     Repeat the summaries this number of times

Key Metrics

The iostat metrics include:

bps   Blocks (kilobytes) transferred per second
sps   Number of seeks per second
msps  Average milliseconds per seek

With the advent of new disk technologies, such as data striping, where a single data transfer is spread across several disks, the average milliseconds per seek becomes impossible to compute accurately. At best it is only an approximation, varying greatly based on several dynamic system conditions. For this reason, and to maintain backward compatibility, the milliseconds per seek (msps) field is set to the value 1.0.

Examples

# iostat 5 2
  device    bps    sps   msps
  c0t6d0      0    0.0    1.0
  c0t6d0   1100   34.6    1.0


# iostat -t 5 1
      tty          cpu
 tin tout    us ni sy id
   0    0     2  0  1 98
  device    bps    sps   msps
  c0t6d0      0    0.0    1.0
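Because iostat's headings are terse, saved output is often post-processed with awk. This sketch flags busy disks in the sample output above, fed from a here-document standing in for a live run; the 500 KB/s threshold is an arbitrary assumption, not an HP guideline:

```shell
# Flag disks whose transfer rate (bps, in KB/s) exceeds a threshold.
# The here-document stands in for live `iostat 5 2` output.
busy_disks=$(awk '$1 ~ /^c[0-9]/ && $2 > 500 { print $1, $2 }' <<'EOF'
  device    bps    sps   msps
  c0t6d0      0    0.0    1.0
  c0t6d0   1100   34.6    1.0
EOF
)
echo "$busy_disks"
```

In live use, `iostat 5 2 | awk '…'` replaces the here-document.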


2–9. TEXT PAGE: ps

The ps command displays information about selected processes running on the system. The command has many options for reducing the amount of output.

Tool Source: Standard UNIX (BSD 4.x)

Documentation: man page

Interval: on demand

Data Source: in core process table

Type of Data: per process

Metrics: state, priority, nice values, PIDs, times, ...

Logging: Standard output device

Overhead: Varies, depending on the number of processes

Unique Features: Wait channel and Run queue of processes.

Full Pathname: /usr/bin/ps

Pros and Cons: + familiarity
               + options for altering output
               - minimal information
               - no averaging or summarization (i.e., no global metrics)

Syntax

ps [-aAcdefHjlP] [-C cmdlist] [-g grplist] [-G gidlist] [-n namelist] [-o format] [-R prmgrplist] [-s sidlist] [-t termlist] [-u uidlist] [-U uidlist]

Key Metrics

The ps metrics include: ADDR The memory address of the process, if resident; otherwise, the disk address.

C Recent processor utilization, used for CPU scheduling (0-255).

F Flags associated with the process (octal, additive):

0  Process is on the swap device
1  Process is in core memory
2  Process is a system process
4  Process is locked in memory

(and many more)

NI The nice value for the process; used in priority computation.

PPID The process ID number of the parent process.

PID The process ID number of this process.

Page 51: 81650923 HP UX Performance and Tuning H4262S

Module 2 Performance Tools

http://education.hp.com H4262S C.00 2004 Hewlett-Packard Development Company, L.P.

2-15

PRI The priority of the process.

S The state of the process

I  Process is being created (very rarely seen)
S  Process is sleeping
R  Process is currently runnable
T  Process is stopped (rare)
Z  Process is terminated (aka zombie process)

STIME Starting time of process.

SZ The size in 4-KB memory pages.

TIME The cumulative execution time of the process.

TTY The controlling terminal for the process.

WCHAN  The address of a structure representing the event or resource for which the process is waiting or sleeping.

Example

# ps -fu daemon
     UID   PID  PPID  C    STIME TTY  TIME COMMAND
  daemon  1171  1170  0 13:03:42 ?    3:10 /usr/bin/X11/X :0
  daemon  1565  1171  0 17:47:47 ?    0:00 pexd /tmp/to_pexd_1171.2 /dev/ttyp2

# ps -lu daemon
  F S UID  PID  PPID C PRI NI    ADDR  SZ  WCHAN TTY TIME COMD
  1 S   1 1171  1170 1 154 20  dbea00 697 3ace9c ?   3:10 X
  1 S   1 1565  1171 0 154 20 10e6900 115 3ace9c ?   0:00 pexd
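A common use of the per-process data is summarizing process states (the S column of `ps -el` output). This awk sketch counts states from canned sample lines fed via a here-document; the sample lines are illustrative, not taken from a live system:

```shell
# Count processes per state (column 2 of `ps -el`-style output: S, R, Z, ...),
# skipping the header line.
state_counts=$(awk 'NR > 1 { n[$2]++ } END { for (s in n) print s, n[s] }' <<'EOF'
F S UID  PID  PPID C PRI NI    ADDR  SZ  WCHAN TTY TIME COMD
1 S   1 1171  1170 1 154 20  dbea00 697 3ace9c ?   3:10 X
1 S   1 1565  1171 0 154 20 10e6900 115 3ace9c ?   0:00 pexd
1 R   0 3892  3890 2 179 20       -  77      - ?   0:00 top
EOF
)
echo "$state_counts"
```

In live use, `ps -el | awk '…'` replaces the here-document; a sudden rise in R (runnable) counts is a quick sign of CPU contention.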


2–10. TEXT PAGE: sar

The sar command collects and reports on many different system activities (and system areas), including CPU, buffer cache, disk, and others. Related commands include sadc, sa1, and sa2. These commands are related to the data collection functionality of sar and will be addressed with the data collection commands.

Tool Source: Standard UNIX (System V)

Documentation: man page and kernel source

Interval: >= 1 second

Data Source: /dev/kmem registers/counters

Type of Data: Global

Metrics: CPU, Disk, and Kernel resources

Logging: Standard output device, or file on disk

Overhead: Varies, depending on the output interval

Unique Features: Disk I/O wait time, kernel table overflows, buffer cache hit ratios

Full Pathname: /usr/sbin/sar

Pros and Cons: + familiarity
               + performs both real-time and data collection functions
               - no per-process information
               - no paging information; only designed for swapping (no longer done on HP-UX)

Syntax

sar [-ubdycwaqvmpAMSP] [-o file] t [n]

Metric-related options:
-u     CPU utilization
-q     Run queue and swap queue lengths and utilization
-b     Buffer cache stats
-d     Disk utilization
-y     TTY utilization
-c     System call rates
-w     Swap activity
-v     Kernel table utilization
-m     Semaphore and message queue utilization
-a     File access system routine utilization
-A     Everything!
-M     Per-processor breakdown (used with -u and/or -q)
-P/-p  Per processor set breakdown (used with -Mu and/or -Mq)


Key Metrics

The sar command has many metrics. Included below are some sample metrics based on the disk and CPU reports:

CPU Report (-u)

The CPU report displays the utilization of the CPU and the percentage of time spent within the different modes.

%usr   Percentage of time the system spent in user mode
%sys   Percentage of time the system spent in system mode
%wio   Percentage of time processes were waiting for (disk) I/O
%idle  Percentage of time the system was idle

Disk Report (-d)

The disk report displays activity on each block device (i.e., disk drive).

Device  Logical name of the device (device file name)
%busy   Percentage of time the device was busy servicing a request
avque   Average number of I/O requests pending for the device
r+w/s   Number of I/O requests per second (includes reads and writes)
blks/s  Number of 512-byte blocks transferred (to and from) per second
avwait  Average time I/O requests wait in the queue before being serviced
avserv  Average time spent servicing an I/O request (includes seek, rotational latency, and data transfer times)

Examples

# sar -u 5 4
HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio   %idle
08:32:29      64      36       0       0
08:32:34      61      39       0       0
08:32:39      61      39       0       0
08:32:44      61      39       0       0
Average       61      39       0       0

# sar -d 5 4
HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24   device   %busy   avque   r+w/s   blks/s   avwait   avserv
08:32:29   c0t6d0   19.36    0.55      20     1341     6.37    14.27
08:32:34   c0t6d0   26.40    0.58      27     1687     7.10    15.00
08:32:39   c0t6d0   21.00    0.54      23     1528     5.48    14.09
08:32:44   c0t6d0   21.00    0.54      23     1528     5.48    14.09
Average    c0t6d0   22.44    0.56      23     1552     6.34    14.45
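sar prints its own Average line, but when post-processing saved reports it is handy to recompute a statistic, say mean %busy, with awk. A sketch over the sample `sar -d` figures above, fed from a here-document standing in for a saved report:

```shell
# Recompute the mean %busy for one device from saved `sar -d` output,
# skipping sar's own trailing "Average" line ($1 = time, $2 = device, $3 = %busy).
avg_busy=$(awk '$2 == "c0t6d0" && $1 != "Average" { sum += $3; n++ }
                END { printf "%.2f", sum / n }' <<'EOF'
08:32:29   c0t6d0   19.36    0.55      20     1341     6.37    14.27
08:32:34   c0t6d0   26.40    0.58      27     1687     7.10    15.00
08:32:39   c0t6d0   21.00    0.54      23     1528     5.48    14.09
08:32:44   c0t6d0   21.00    0.54      23     1528     5.48    14.09
Average    c0t6d0   22.44    0.56      23     1552     6.34    14.45
EOF
)
echo "$avg_busy"
```

The device name c0t6d0 is taken from the sample; substitute the device of interest when processing real reports.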


2–11. TEXT PAGE: time, timex

Description

The time and timex commands report the elapsed (wall clock) time, the time spent in system mode, and the time spent in user mode for a specific invocation of a program. The timex command is an enhanced version of time, and can report additional statistics related to resources used during the execution of the command.

Tool Source: Standard UNIX (System V)

Documentation: man page and kernel source

Interval: Process completion

Data Source: Kernel registers/counters

Type of Data: Process

Metrics: CPU (user, system, elapsed)

Logging: Standard output device

Overhead: Minimal

Unique Feature: Timing how long a process executes

Full Pathname: /usr/bin/timex

Pros and Cons: + minimal overhead - cannot be used on already running processes

Syntax

time command
timex [-o] [-p[fhkmrt]] [-s] command

-o  List the amount of I/O performed by command (requires the pacct file to be present)
-s  List system activity (sar data) during execution of command (requires a sar file to be present)

Example

timex find / 2>&1 >/dev/null | tee -a perf.data

real       39.49
user        1.47
sys        11.24
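timex cannot attach to an already-running process, but elapsed time for a newly launched command can always be approximated in plain shell. A portable sketch with one-second resolution, using date rather than any HP-UX-specific tool:

```shell
# Measure elapsed (wall clock) seconds for a command, timex-style,
# but with only one-second resolution.
elapsed() {
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    echo $((end - start))
}

real=$(elapsed sleep 1)
echo "real ${real}s"
```

Unlike timex, this cannot split the figure into user and system CPU time; it only reports wall clock seconds.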


2–12. TEXT PAGE: top

Description

The top command displays a real-time list of the CPU consumers (processes) on the system, sorted with the greatest consumers at the top of the list.

Tool Source: Standard UNIX (BSD 4.x)

Documentation: man page

Interval: >= 1 second

Data Source: Kernel registers/counters

Type of Data: Global, Process

Metrics: CPU, Memory

Logging: Standard output device

Overhead: Varies, depending on presentation interval

Unique Feature: Real-time list of top CPU consumers

Full Pathname: /usr/bin/top

Pros and Cons: + quick look at global and process CPU data
               - limited statistics
               - uses curses for terminal output

Syntax

top [-s time] [-d count] [-n number] [-q]

-s time    Set the delay between screen updates
-d count   Set the number of screen updates to count, then exit
-n number  Set the number of processes to be displayed
-q         Run quick: run top with a nice value of zero

Key Metrics

The top metrics include:

SIZE   Total size of the process in KB, including text, data, and stack
RES    Resident size of the process in KB, including text, data, and stack
%WCPU  Average (weighted) CPU usage since top started
%CPU   Current CPU usage over the current interval

Example

* Start top with a 10-second update interval
# top -s 10

* Start top and display only 5 screen updates, then exit
# top -d 5

* Start top and display only the top 15 processes
# top -n 15


* Start top and let it run continuously
# top

System: r3w14                                  Fri Oct 17 10:24:23 1997
Load averages: 0.55, 0.37, 0.25
115 processes: 113 sleeping, 2 running
Cpu states:
LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
0.55   9.9%   0.0%   2.0%  88.1%   0.0%   0.0%   0.0%   0.0%

Memory: 24204K (15084K) real, 46308K (33432K) virtual, 2264K free   Page# 1/9

TTY   PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU COMMAND
?     680 root     154 20  1328K   468K sleep   33:23 12.36 12.34 snmpdm
?     728 root     154 20   340K   136K sleep   18:20  5.82  5.81 mib2agt
?    1141 root     154 20 12784K  3708K sleep   84:06  4.47  4.47 netmon
?    1071 root      80 20  1264K   568K run      0:19  3.00  2.99 pmd
?    3892 root     179 20   308K   296K run      0:00  2.59  0.34 top

* To go to the next/previous page, type "j" or "k" respectively
* To go to the first page, type "t"

NOTE: The two values preceding real and virtual memory are the memory allocated for all processes and, in parentheses, the memory allocated for processes that are currently runnable or that have executed within the last 20 seconds.

NOTE: swait and block are relevant for SMP systems and will be 0.0% on single-processor systems. swait is the time a processor spends "spinning" while waiting for a spinlock. block is the time a processor spends "blocked" while waiting for a kernel-level semaphore.


2–13. TEXT PAGE: uptime, w

The uptime command shows how long the system has been up, along with its load averages. The w command is linked to uptime and prints the same output as uptime -w, displaying a summary of the current activity on the system, including who is logged in and what they are doing.

Tool Source: Standard UNIX (BSD 4.x)

Documentation: man page

Interval: on demand

Data Source: Kernel registers/counters and /etc/utmp

Type of Data: Global

Metrics: Load averages, number of logged on users

Logging: Standard output device

Overhead: Varies, depending on number of users logged in

Unique Feature: Easiest way to see time since last reboot, load averages

Full Pathname: /usr/bin/uptime

Pros and Cons: + quick look at the load average and how long the system has been up
               - limited statistics

Syntax

uptime [-hlsuw] [user]
w [-hlsuw] [user]

-h  Suppress the first line and the header line
-l  Print long listing
-s  Print short listing
-u  Print only the utilization lines; do not show user information
-w  Print what each user is doing; same as the w command

Example

# uptime
11:23am  up 3 days, 22:22,  7 users,  load average: 0.62, 0.37, 0.30

# uptime -w
11:23am  up 3 days, 22:22,  7 users,  load average: 0.57, 0.37, 0.30
User      tty       login@   idle  JCPU  PCPU  what
root      console   9:26am  94:20              /usr/sbin/getty console
root      pts/0     9:26am      5              /sbin/sh
root      pts/3     9:26am   1:57              /sbin/sh
root      pts/4    10:16am            2     2  vi tools_notes
root      pts/5     9:43am                     script
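Scripts often need just the one-minute load average from uptime. This awk sketch extracts it from a canned uptime line taken from the example above; live use would pipe the output of `uptime` itself:

```shell
# Pull the 1-minute load average out of an uptime(1) line.
line='11:23am  up 3 days, 22:22,  7 users,  load average: 0.62, 0.37, 0.30'
load1=$(echo "$line" | awk -F'load average: ' '{ split($2, a, ", "); print a[1] }')
echo "$load1"
```

Splitting on the "load average: " label, rather than on a fixed field number, keeps the extraction working whether or not the up-time portion contains commas.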


2–14. TEXT PAGE: vmstat

The vmstat command reports virtual memory statistics about processes, virtual memory, and CPU activity.

Tool Source: Standard UNIX (BSD 4.x)

Documentation: man page, include files

Interval: >= 1 second

Data Source: Kernel registers/counters

Type of Data: Global

Metrics: CPU, Memory

Logging: Standard output device

Overhead: Varies, depending on presentation interval

Unique Feature: Cumulative VM statistics since last reboot

Full Pathname: /usr/bin/vmstat

Pros and Cons: + minimal overhead
               - poorly documented
               - cryptic headings
               - lines wrap on an 80-column character display
               - statistics can bleed together

Syntax

vmstat [-dnS] [interval [count]]
vmstat -f | -s | -z

-d  Include disk I/O information
-n  Print in a format more easily viewed on an 80-column display
-S  Include swapping information
-f  Print the number of processes forked since boot, the number of pages used by all forked processes, and the average pages per forked process
-s  Print virtual memory summary information
-z  Zero the summary registers

Key Metrics

The vmstat metrics include:

Process metrics

r  In run queue
b  Blocked for a resource (I/O, paging, and so on)
w  Runnable or short sleeper (< 20 sec.) but swapped

Page 59: 81650923 HP UX Performance and Tuning H4262S

Module 2 Performance Tools

http://education.hp.com H4262S C.00 2004 Hewlett-Packard Development Company, L.P.

2-23

VM metrics

avm   Active virtual pages
free  Number of pages on the free list
re    Page reclaims
at    Address translation faults
pi    Pages paged in
po    Pages paged out
fr    Pages freed by vhand, per second
sr    Pages surveyed (dereferenced) by vhand, per second

Fault metrics

in  Device interrupts per second
sy  System calls per second
cs  CPU context switch rate (switches/second)

CPU metrics

us  User mode utilization
sy  System mode utilization
id  Idle time

Examples

# vmstat -n 5 2
VM
memory            page                          faults
   avm   free   re   at   pi   po   fr   de   sr    in    sy   cs
  7589    728    0    0    0    0    0    0    0   140   490   30
CPU
cpu          procs
us sy id     r   b   w
 2  1 97     0  74   0
  7670    692    0    0    0    0    0    0    0   235  4959  170
47 11 42     0  75   0

# vmstat -nS 5 2
VM
memory            page                          faults
   avm   free   si   so   pi   po   fr   de   sr    in    sy   cs
  7984    584    0    0    0    0    0    0    0   140   490   30
CPU
cpu          procs
us sy id     r   b   w
 2  1 97     0  75   0
  7972    549    0    0    0    0    0    0    0   203   462   53
 1  1 98     0  76   0

# vmstat -f
3949 forks, 497929 pages, average= 126.09


# vmstat -s
         0 swap ins
         0 swap outs
         0 pages swapped in
         0 pages swapped out
   1116471 total address trans. faults taken
    346175 page ins
      7976 page outs
    200675 pages paged in
     16824 pages paged out
    213104 reclaims from free list
    216129 total page reclaims
       110 intransit blocking page faults
    587961 zero fill pages created
    303212 zero fill page faults
    248573 executable fill pages created
     67077 executable fill page faults
         0 swap text pages found in free list
     80233 inode text pages found in free list
       166 revolutions of the clock hand
    106769 pages scanned for page out
     13236 pages freed by the clock daemon
  75633551 cpu context switches
1612387244 device interrupts
   1137948 traps
 247228805 system calls
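A quick health check on interval-style vmstat data is whether po (page-outs) and sr (scan rate) stay at zero, since nonzero values are the first hint of memory pressure. This awk sketch scans canned data lines in the column order avm free re at pi po fr de sr; the sample rows are illustrative, not from a live system:

```shell
# Report samples where page-outs (po, field 6) or the scan rate (sr, field 9)
# are nonzero. Columns assumed: avm free re at pi po fr de sr.
pressure=$(awk '{ if ($6 > 0 || $9 > 0) print "pressure at sample " NR }' <<'EOF'
7589 728 0 0 0 0 0 0 0
7670 692 0 0 0 5 3 0 12
7711 640 0 0 0 0 0 0 0
EOF
)
echo "$pressure"
```

Memory modules later in the course cover what sustained nonzero po and sr values actually mean for vhand and the paging subsystem.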


2–15. SLIDE: Performance Monitoring Tools (HP Specific)

Student Notes

This slide shows the HP-specific, run-time performance monitoring tools included with HP-UX. Currently, glance and gpm are available for HP-UX. Both are optional and can be purchased separately; if you are running 11i (any version), both glance and gpm are included with the Enterprise and Mission Critical Operating Environments. The glance and gpm tools provide real-time monitoring capabilities specific to the HP-UX operating system. Both provide access to performance data not available with standard UNIX tools, and both use the midaemon (i.e., the KI interface) to collect performance data, yielding much more accurate performance results.

xload is an X Windows application that graphically shows the recent length of the CPU's run queue. It consists of a window that displays vertical lines representing the average number of processes in the run queue over the previous intervals. The default interval size is 8 seconds.

Performance Monitoring Tools (HP Specific)

Tool     Global Metrics   Process Details   Alarming Capability
glance   Yes              Yes               Yes
gpm      Yes              Yes               Yes
xload    Yes              No                No


2–16. TEXT PAGE: glance

The glance tool is available for HP-UX. This is the recommended (and preferred) performance monitoring tool for HP-UX systems (character-based display). This tool shows information that cannot be seen with any of the standard UNIX monitoring tools. The accuracy of the data is considered more reliable, as the source is the midaemon, as opposed to the kernel counters and registers.

NOTE: Free evaluation copies of glance and gpm can be obtained for trial periods. The phone number to obtain an evaluation copy is (800) 237-3990.

Tool Source: HP

Documentation: man page and on-line help

Interval: >= 2 seconds

Data Source: midaemon

Type of Data: Global, Process, and Application

Metrics: CPU, Memory, Disk, Network, and Kernel resources

Logging: Standard output device, screen shots to a file

Overhead: Varies, depending on presentation interval and number of processes

Unique Features: Per-process (and global) system call rates
                 Extensive online help for the metrics
                 Sorting by CPU usage, memory usage, or disk I/O usage
                 Files opened per process

Full Pathname: /opt/perf/bin/glance

Pros and Cons: + extensive per-process information
               + extensive global information
               + more accurate than standard UNIX tools
               - uses the "curses" display library
               - relatively slow startup
               - not bundled with the OS (prior to 11i)

Syntax

glance [-j interval] [-p [dest]] [-f dest] [-maxpages numpages] [-command] [-nice nicevalue] [-nosort] [-lock] [-adviser_off] [-adviser_only] [-bootup] [-iterations count] [-syntax filename] [-all_trans] [-all_instances] [-disks <n>] [-kernel <path>] [-nfs <n>] [-pids <n>] [-no_fkeys]


Key Metrics

The glance tool includes reports for the following areas:

Hot Key   GlancePlus Report       Function
a         CPU by Processor        All CPUs performance stats
c         CPU Report              CPU utilization stats
d         Disk Report             Disk I/O stats
g         Process List            Global process stats
h         Help
i         I/O by Filesystem       I/O by file system
l         Network by LAN          LAN stats
m         Memory Report           Memory stats
n         NFS Report              NFS stats
s         Process selection       Single process information
t         System Table Report     OS table utilization
u         Disk Report             Disk queue length
v         I/O by Logical Volume   Logical Volume Manager stats
w         Swap Detail             Swap stats
z         Zero all stats
A         Application List
B         Global Waits
D         DCE Activity
F         Process Open Files
G         Process Threads
H         Alarm History
I         Thread Resource
J         Thread Wait
K         DCE Process List
L         Process System Calls
M         Process Memory Regions
N         NFS Global Activity
P         PRM Group List
R         Process Resources
T         Transaction Tracker
W         Process Wait States
Y         Global System Calls
Z         Global Threads
?         Help with options
<CR>      Update screen with new data

See Module 3 for a more complete discussion of glance and gpm.

2–17. TEXT PAGE: gpm

The gpm tool is a graphical version of glance. All the benefits of using glance apply to gpm (GlancePlus Monitor).

NOTE: Free evaluation copies of glance and gpm can be obtained for a 90-day trial period. The phone number to obtain an evaluation copy is (800) 237-3990.

Tool Source: HP

Documentation: man page and on-line help

Interval: >= 1 second

Data Source: midaemon

Type of Data: Global, Process, Application

Metrics: CPU, Memory, Disk, Network, kernel resources

Logging: Standard output device and screen shots to a file

Overhead: Varies, depending on presentation interval and number of processes

Unique Features: Alarming capabilities. Performance advisor.

Full Pathname: /opt/perf/bin/gpm

Pros and Cons:
+ extensive per-process information
+ extensive global information
+ more accurate than standard UNIX tools
- no selection for printing graphs
- not bundled with the OS (prior to 11i)

Syntax

gpm [-nosave] [-rpt [rptname]] [-sharedclr] [-nice nicevalue] [-lock] [-disks <n>] [-kernel <path>] [-lfs <n>] [-nfs <n>] [-pids <n>] [Xoptions]

Glance and GPM Advantages

Both Glance and GPM:

• Use the same metrics

• Use the midaemon and kernel registers/counters as data sources

• Have adjustable presentation intervals

• Have the ability to renice processes

• Provide alarming capability (via /var/opt/perf/advisor.syntax)

• Provide per-CPU metrics

• Can be configured to monitor application performance (that is, groups of processes)

Glance Advantages

Advantages of using Glance include:

• It is independent of X-Windows.

• It uses less overhead.

GPM Advantages

Advantages of using gpm include:

• It has customizable advisor syntax, which generates color-coded alarms.

• It has the ability to kill processes.

• Reports are customizable.

• More comprehensive online documentation is available.

See Module 3 for a more complete discussion of glance and gpm.

2–18. TEXT PAGE: xload

xload is a graphical tool that displays the average length of the run queue over recent 10-second intervals. Since it is displayed in its own window on a graphics terminal, the window can be resized to show good detail and many intervals at once.

Tool Source: HP

Documentation: man page

Interval: 10 seconds (default)

Data Source: Kernel registers

Type of Data: Global

Metrics: Run queue length

Logging: none

Overhead: Very little

Unique Feature: Visual representation of run queue lengths

Full Pathname: /usr/contrib/bin/X11/xload

Pros and Cons:
+ visual representation of run queue lengths in real time
+ expandable window for greater time and detail
+ self-scaling
- no scale labels
- no per-processor information

Syntax

xload [-toolkitoption … ] [-scale integer] [-update seconds] [-hl|-highlight color] [-jumpscroll pixels] [-label string] [-nolabel] [-lights]

Example

xload -update 30

2–19. SLIDE: Data Collection Performance Tools (Standard UNIX)

Student Notes

This slide shows the standard UNIX data collection tools included with HP-UX. Data collection tools gather performance data and other system-activity information, and store this data to a file on the system. Few standard UNIX tools perform data collection by default. The two most common are the acct (system accounting) suite of tools and sar, the system activity reporter (via the sadc and sa1 programs).

Data Collection Performance Tools (Standard UNIX)

Tool     Global Metrics   Process Details   Alarming Capability
sar      Yes              No                No
acctcom  Some             Some              No

2–20. TEXT PAGE: acct Programs

The system accounting programs are primarily a financial tool and are designed to charge for time and resources used on the system. Information such as connect time, pages printed, disk space used for file storage, and commands executed (and the resources used by those commands) is collected and stored by the acct commands. Generally not considered a performance tool, the accounting commands can provide useful data for certain situations.

Description

Tool Source: Standard UNIX (System V)

Documentation: man pages

Interval: on demand

Data Source: Kernel registers and other kernel routines

Type of Data: System resources used, on a per user basis

Metrics: Connect time, Disk space used, others

Logging: Binary file /var/adm/acct/pacct

Overhead: Medium to large (up to 33%), depending on number of users and amount of activity

Unique Feature: Shows the amount of system resources being consumed by each user on the system.

Logs every command executed by every user on the system.

Full Pathname: /usr/sbin/acct/[acct_command]

Pros and Cons:
+ provides information to charge users for system use
+ extensive system utilization information kept
- extremely large overhead, especially on an active system
- poor documentation

Syntax

/usr/sbin/acct/acctdisk
/usr/sbin/acct/acctdusg [-u file] [-p file]
/usr/sbin/acct/accton [file]
/usr/sbin/acct/acctwtmp reason
/usr/sbin/acct/closewtmp
/usr/sbin/acct/utmp2wtmp
and many more …

System Accounting Notes

• System Accounting can be started:

Manually: Run the /usr/sbin/acct/startup command.

Automatically at boot time: Edit the /etc/rc.config.d/acct file and set the START_ACCT parameter equal to one (for example, START_ACCT=1).

• Only terminated processes are reported.

• Accounting reports include:

− CPU time accounting

− Disk accounting

− Memory accounting

− Connect time accounting

− User command history

− Several more

2–21. TEXT PAGE: sar

The sar tool comes with additional programs that assist in performance data collection and storage. The performance data is kept for one month before being overwritten with new data. Since collected data is overwritten each month, monitoring the files' sizes is unnecessary.

The sadc program is a data collector that runs in the background, usually started by sar or sa1.

The sa1 program is a convenient shell script for collecting and storing sar data to a log file under /var/adm/sa. This script is typically run from root's cron file and collects (by default) three system snapshots per hour.

The sa2 program is also a convenient shell script, for converting collected sar data (binary format) into readable ASCII report files. The report files are typically stored in /var/adm/sa. The sa2 script is also normally run from root's cron file.

Tool Source: Standard UNIX (System V)

Documentation: man page

Interval: >= 1 second

Data Source: Kernel registers

Type of Data: Global

Metrics: CPU, Disk, Kernel resources

Logging: Binary file under /var/adm/sa

Overhead: Varies, depending on snapshot interval

Unique Feature: Only standard UNIX performance data collector

Full Pathname: /usr/sbin/sar

Pros and Cons:
+ familiarity
+ relatively low overhead
- no per-process information
- accuracy not as good as MeasureWare/OVPA

Syntax

sar [-ubdycwaqvmAMS] [-o file] t [n]
sar [-ubdycwaqvmAMS] [-s time] [-e time] [-i sec] [-f file]

Some data collection related options:

-s  The start time of the desired data
-e  The end time of the desired data
-i  The size of the reporting interval in seconds
-o  The file to write the data to
-f  The file to read the data from

Configure Data Collection through cron Jobs

To set up sar data collection, add the following to root's cron file:

0 * * * 0,6 /usr/lbin/sa/sa1
0 8-17 * * 1-5 /usr/lbin/sa/sa1 1200 3
0 18-7 * * 1-5 /usr/lbin/sa/sa1
5 18 * * 1-5 /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -u
5 18 * * 1-5 /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -b
5 18 * * 1-5 /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -q

Create the /var/adm/sa directory:

mkdir /var/adm/sa

Some systems recommend adding the above entries to adm's cron file instead of root's. On these systems, be sure to give write access to all users on the /var/adm/sa directory.

chmod a+w /var/adm/sa
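
Once sa2 has produced ASCII reports, they can be post-processed with standard tools. The following is a minimal sketch that averages the CPU columns of a sar -u style report with awk. The report lines and the /tmp file path are hypothetical sample data, not output from a real system; on a live system the report would come from something like sar -u -f /var/adm/sa/saDD.

```shell
# Average the CPU columns of a sar -u style report.
# The sample report below uses made-up values for illustration.
cat > /tmp/sar_u_sample.txt <<'EOF'
08:00:00    %usr    %sys    %wio   %idle
08:20:00      20      10       5      65
08:40:00      40      15      10      35
09:00:00      30      15       5      50
EOF

# Skip the header line, sum each column, and print the averages.
awk 'NR > 1 { usr += $2; sys += $3; wio += $4; idle += $5; n++ }
     END { printf "avg %%usr=%.1f %%sys=%.1f %%wio=%.1f %%idle=%.1f\n",
                  usr / n, sys / n, wio / n, idle / n }' /tmp/sar_u_sample.txt
```

With the sample values above this prints avg %usr=30.0 %sys=13.3 %wio=6.7 %idle=50.0.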

2–22. SLIDE: Data Collection Performance Tools (HP-Specific)

Student Notes

This slide shows the HP-specific data collection performance tools, which can be added to an HP-UX system. The MeasureWare/OVPA (OpenView Performance Agent) and PerfView/OVPM (OpenView Performance Manager) tools are available for HP-UX systems as optional, separately purchasable products. They significantly enhance a customer's ability to track performance trends and review historical performance data about a system.

The standard UNIX tools collect little to no per-process information and have no alarming capabilities. With the MeasureWare/OVPA and PerfView/OVPM tools, global and per-process information is collected. In addition, alarms can be set to notify a user when a collected metric exceeds a defined threshold.

Recently, PerfView was renamed OpenView Performance Manager and MeasureWare was renamed OpenView Performance Agent. There were no other significant changes made to the products.

Data Collection Performance Tools(HP-Specific)

Tool                     Global Metrics   Process Details   Alarming Capability
MeasureWare/OVPA         Yes              Yes               Yes
PerfView/OVPM            Yes              Yes               Yes
Data Source Integration  User Definable   User Definable    User Definable

2–23. TEXT PAGE: MeasureWare/OVPA and DSI Software

MeasureWare/OVPA is the recommended and preferred tool for collecting performance data on an HP-UX system. MeasureWare/OVPA collects all the global and process statistics, consolidates the data into a 5-minute summary, and writes the record to a circular log file. Processes can be grouped into applications, and various thresholds are available for determining which processes are included in the summary. OVPA version 3.x is identical to MeasureWare. OVPA version 4.x serves the same purpose, but has a new user interface.

Included with MeasureWare/OVPA is a product/tool called Data Source Integration (DSI). DSI allows custom, application-specific metrics to be defined and collected via the MeasureWare/OVPA product. This custom information can include database statistics, networking statistics collected with NetMetrix, or MIB information from a networking device (router or gateway) collected with SNMP.

Tool Source: HP

Documentation: man pages, manual, on-line help

Interval: 1 minute and 5 minute summaries

Data Source: midaemon

Type of Data: Global, Process, Application

Metrics: CPU, Memory, Disk, Network, Other

Logging: Circular binary files under /var/opt/perf/datafiles

Overhead: Varies, depending on the number of processes and the number of application definitions

Unique Feature: Parameter file to define the extent of data collection. Circular, compact log file format

Full Pathname: /opt/perf/bin/mwa

Pros and Cons:
+ extensive global information
+ extensive per-process information
+ customizable data collection with DSI
- requires another tool (PerfView/OVPM) for graphical analysis
- not included with the base OS

Syntax

mwa [action] [subsystem] [parms]

in which action is:

start    Start all or part of MeasureWare/OpenView Performance Agent (default).
stop     Stop all or part of MeasureWare/OpenView Performance Agent.
restart  Reinitialize all or part of MeasureWare/OpenView Performance Agent. This option causes some processes to be stopped and restarted.

status   List the status of all or part of MeasureWare/OpenView Performance Agent processes.

MeasureWare/OVPA and Data Source Integration Notes

• The MeasureWare/OVPA agent for HP-UX is part of the RPM (Resource and Performance Management) set of performance tools. To find the complete list of available RPM products, visit the RPM Web site at: www.hp.com/go/rpm

• MeasureWare/OVPA is designed for use with the PerfView/OVPM Analyzer tool and features extensive alarming syntax.

• The utility and extract programs for MeasureWare/OVPA provide many features for the analysis and management of the MeasureWare/OVPA log files.

• The MeasureWare/OVPA agent is fully integrated with the OpenView product line and is capable of sending alarm messages to the PerfView/OVPM Monitor, Network Node Manager, and IT Operations.

• The MeasureWare/OVPA agent is available for a large number of UNIX platforms including: AIX, Solaris, NCR System VR4, Microsoft Windows NT, and more.

• Data Source Integration (DSI) is one of the most powerful features of MeasureWare/OVPA. DSI provides the ability to log data from any data source – as long as it writes its output to stdout.

• HP sells additional agents, which make use of this data source integration to allow for the monitoring of databases, network operating systems (for example, Windows NT and NetWare), and the Network Response Monitoring metrics (a facility of NetMetrix).

• Data can be imported from such operating environments as SAP/R3 and Baan.

See the course B5136S – “Performance Management with HP OpenView” for a more complete discussion of MeasureWare/OVPA.

2–24. TEXT PAGE: PerfView/OVPM

The PerfView/OVPM tool allows collected MeasureWare/OVPA information to be viewed in a feature-rich GUI interface. Graphs, charts, alarms, and other details are easily viewed with the PerfView/OVPM tool. Similarly to the MeasureWare product, OVPM version 3 is identical to PerfView, whereas OVPM version 4 has the same functionality but a new user interface.

Tool Source: HP

Documentation: Man pages, manual, online help

Interval: On demand

Data Source: MeasureWare/OVPA log files

Type of Data: Global, Process, and Application

Metrics: CPU, Memory, Disk, Network, others

Logging: To central monitoring workstation

Overhead: Varies, depending on the number of systems being analyzed and the number of systems sending alarms

Unique Feature: Many predefined graph templates. Access to any system currently running the MeasureWare/OVPA agent.

Full Pathname: /opt/perf/bin/pv

Pros and Cons:
+ centralized and automated performance monitoring
+ can view data from DSI sources
+ graphs can be saved in a worksheet format
- does not come standard with the OS

Syntax

pv [options]

PerfView/OVPM Notes

There are three components that make up the PerfView/OVPM product:

PerfView/OVPM Analyzer

• The PerfView/OVPM Analyzer allows for the performance administrator to easily access data from any MeasureWare/OVPA Agent.

• By default, the last 8 days of data are pulled in to be analyzed, but any amount of data that has been collected can be retrieved.

• The PerfView/OVPM Analyzer also allows you to compare multiple systems against a specific metric, which is useful for load balancing.

• The graphs produced by the PerfView/OVPM Analyzer can be stored or printed to any PostScript or PCL printer.

• As with all of the RPM products, the PerfView/OVPM Analyzer is fully integrated with Network Node Manager and IT Operations.

PerfView/OVPM Monitor

• The PerfView/OVPM Monitor receives alarms sent by MeasureWare/OVPA agents.

• It allows you to filter alarms by severity and type.

• The PerfView/OVPM Monitor is an optional module and may not be required if you are also running Network Node Manager or IT Operations.

PerfView/OVPM Planner

• The PerfView/OVPM Planner allows you to use collected MeasureWare/OVPA data to see performance trends.

• The more data provided to the PerfView/OVPM Planner, and the shorter the period you project over, the more accurate the reports will be.

• The PerfView/OVPM Planner is not a true capacity-planning tool in that it does not provide modeling or simulation capability.

See the course B5136S – “Performance Management with HP OpenView” for a more complete discussion of PerfView/OVPM.

2–25. SLIDE: Network Performance Tools (Standard UNIX)

Student Notes

This slide shows the standard UNIX networking performance tools included with HP-UX. Networking performance tools monitor performance and errors on the network.

The standard UNIX networking tools primarily allow for monitoring of performance. The HP-specific tools will introduce the ability to tune some networking parameters to better meet the needs of a system's networking environment.

NOTE: Super user (or root) access is not needed to monitor networking status by default.

Network Performance Tools (Standard UNIX)

Tool     Resource                                                       Super User Access Required
netstat  Various LAN Statistics                                         No
nfsstat  Network File Sharing Statistics                                No
ping     Test Network Connectivity and Packet Round-Trip Response Time  No

2–26. TEXT PAGE: netstat

The netstat command displays general networking statistics. Information displayed includes:

• active sockets per protocol
• network data structures (like route tables)
• LAN card configuration and traffic

Tool Source: Standard UNIX (BSD 4.x)

Documentation: man pages and manual

Interval: on demand

Data Source: Kernel registers and LAN card

Type of Data: Global

Metrics: Network, LAN I/O, Sockets

Logging: Standard output device

Overhead: Varies, depending on network activity

Unique Features: Shows established and listening sockets. Shows traffic going through the LAN interface card. Shows the amount of memory allocated to networking.

Full Pathname: /usr/bin/netstat

Pros and Cons:
+ provides lots of information on networking configuration
- provides lots of metrics; not all metrics are documented well

Syntax

netstat [-aAn] [-f address-family] [system [core]]
netstat [-f address-family] [-p protocol] [system [core]]
netstat [-gin] [-I interface] [interval] [system [core]]

Examples

Display network connections:

# netstat -n
Active Internet connections
Proto  Recv-Q  Send-Q  Local Address         Foreign Address       (state)
tcp    0       0       156.153.192.171.1128  156.153.192.171.1129  ESTABLISHED
tcp    0       0       156.153.192.171.1129  156.153.192.171.1128  ESTABLISHED
tcp    0       0       156.153.192.171.947   156.153.192.171.1105  ESTABLISHED
Active UNIX domain sockets
Address  Type    Recv-Q  Send-Q  Inode   Conn    Refs  Nextref  Addr
c6f300   dgram   0       0       844afc  0       0     0        /var/tmp/psb_front_socket
c87e00   dgram   0       0       844c4c  0       0     0        /var/tmp/psb_back_socket
de4f00   stream  0       0       0       f75240  0     0
f71200   stream  0       0       0       f75280  0     0        /var/spool/sockets/X11/0
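
When many connections are listed, it is often more useful to count them by state than to read the list line by line. The sketch below does this with awk; the connection lines are hypothetical sample data, and on a live system you would pipe netstat -n directly into the awk command.

```shell
# Count TCP connections by state from "netstat -n" style output.
# The three sample lines below are made-up data for illustration.
printf '%s\n' \
  'tcp        0      0  10.0.0.1.1128   10.0.0.2.1129   ESTABLISHED' \
  'tcp        0      0  10.0.0.1.1129   10.0.0.2.1128   ESTABLISHED' \
  'tcp        0      0  10.0.0.1.947    10.0.0.3.1105   TIME_WAIT' |
awk '$1 == "tcp" { state[$NF]++ }                 # last field is the state
     END { for (s in state) print state[s], s }' | sort -rn
```

With the sample lines above this prints 2 ESTABLISHED followed by 1 TIME_WAIT.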

Display network interface information:

# netstat -in
Name  Mtu   Network        Address          Ipkts  Ierrs  Opkts  Oerrs  Coll
ni0*  0     none           none             0      0      0      0      0
ni1*  0     none           none             0      0      0      0      0

lo0   4608  127            127.0.0.1        6745   0      6745   0      0
lan0  1500  156.153.192.0  156.153.192.171  156    0      0      0      0
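
The Ierrs and Oerrs columns are more meaningful as a percentage of packets than as raw counts. The following is a minimal sketch that flags interfaces with a high inbound error rate; the counter values are hypothetical, and on a live system you would pipe netstat -in into the awk command instead.

```shell
# Flag interfaces whose inbound error rate exceeds 1% of inbound packets.
# The counter values below are made-up sample data.
printf '%s\n' \
  'Name  Mtu   Network        Address          Ipkts  Ierrs  Opkts  Oerrs  Coll' \
  'lo0   4608  127            127.0.0.1        6745   0      6745   0      0' \
  'lan0  1500  156.153.192.0  156.153.192.171  10000  250    9000   0      0' |
awk 'NR > 1 && $5 > 0 {
         rate = 100 * $6 / $5                 # Ierrs as a percentage of Ipkts
         if (rate > 1.0)
             printf "%s: %.1f%% inbound errors\n", $1, rate
     }'
```

With the sample counters above this prints lan0: 2.5% inbound errors.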

Display network interface traffic:

# netstat -I lan0 5
(lan0)->  input    output    (Total)->  input    output
          packets  packets              packets  packets
          188      172                  6973     6785
          2        1                    2        1
. . .

Display protocol status:

# netstat -s
tcp:
    2244 packets sent
        1191 data packets (217208 bytes)
        4 data packets (5840 bytes) retransmitted
        692 ack-only packets (276 delayed)
        318 control packets
    2277 packets received
        1288 acks (for 195140 bytes)
        144 duplicate acks
        1360 packets (236775 bytes) received in-sequence
        0 completely duplicate packets (0 bytes)
        83 out-of-order packets (0 bytes)
        0 discarded for bad header offset fields
        0 discarded because packet too short
    134 connection requests
    120 connection accepts
    243 connections established (including accepts)
udp:
    0 bad checksums
    164 socket overflows
    0 data discards
ip:
    460730 total packets received
    0 bad header checksums
    0 with ip version unsupported
    2253 fragments received
    2670 packets not forwardable
    0 redirects sent
icmp:
    1989 calls to generate an ICMP error message
    Output histogram:
        echo reply: 727
        destination unreachable: 1989
    727 responses sent
arp:
    0 Bad packet lengths
    0 Bad headers
probe:
    0 Packets with missing sequence number
    0 Memory allocations failed
igmp:
    0 messages received with bad checksum
    10939700 membership queries received
    10969833 membership queries received with incorrect field(s)
    0 membership reports received

2–27. TEXT PAGE: nfsstat

The nfsstat command displays network file system (NFS) statistics. Categories of NFS information include:

• server statistics

• client statistics

• RPC statistics

• performance detail statistics

Tool Source: Sun Microsystems

Documentation: man pages

Interval: on demand

Data Source: Kernel registers

Type of Data: Global

Metrics: NFS, RPC

Logging: Standard output device

Overhead: Varies, depending on NFS activity

Unique Feature: Shows RPC calls, retransmissions, and timeouts.

Full Pathname: /usr/bin/nfsstat

Pros and Cons:
+ reports both client and server activity
- limited documentation

Syntax

nfsstat [ -cmnrsz ]

Examples

To reset all nfsstat counters to zero:

# nfsstat -z

To display server/client RPC and NFS statistics:

# nfsstat          (this defaults to nfsstat -cnrs)

Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0

Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
commit
0 0%

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds
20         0          0          0          0
badverfs   timers     cantconn   nomem      interrupts
0          17         0          0          0
Connectionless oriented:
calls      badcalls   retrans    badxids    timeouts   waits      newcreds
20         0          0          0          0          0          0
badverfs   timers     toobig     nomem      cantsend   bufulocks
0          17         0          0          0          0

Client nfs:
calls      badcalls   clgets     cltoomany
20         0          20         0
Version 2: (20 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       18 90%     0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       1 5%       1 5%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
commit
0 0%
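
The client rpc counters are usually judged relative to the total number of calls. The sketch below applies a commonly used rule of thumb (retransmissions above roughly 5% of calls deserve investigation) to hypothetical counter values; on a live client the numbers would come from the "Client rpc:" section of nfsstat output.

```shell
# Evaluate NFS client RPC retransmissions against total calls.
# These counter values are made-up sample data.
calls=1000
retrans=80

retrans_pct=$(( 100 * retrans / calls ))      # integer percentage is enough here
if [ "$retrans_pct" -ge 5 ]; then
    # Rule of thumb: a high retransmission rate suggests the network or
    # the NFS server is dropping or delaying requests.
    echo "retransmissions at ${retrans_pct}% of calls - investigate network/server"
else
    echo "retransmission rate acceptable (${retrans_pct}%)"
fi
```

With the sample counters above this prints retransmissions at 8% of calls - investigate network/server.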

2–28. TEXT PAGE: ping

The ping command sends an ICMP echo packet to a host, and times how long it takes for the echo packet to return. This command is often used to test connectivity to another system. Specific details of the implementation include:

• An ICMP echo packet is sent once a second.

• Upon receipt of the echo packet, the round-trip time is displayed.

• The -o option displays the IP route taken.

Tool Source: Public Domain

Documentation: man pages

Interval: on demand

Data Source: NIC and ICMP packets

Type of Data: Network

Metrics: Packet transmission

Logging: Standard output device

Overhead: minimal; one packet transmission per second

Unique Features: Shows round-trip times between systems. Shows the route taken to and from the second system.

Full Pathname: /usr/sbin/ping

Pros and Cons:
+ familiarity
+ understood by all UNIX-based (and TCP/IP-based) systems
- limited functionality

Syntax

ping [-oprv] [-i address] [-t ttl] host [-n count]

Examples

Send two ICMP echo packets to host star1:

# ping star1 -n 2
PING star1: 64 byte packets
64 bytes from 156.153.193.1: icmp_seq=0. time=1. ms
64 bytes from 156.153.193.1: icmp_seq=1. time=0. ms
----star1 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/0/1

Send one ICMP packet and display the IP path taken:

# ping -o 156.152.16.10 -n 1
PING 156.152.16.10: 64 byte packets
64 bytes from 156.152.16.10: icmp_seq=0. time=337. ms
----156.152.16.10 PING Statistics----
1 packets transmitted, 1 packets received, 0% packet loss
round-trip (ms) min/avg/max = 337/337/337
1 packets sent via:
15.63.200.2 - [ name lookup failed ]
15.68.88.4 - [ name lookup failed ]
156.152.16.1 - [ name lookup failed ]
156.152.16.10 - [ name lookup failed ]
15.68.88.43 - [ name lookup failed ]
15.63.200.1 - [ name lookup failed ]
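
The summary line is easy to post-process when scripting connectivity checks. The following is a minimal sketch that extracts the average round-trip time; the summary string is sample text in the same min/avg/max format shown above, not output captured from a real ping run.

```shell
# Pull the average round-trip time out of a ping summary line.
# The summary string below is sample data in ping's summary format.
summary='round-trip (ms) min/avg/max = 2/5/12'

# Take the text after "= " and split it on "/": min, avg, max.
avg=$(echo "$summary" | awk -F'= ' '{ split($2, t, "/"); print t[2] }')
echo "average round-trip: ${avg} ms"
```

With the sample summary above this prints average round-trip: 5 ms.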

2–29. SLIDE: Network Performance Tools (HP-Specific)

Student Notes

This slide shows the HP-specific networking performance tools included with HP-UX. The first three tools listed (lanadmin, lanscan, and ndd/nettune) come standard with the base OS; NetMetrix is an optional, separately purchased product. The HP-specific networking tools display additional networking information and allow tuning of various networking parameters.

Network Performance Tools(HP-Specific)

Tool            Resource                                                 Super User Access Required
lanadmin        Layer 2 Networking Statistics and NIC Reset              Yes
lanscan         LAN Hardware and Software Status                         No
NetMetrix       Collects network performance data using RMON LAN probes  Yes
ndd (11.x)      Change Kernel Networking Parameters                      Yes
nettune (10.x)  Change Kernel Networking Parameters                      Yes

2–30. TEXT PAGE: lanadmin

The lanadmin command tests, displays statistics for, and allows modifications to LAN cards on the HP-UX system. Specific capabilities include:

• Resetting the LAN card and executing the LAN card self-tests

• Displaying and clearing LAN card statistics

• Changing the LAN card speed, the MTU size, and the link level address

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: Kernel registers and Network Interface Card

Type of Data: Network

Metrics: Packet transmission status and errors

Logging: Standard output device

Overhead: minimal

Unique Feature: Allows LAN interface card to be reset.

Full Pathname: /usr/sbin/lanadmin

Pros and Cons:
+ provides extensive transmission statistics
+ allows tuning of parameters that would normally require source code changes
- many statistics have little to no documentation

Syntax

/usr/sbin/lanadmin [-e] [-t]
/usr/sbin/lanadmin [-a] [-A station_addr] [-m] [-M mtu_size] [-R] [-s] [-S speed] NetMgmtID

-e  Echo the input commands on the output device.
-t  Suppress the display of the command menu before each command prompt.

Example

# lanadmin
Test Selection mode.
     lan      = LAN Interface Administration
     menu     = Display this menu
     quit     = Terminate the Administration
     verbose  = Display command menu
Enter command: lan

LAN Interface test mode. LAN Interface Net Mgmt ID = 4
     clear    = Clear statistics registers
     display  = Display LAN Interface status and statistics registers
     end      = End LAN Interface Administration, return to Test Selection
     menu     = Display the menu
     ppa      = PPA Number of the LAN Interface
     quit     = Terminate the Administration, return to shell
     nmid     = Network Management ID of the LAN Interface
     reset    = Reset LAN Interface to execute its selftest
     specific = Go to Driver specific menu
Enter command: display

Network Management ID          = 4
Description                    = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                   = ethernet-csmacd(6)
MTU Size                       = 1500
Speed                          = 10000000
Station Address                = 0x8000935c9bd
Administration Status (value)  = up(1)
Operation Status (value)       = up(1)
Last Change                    = 14465
Inbound Octets                 = 3606105787
Inbound Unicast Packets        = 2767086
Inbound Non-Unicast Packets    = 88379016
Inbound Discards               = 0
Inbound Errors                 = 464396
Inbound Unknown Protocols      = 7114206
Outbound Octets                = 458391388
Outbound Unicast Packets       = 2842387
Outbound Non-Unicast Packets   = 2874
Outbound Discards              = 0
Outbound Errors                = 0
Outbound Queue Length          = 0
Specific                       = 655367

Ethernet-like Statistics Group
Index                          = 4
Alignment Errors               = 0
FCS Errors                     = 0
Single Collision Frames        = 21353
Multiple Collision Frames      = 42774
Deferred Transmissions         = 281589
Late Collisions                = 0
Excessive Collisions           = 0
Internal MAC Transmit Errors   = 0
Carrier Sense Errors           = 0
Frames Too Long                = 0
Internal MAC Receive Errors    = 0
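
Counters such as Single and Multiple Collision Frames are easier to interpret relative to outbound traffic. The sketch below computes a collision rate from the counter values in the sample display; the computation, not the specific numbers, is the point, and any real analysis would read fresh counters from a new lanadmin display.

```shell
# Estimate the Ethernet collision rate from lanadmin-style counters.
# Values are copied from the sample display above.
out_packets=2842387     # Outbound Unicast Packets (the dominant term here)
single=21353            # Single Collision Frames
multiple=42774          # Multiple Collision Frames

# Collisions as a percentage of outbound packets.
awk -v p="$out_packets" -v s="$single" -v m="$multiple" 'BEGIN {
    printf "collision rate: %.1f%% of outbound packets\n", 100 * (s + m) / p
}'
```

With the sample counters above this prints collision rate: 2.3% of outbound packets.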

2–31. TEXT PAGE: lanscan

The lanscan command displays the LAN card configuration and status. Items displayed include:

• Hardware address of LAN card slot

• Link level address of card

• Hardware status and interface status

• Other status and configuration information

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: Network Interface Card

Type of Data: Network

Metrics: Interface status, Link Level Address

Logging: Standard output device

Overhead: minimal

Unique Feature: Shows Link Level Address of system.

Full Pathname: /usr/sbin/lanscan

Pros and Cons:
+ provides additional status information about network interface cards
- no performance information

Syntax

lanscan [-ainv] [system [core]]

-a  Display station addresses only. No headings.
-i  Display interface names only. No headings.
-n  Display Network Management IDs only. No headings.
-v  Verbose output. Two lines per interface. Includes display of the extended station address and supported encapsulation methods.

Examples

Output from a 10.x system:

# lanscan
Hardware Station        Crd Hardware Net-Interface NM MAC   HP DLPI Mjr
Path     Address        In# State    NameUnit State ID Type Support Num
2/0/2    0x080009D2C2DE 0   UP       lan0     UP    4  ETHER Yes    52

Output from an 11.x system:


# lanscan
Hardware Station        Crd Hdw   Net-Interface NM MAC   HP-DLPI DLPI
Path     Address        In# State NamePPA       ID Type  Support Mjr#
2/0/2    0x08000978BDB0 0   UP    lan0 snap0    1  ETHER Yes     119


2–32. TEXT PAGE: nettune (HP-UX 10.x Only)

The nettune command allows modifications to be made to network parameters, which in previous releases were not modifiable. This command was not included with any HP-UX 11.x release. Parameters that can be modified with nettune include:

• arp configuration
• socket buffer sizes
• enable or disable IP forwarding

CAUTION: Use caution when making modifications with this tool. It is possible to hurt network performance severely or disable the LAN card when using this tool.

Tool Source: HP

Documentation: man pages, nettune help options (-?, -l, -h)

Interval: on demand

Data Source: Kernel registers and NIC

Type of Data: Global

Metrics: LAN tunable parameters

Logging: Standard output device

Overhead: minimal

Unique Feature: Change values of network parameters, which cannot otherwise be changed

Change TCP send and receive buffer sizes without need for source code

Full Pathname: /usr/contrib/bin/nettune

Pros and Cons: + provides ability to modify networking behavior without needing source code

+ provides access to tunable parameters normally not available

- can have a negative impact on performance if used the wrong way

- minimal documentation

Syntax

nettune [-w] object [parm...]
nettune -h [-w] [object]
nettune -l [-w] [-b size] [object [parm...]]
nettune -s [-w] object [parm...] value...

-h (help) Print all information related to the object. This information provides helpful hints about changing the value of an object.

-l (list) Print information regarding changing the value of object.


-s (set) Set object to value. An object may require more than one value.
-w Display warning messages (for example, 'value truncated'). These are normally discarded when the command is successful.

Examples

To get help information on all defined objects:

nettune -h

arp_killcomplete: The number of seconds that an arp entry can be in the completed state between references. When a completed arp entry is unreferenced for this period of time, it is removed from the arp cache.
. . .

To get help information on all TCP-related objects:

nettune -h tcp

tcp_receive: The default socket buffer size in bytes for inbound data.
tcp_send: The default socket buffer size in bytes for outbound data.
. . .

To set the value of the ip_forwarding object to 1:

nettune -s ip_forwarding 1

To get the value of the tcp_send object (socket send buffer size):

nettune tcp_send


2–33. TEXT PAGE: ndd (HP-UX 11.x Only)

The ndd command allows the examination and modification of several tunable parameters that affect networking operation and behavior. It accepts arguments on the command line or may be run interactively. The -h option displays all the supported and unsupported tunable parameters that ndd provides.

CAUTION: ndd was ported to HP-UX and contains references to some parameters that have not been implemented on the HP-UX O/S at this time. Reference the man page when in doubt. (Just because you can display a symbol's value and set it doesn't necessarily mean that the HP-UX kernel references the symbol!)

The ndd utility command accesses kernel parameters through the use of "pseudo device files". These pseudo device files are referred to as a network device on the ndd command line and selected from the following list:

/dev/arp    For ARP cache-related values
/dev/ip     For IP routing and forwarding parameters
/dev/rawip  Default IP time-to-live header value
/dev/tcp    Transmission Control Protocol (connection-based) parameters
/dev/udp    User Datagram Protocol (connectionless) parameters

Tool Source: HP

Documentation: man pages, ndd -h (for help options)

Interval: on demand

Data Source: network device pseudo device files (see the list above)

Type of Data: Global

Metrics: LAN tunable parameters

Logging: Standard output device

Overhead: minimal

Unique Feature: Change values of network parameters, which cannot otherwise be changed

Full Pathname: /usr/bin/ndd

Pros and Cons: + provides ability to modify networking behavior without needing source code

+ provides access to tunable parameters normally not available

- can have a negative impact on performance if used the wrong way

- minimal documentation

Syntax

ndd -get network device parameter


ndd -set network device parameter
ndd -h sup[ported]
ndd -h unsup[ported]
ndd -h [parameter]
ndd -c

At boot:

The file /etc/rc.config.d/nddconf contains tunable parameters that will be set automatically each time the system boots.
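As a sketch of that boot-time mechanism, each nddconf entry is an indexed triplet naming the transport, the tunable, and its value; the layout below follows the format documented in the file's own comments. The two settings chosen here are illustrative assumptions, not tuning recommendations.

```shell
# /etc/rc.config.d/nddconf -- sketch of two boot-time settings
# (illustrative values only; each setting is an indexed triplet)

TRANSPORT_NAME[0]=ip             # pseudo device: /dev/ip
NDD_NAME[0]=ip_forwarding        # tunable to set at every boot
NDD_VALUE[0]=0                   # 0 = do not forward IP packets

TRANSPORT_NAME[1]=tcp            # pseudo device: /dev/tcp
NDD_NAME[1]=tcp_conn_request_max
NDD_VALUE[1]=1024                # deeper listen queue for busy servers
```

After editing the file, `ndd -c` applies the entries without a reboot, as shown in the examples that follow.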

Examples

To list the contents of the ARP cache:

ndd -get /dev/arp arp_cache_report

To get help information on all supported tunable parameters:

ndd -h supported

To get a detailed description of the tunable parameter ip_forwarding:

ndd -h ip_forwarding

To get the current value of the tunable parameter ip_forwarding:

ndd -get /dev/ip ip_forwarding

To set the value of the default TTL parameter for UDP to 128:

ndd -set /dev/udp udp_def_ttl 128

To re-read the configuration file /etc/rc.config.d/nddconf without rebooting the system:

ndd -c


2–34. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only)

The NetMetrix product makes use of LAN probes to collect network traffic information. The LAN probes attach to the physical network and collect detailed information regarding the packets that pass through the probe. Tools available with NetMetrix include:

• packet decoders
• network alarming capabilities
• reports including top packet generating systems
• data collection for trending

Tool Source: HP

Documentation: man pages, NRF (Network Response Facility) manual

Interval: on demand

Data Source: LAN probes

Type of Data: LAN traffic

Metrics: number of packets through cross-section of network

Logging: NetMetrix binary file

Overhead: Varies, depending on the number of LAN probes

Unique Feature: Provides statistics regarding traffic on the entire network

Pros and Cons: + Statistics regarding total packet traffic
- Additional cost
- Requires LAN probes

NetMetrix Notes

• NetMetrix makes use of highly sophisticated devices (LAN probes) capable of collecting large amounts of detailed network information.

• NetMetrix is a truly distributed network management product that makes use of "mid-level managers" for data storage and alarming.

• There are a number of modules available with NetMetrix.

• NetMetrix's Internet Response Manager (IRM) and Internet Response Agent (IRA) fully integrate with HP OpenView products to provide a complete system and network management solution.


2–35. SLIDE: Performance Administrative Tools (Standard UNIX)

Student Notes

This slide shows the standard UNIX administrative performance tools included with HP-UX. These tools are used to tune or modify system resources to improve the performance of a system. They are typically used to change or tune a system component, as opposed to viewing or displaying characteristics of the component. Only the root user is allowed to use these commands, because making these modifications affects performance for all users on the system.

NOTE: The ipcs program is really a performance-monitoring command; however, because it is usually run in conjunction with ipcrm, it is covered here to emphasize the relationship between the two commands.

Performance Administrative Tools (Standard UNIX)

Command  Resource                                                        Super User Access Required
nice     Setting Process Priorities                                      Yes
renice   Modifying Process Priorities                                    Yes
ipcrm    Destroy Semaphores, Message Queues, and Shared Memory Segments  Yes
ipcs     List Semaphores, Message Queues, and Shared Memory Segments     No


2–36. TEXT PAGE: ipcs, ipcrm

The ipcs command displays information about active interprocess communication facilities. With no options, ipcs displays information in short format about message queues, shared memory segments, and semaphore sets that are currently active in the system. The ipcrm command removes one or more specified message-queue, semaphore-set, or shared-memory identifiers. Tool Source: Standard UNIX (System V)

Documentation: man pages

Interval: on demand

Data Source: Kernel registers

Type of Data: Global, limited process

Metrics: semaphore sets, message queues, shared memory

Logging: Standard output device

Overhead: varies, depending on the IPC resource in use

Unique Feature: Shows the size, owner, and last user of message queues and shared memory segments.

Full Pathname: /usr/bin/ipcs and /usr/bin/ipcrm

Pros and Cons: + shows orphan IPC entries
+ shows size of message queues and shared memory segments
- process information limited to owner and last user

Syntax

ipcrm [-m shmid] [-q msqid] [-s semid]
ipcs [-mqs] [-abcopt] [-C corefile] [-N namelist]

-m Display information about active shared memory segments.
-q Display information about active message queues.
-s Display information about active semaphore sets.
-b Display largest-allowable-size information.
-c Display creator's login name and group name.
-o Display information on outstanding usage.
-p Display process number information.
-t Display time information.


Examples

# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:56:36 1997
T    ID     KEY        MODE        OWNER   GROUP
Semaphores:
s     0 0x2f180002 --ra-ra-ra-     root     sys
s     3 0x412000a9 --ra-ra-ra-     root    root
s     4 0x00446f6e --ra-r--r--     root    root
s     6 0x01090522 --ra-r--r--     root    root
s     7 0x013d8483 --ra-r--r--     root    root
s   200 0x4c1c2f79 --ra-r--r--   daemon  daemon
# ipcrm -s 7
# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:57:42 1997
T    ID     KEY        MODE        OWNER   GROUP
Semaphores:
s     0 0x2f180002 --ra-ra-ra-     root     sys
s     3 0x412000a9 --ra-ra-ra-     root    root
s     4 0x00446f6e --ra-r--r--     root    root
s     6 0x01090522 --ra-r--r--     root    root
s   200 0x4c1c2f79 --ra-r--r--   daemon  daemon


2–37. TEXT PAGE: nice, renice

The nice command executes a command at a nondefault CPU scheduling priority. (The name is derived from being "nice" to other system users by running large programs at a weaker priority.) The renice command alters the nice value of an existing process.

Tool Source: Standard UNIX (System V)

Documentation: man pages

Interval: on demand

Data Source: process table

Type of Data: processes

Metrics: priority

Logging: standard output device

Overhead: minimal

Unique Feature:

Full Pathname: /usr/bin/nice and /usr/bin/renice

Pros and Cons: + allows less important processes to run in the background
+ allows more important processes to run in the foreground
- not an intuitive interface or syntax

Syntax

nice [-n newoffset_from_default_20] command [command_args]
renice [-n newoffset_from_current_value] [-g|-p|-u] id ...

An unsigned newoffset increases the system nice value for the command or process, causing it to run at a weaker priority. A negative value requires superuser privileges and assigns a lower system nice value (stronger priority) to the process.

Examples

# ps -l
F S UID  PID PPID  C PRI NI    ADDR  SZ   WCHAN TTY   TIME COMD
1 S   0 6044 6042  1 158 20  ff6680  85  87cec0 ttyp2 0:00 sh
1 R   0 8286 6044  6 179 20 1003d80  22       - ttyp2 0:00 ps
# nice sh
# ps -l
F S UID  PID PPID  C PRI NI    ADDR  SZ   WCHAN TTY   TIME COMD
1 S   0 6044 6042 11 158 20  ff6680  85  87cec0 ttyp2 0:00 sh
1 S   0 8290 8287  0 158 30  ff1680  85 100d3e0 ttyp2 0:00 sh
1 R   0 8293 8290  4 199 30  feae80  22       - ttyp2 0:00 ps
# exit


# nice -10 sh
# ps -l
F S UID  PID PPID  C PRI NI    ADDR  SZ  WCHAN TTY   TIME COMD
1 S   0 6044 6042  0 158 20  ff6680  85 87cec0 ttyp2 0:00 sh
1 R   0 8297 8294  7 199 30  ff1280  22      - ttyp2 0:00 ps
1 S   0 8294 6044 10 158 30  fea380 121 87e0c0 ttyp2 0:00 sh
# nice -5 ps -l
F S UID  PID PPID  C PRI NI    ADDR  SZ  WCHAN TTY   TIME COMD
1 S   0 6044 6042  0 158 20  ff6680  85 87cec0 ttyp2 0:00 sh
1 R   0 8304 8294 10 210 35 1003e80  22      - ttyp2 0:00 ps
1 S   0 8294 6044 10 158 30  fea380 121 87e0c0 ttyp2 0:00 sh
# nice -n 30 sh
# ps -l
F S UID  PID PPID  C PRI NI    ADDR  SZ  WCHAN TTY   TIME COMD
1 S   0 6044 6042  0 158 20  ff6680  85 87cec0 ttyp2 0:00 sh
1 S   0 8305 8294 19 158 39  fb3300 121 87d6c0 ttyp2 0:00 sh
1 S   0 8294 6044  6 158 30  fea380 121 87e0c0 ttyp2 0:00 sh
1 R   0 8308 8305  4 220 39  feae80  22      - ttyp2 0:00 ps
# exit
# nice -n -30 sh
# ps -l
F S UID  PID PPID  C PRI NI    ADDR  SZ  WCHAN TTY   TIME COMD
1 S   0 6044 6042  0 158 20  ff6680  85 87cec0 ttyp2 0:00 sh
1 S   0 8306 8294  1 158 30  f86200 121 87dc40 ttyp2 0:00 sh
1 S   0 8309 8306  7 158  0  fea380 121 87e0c0 ttyp2 0:00 sh
1 R   0 8312 8309  6 139  0 1003980  22      - ttyp2 0:00 ps
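The offset arithmetic in the transcripts above (a default nice value of 20, raised or lowered by the requested offset) can also be demonstrated on any system with GNU coreutils, where nice invoked without a command prints the current niceness. This is a minimal portable sketch, not HP-UX-specific; HP-UX ps -l shows the same value with the bias of 20 included (NI=30 for an offset of 10), while most other implementations report the 0-based offset directly.

```shell
# Invoked with no command, GNU `nice` prints the current niceness.
nice             # typically prints 0 in a fresh shell
# Run `nice` under itself: the child inherits the raised value.
nice -n 10 nice  # prints 10 when the parent runs at niceness 0
```

The second command shows the inheritance rule that the ps -l output above illustrates: a shell started with a nice offset passes that offset on to every command it runs.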


2–38. SLIDE: Performance Administrative Tools (HP-Specific)

Student Notes This slide shows the HP-specific administrative performance tools available on HP-UX systems. Many of the tools shown on the slide come standard with the base OS. The only tools that are add-on products are PRM, WLM, WebQoS, and Advanced JFS (getext, setext, and fsadm). These HP-specific tools were developed to allow modifications and performance enhancements to the functionality unique to the HP-UX operating system.

Performance Administrative Tools (HP-Specific)

Command          Resource                              Super User Access Required
scsictl          Set parameters on SCSI devices        Yes
serialize        Mark a program to run serially        Privileged Access
fsadm            Online JFS management tool            Yes
getext           Display JFS extent attributes         No
setext           Sets/changes JFS extent attributes    Yes
newfs            Create a file system                  Yes
tunefs/vxtunefs  Change a file system's attributes     Yes
PRM/WLM          Process Resource Mgr/Work Load Mgr    Yes
rtprio           Set real time process priority (HP)   Privileged Access
rtsched          Set POSIX real time process priority  Privileged Access
WebQoS           Web Quality of Service                Yes
setprivgrp       Allocate special system privileges    Yes
getprivgrp       List system privileged groups         No


2–39. TEXT PAGE: getprivgrp, setprivgrp

The getprivgrp command lists the access privileges of privileged groups. The setprivgrp command sets the access privileges of privileged groups. If a group_name is supplied, access privileges are listed for that group only. The superuser is a member of all groups. Access privileges include RTPRIO, RTSCHED, MLOCK, CHOWN, LOCKRDONLY, SETRUGID, MPCTL, SPUCTL, and SERIALIZE. Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: /etc/group and kernel data structures

Type of Data: users and groups

Metrics: privilege access

Logging: Standard output device

Overhead: minimal

Unique Feature: Gives non-root users access to privileges normally requiring root access.

Full Pathname /usr/bin/getprivgrp and /usr/sbin/setprivgrp

Pros and Cons: + ability to assign additional privileges to groups
- requires additional system management
- cannot give privilege to a single user; must assign privileges to groups

Syntax

getprivgrp [-g|group_name]
setprivgrp [-g|groupname] [privileges]

-g Specify global privileges that apply to all groups.

Examples

# getprivgrp
global privileges: CHOWN
# setprivgrp class CHOWN SERIALIZE RTPRIO
# getprivgrp
global privileges: CHOWN
class: RTPRIO CHOWN SERIALIZE
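Privileges granted interactively with setprivgrp do not survive a reboot. HP-UX applies the file /etc/privgroup at startup (via setprivgrp -f /etc/privgroup), one line per group: the group name followed by the privileges to grant. A minimal sketch, reusing the class group from the example above:

```shell
# /etc/privgroup -- sketch; applied at boot with `setprivgrp -f /etc/privgroup`
# one line per group: group name, then the privileges to grant it
class RTPRIO CHOWN SERIALIZE
```

Entries here use the same privilege keywords (RTPRIO, MLOCK, CHOWN, and so on) listed in the Notes section for this tool.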


Notes

• Group privileges which can be modified are:

RTPRIO Can use rtprio() call to set real-time priorities.

RTSCHED Can use sched_setparam() call and sched_setscheduler() call to set POSIX.4 real-time priorities.

MLOCK Can use plock() to lock process text and data into memory, and the shmctl() SHM_LOCK function to lock shared memory segments

CHOWN Can use chown() to change file ownership.

LOCKRDONLY Can use lockf() to set locks on files that are open for reading only.

SETRUGID Can use setuid() and setgid() to change, respectively, the real user ID and real group ID of a process.

SERIALIZE Can use serialize() to force the target process to run serially with other processes that are also marked by this system call.

MPCTL Can use mpctl() to lock a process or a thread to a specific processor on SMP systems. If processor sets are available, can be used to lock a process or a thread to a specific processor set.

SPUCTL Can use spuctl() to enable and disable specific processors on SMP systems. (V-class, T-class, N-class, L-class, and Superdome only)


2–40. TEXT PAGE: rtprio

The rtprio command executes a specified command with a real-time priority, or changes the real-time priority of a currently executing process with a specific PID. Real-time priorities range from zero (strongest) to 127 (weakest). Real-time processes are not subject to priority degradation and are considered of greater importance than all non-real-time processes.

CAUTION: Special care should be taken when using this command. It is possible to lock out other processes (including system processes) when using this command.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: process table

Type of Data: process

Metrics: process priority

Logging: none

Overhead: varies, depending on the activity of the process

Unique Feature: assign real time priority to a process

Full Pathname: /usr/bin/rtprio

Pros and Cons: + Can significantly improve the performance of a program
- Can severely impact the performance of the system (if used incorrectly)

Syntax

rtprio priority command [arguments]
rtprio priority -pid
rtprio -t command [arguments]
rtprio -t -pid

-t Execute command with a timeshare (non-real-time) priority, or change the currently executing process pid from a possibly real-time priority to a timeshare priority.

Examples

Execute file a.out at a real-time priority of 100:

rtprio 100 a.out

Set the currently running process PID 24217 to a real-time priority of 40 (note the leading dash before the PID, per the syntax above):

rtprio 40 -24217


2–41. TEXT PAGE: rtsched

The rtsched command executes commands with POSIX or HP-UX real-time priority, or changes the real-time priority of a currently executing process PID. All POSIX real-time priority processes are of greater scheduling importance than processes with HP-UX real-time or HP-UX timeshare priority. Neither POSIX nor HP-UX real-time processes are subject to degradation. POSIX real-time processes can be scheduled with one of three POSIX scheduling policies: SCHED_FIFO, SCHED_RR, or SCHED_RR2. The number of POSIX real-time priority queues is tunable between the values of 32 and 512; these priorities show up as a negative number between -1 and -512 when viewed with the ps -ef or ps -el commands.

CAUTION: Special care should be taken when using this command. It is possible to lock out other processes (including system processes) when using this command.

Tool Source: HP

Documentation: man pages (also see rtsched(2) )

Interval: on demand

Data Source: process table

Type of Data: process

Metrics: process priority

Logging: none

Overhead: varies, depending on the activity of the process

Unique Feature: assign real time priority to a process

Full Pathname: /usr/bin/rtsched

Pros and Cons: + Can significantly improve the performance of a program
- Can severely impact the performance of the system (if used incorrectly)

Syntax

rtsched -s scheduler -p priority command [arguments]
rtsched [-s scheduler] -p priority -P pid

-s Specifies which scheduler to use: SCHED_FIFO (POSIX real-time), SCHED_RR (POSIX real-time), SCHED_RR2 (POSIX real-time), SCHED_RTPRIO (HP-UX real-time), or SCHED_HPUX (HP-UX timeshare)


Examples

Execute file a.out at a POSIX real-time priority of 4:

rtsched -s SCHED_FIFO -p 4 a.out

Set the currently running process PID 24217 to a real-time priority of 20:

rtsched -s SCHED_RR -p 20 -P 24217


2–42. TEXT PAGE: scsictl

The scsictl command provides a mechanism for controlling a SCSI device. It can be used to query mode parameters, set configurable mode parameters, and perform SCSI commands. The operations are performed in the same order as they appear on the command line. Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: SCSI disks

Type of Data: disks

Metrics: immediate reporting, I/O queue

Logging: standard output device

Overhead: minimal

Unique Feature: Provides control over the behavior of an individual SCSI disk

Full Pathname: /usr/sbin/scsictl

Pros and Cons: + can improve performance by modifying the drive behavior
- not all SCSI devices support the command
- could misconfigure a disk, causing data to be lost in the event of a system crash

Syntax

scsictl [-akq] [-c command]... [-m mode[=value]]... device

-a Display the status of all mode parameters available.
-m mode Display the status of the specified mode parameter.
   ir           For devices that support immediate reporting, this displays the immediate reporting status.
   queue_depth  For devices that support a queue depth greater than the system default, this mode controls how many I/Os the driver will attempt to queue to the device at any one time.
-m mode=value Set the mode parameter mode to value. The available mode parameters and values are listed above.


Examples

To display a list of all of the mode parameters, turn immediate_report on, and redisplay the value of immediate_report.

scsictl -a -m ir=1 -m ir /dev/rdsk/c0t6d0

will produce the following output:

immediate_report = 0; queue_depth = 8
immediate_report = 1


2–43. TEXT PAGE: serialize

The serialize command is used to force the target process to run serially with other processes also marked by this command. Once a process has been marked by serialize, the process stays marked until process completion, unless serialize is reissued. Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: process table

Type of Data: process

Metrics: priority

Logging: standard output device

Overhead: minimal

Unique Feature: decreases CPU and memory contention problems using standard functionality.

Full Pathname: /usr/bin/serialize

Pros and Cons: + allows system to behave more efficiently when CPU and memory resources are scarce
- minimal documentation
- only helps when CPU and memory resources are scarce

Syntax

serialize command [command_args]
serialize [-t] [-p pid]

-t Indicates the process specified by pid should be returned to timeshare scheduling.

Examples

Use serialize to force a database application to run serially with other processes marked for serialization. Type:

serialize database_app

Force a currently running process with a PID value of 215 to run serially with other processes marked for serialization. Type:

serialize -p 215

Return a process previously marked for serialization to normal timeshare scheduling. The PID of the target process for this example is 174. Type:

serialize -t -p 174


2–44. TEXT PAGE: fsadm

The fsadm command is designed to perform selected administration tasks on HFS (10.20 or later) and JFS file systems. These tasks may differ between file system types. For HFS file systems, fsadm allows conversions between large and nolarge files. For VxFS file systems, fsadm allows file system resizing, extent (and directory) reorganization, and large/nolarge file conversions. Tool Source: Veritas and HP

Documentation: man pages

Interval: on demand

Data Source: File System superblock and header structures

Type of Data: file system header and data

Metrics: fragmentation

Logging: standard output device

Overhead: Medium to large (up to 33%), depending on the number of users and amount of activity

Unique Features: Can defragment a file system, improving performance. (JFS) Can increase the size of a file system while it's mounted. (JFS)

Full Pathname: /usr/sbin/fsadm

Pros and Cons: + provides greater manageability of file systems
- many features (including defragmentation) are only available for JFS
- requires purchasing the AdvancedJFS or OnlineJFS product

Syntax

/usr/sbin/fsadm [-F vxfs|hfs] [-V] [-o largefiles|nolargefiles] mount_point|special
/usr/sbin/fsadm [-F vxfs] [-V] [-b newsize] [-r rawdev] mount_point
/usr/sbin/fsadm [-F vxfs] [-V] [-d] [-D] [-s] [-v] [-a days] [-t time] [-p passes] [-r rawdev] mount_point

Examples

HFS Example

Convert a nolargefiles HFS file system to a largefiles HFS file system:

fsadm -F hfs -o largefiles /dev/vg02/lvol1

Display relevant HFS file system statistics:

fsadm -F hfs /dev/vg02/lvol1


JFS Example

Increase the size of the var file system to 100 MB while it is mounted and online:

lvextend -L 100 /dev/vg00/lvol7
fsadm -F vxfs -b 102400 /var

Display fragmentation statistics for the /home file system:

fsadm -D -E /home


2–45. TEXT PAGE: getext, setext

The getext command displays extent attribute information of associated files on a JFS file system. The setext command allows attributes related to JFS file systems and files within the JFS file system to be modified and tuned. Tool Source: Veritas

Documentation: man pages

Interval: on demand

Data Source: JFS file system

Type of Data: File system metadata structures

Metrics: File system space allocation

Logging: standard output device

Overhead: minimal

Unique Feature: Allows attributes of JFS files to be set

Full Pathname: /usr/sbin/getext and /usr/sbin/setext

Pros and Cons: + can improve file system performance by modifying file attributes
- requires purchase of the AdvancedJFS or OnlineJFS product

Syntax

/usr/sbin/getext [-V] [-f] [-s] file...
/usr/sbin/setext [-V] [-e extent_size] [-r reservation] [-f flag] file

Example

Display file attributes for the file, file1:

getext file1
file1: Bsize 1024 Reserve 36 Extent Size 3 align noextend

The above output indicates a file with 36 blocks of reservation, a fixed extent size of 3 blocks, all extents aligned to 3-block boundaries, and the file cannot be extended once the current reservation is exhausted.


2–46. TEXT PAGE: newfs, tunefs, vxtunefs

The newfs command is a "friendly" front-end to the mkfs command. The newfs command calculates the appropriate parameters and then builds the file system by invoking the mkfs command. The tunefs command displays detailed configuration information for an HFS file system and allows some of the file system parameters to be modified. Tool Source: BSD 4.x, modified by HP, Veritas

Documentation: man pages

Interval: not applicable, on demand

Data Source: file system header and superblock

Type of Data: file system metadata structures

Metrics: Block size, Fragment size, Minimum free space

Logging: standard output

Overhead: minimal

Unique Feature: Allows file system parameters to be displayed and set.

Full Pathname: /usr/sbin/newfs, /usr/sbin/tunefs, /usr/sbin/vxtunefs

Pros and Cons: + File system parameters can be viewed and tuned for optimal performance

- To tune many parameters, a re-initialization of the file system is required

Syntax

/usr/sbin/newfs [-F FStype] [-o specific_options] [-V] special
/usr/sbin/tunefs [-A] [-v] [-a maxcontig] [-d rotdelay] [-e maxbpg] [-m minfree] special-device
/usr/sbin/vxtunefs

Notes

The initial file system parameters are set when the file system is first created with newfs. A small set of these parameters can be changed after the file system is created with tunefs. vxtunefs changes the attributes of the JFS file system when the file system is mounted.

NOTE: The tunefs command works only for HFS file systems. The JFS file systems use other commands (getext, setext, vxtunefs).


Examples

Create a file system on vg01 called lvol1.

newfs -F hfs -b 16384 -f 2048 /dev/vg01/rlvol1
mkfs (hfs): Warning - 2 sector(s) in the last cylinder are not allocated.
mkfs (hfs): /dev/vg01/rlvol1 - 20480 sectors in 133 cylinders of 7 tracks, 22 sectors
        21.0Mb in 9 cyl groups (16 c/g, 2.52Mb/g, 384 i/g)
Super block backups (for fsck -b) at:
     16, 2512, 5008, 7504, 10000, 12496, 14992, 17488, 19728

View the file system's configuration parameters:

tunefs -v /dev/vg01/rlvol1
super block last mounted on:
magic   95014   clean   FS_CLEAN        time    Fri Nov 28 07:02:58 1997
sblkno  8       cblkno  16      iblkno  24      dblkno  48
sbsize  2048    cgsize  2048    cgoffset 16     cgmask  0xfffffff8
ncg     9       size    10240   blocks  9858
bsize   16384   bshift  14      bmask   0xffffc000
fsize   2048    fshift  11      fmask   0xfffff800
frag    8       fragshift 3     fsbtodb 1
minfree 10%     maxbpg  38      maxcontig 1     rotdelay 0ms    rps 60
csaddr  48      cssize  28672   csshift 10      csmask  0xfffffc00
ntrak   7       nsect   22      spc     154     ncyl    133
cpg     16      bpg     154     fpg     1232    ipg     384
nindir  4096    inopb   128     nspf    2
nbfree  1230    ndir    2       nifree  3452    nffree  9
cgrotor 0       fmod    0       ronly   0
fname           fpack
cylinders in last group 5
blocks in last group 48

For VxFS file systems use:

# fsdb -F vxfs /dev/vgNN/rlvolN
> 8192B
> p S


2–47. TEXT PAGE: Process Resource Manager (PRM)

Process Resource Manager (PRM) allows the administrator to guarantee that important processes receive the amount of memory, disk bandwidth, and CPU time required to meet performance objectives. PRM works in conjunction with the standard HP-UX scheduler to improve response times for critical applications. PRM provides state-of-the-art resource allocation that has long been missing in the UNIX environment.

Tool Source: HP

Documentation: PRM man pages (prmconfig)

Interval: on demand

Data Source: kernel registers and counters

Type of Data: process groups as defined by the PRM configuration file.

Metrics: CPU time, memory, and disk I/O bandwidth allocated to groups of processes

Logging: standard output, glance, gpm, perfview/OVPM, measureware/OVPA

Overhead: PRM only applies to time-shared processes. Real-time processes are not affected.

Unique Features: allows the system administrator to control which groups of processes receive a certain percentage of the CPU's time, memory paging, and/or disk I/O request preference.

CPU (per PRM group) entitlement and capping

DISK (per PRM group per VG) entitlement

Memory (per PRM group) entitlement, capping and selection method

Application (per PRM group)

Full Pathname: /usr/sbin/prmconfig

Pros and Cons: + Greater control of resource distributions

- Optional product. Does not come standard with the OS. If you are running 11i in the Enterprise or Mission Critical Operating Environments, PRM is included.

See the course U5447S – “HP-UX Resource Management with PRM & WLM” for a more complete discussion of PRM.


2–48. TEXT PAGE: Work Load Manager (WLM)

The Work Load Manager sits on top of PRM and tunes it as necessary to meet the desired performance goals. The goals are defined in a configuration file in the form of Service Level Objectives (SLOs). The administrator defines these goals in the file and then lets WLM "tweak" PRM until the goals are either met or approached as closely as possible.

Tool Source: HP

Documentation: WLM man pages (wlmd)

Interval: on demand

Data Source: kernel registers and counters

Type of Data: process groups as defined by the WLM configuration file

Metrics: As defined in the WLM configuration file

Logging: Data can be sent to an EMS (Event Monitoring System)

Overhead: Data collection of defined metrics and adjusting of PRM configuration

Unique Features: allows the system administrator to define what Service Level Objectives are desired on the system and lets WLM "tune" the system (via PRM) to obtain performance as close to those objectives as possible.

CPU (per WLM group) entitlement

DISK (per WLM group per VG) entitlement

Memory (per WLM group) entitlement

Application (per WLM group)
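To make the SLO idea concrete, here is a schematic sketch of a WLM configuration file. The group names, metric name, and numeric values are invented for illustration; consult the WLM documentation for the exact syntax supported by your release.

```
# Illustrative WLM configuration sketch (group names, metric, and values are invented)
prm {
    groups = batch : 2, oltp : 3;
}

# Give the oltp group priority 1, with a response-time goal;
# WLM adjusts the group's CPU entitlement between mincpu and maxcpu
slo oltp_response {
    pri = 1;
    mincpu = 20;
    maxcpu = 80;
    entity = PRM group oltp;
    goal = metric oltp_resp_time < 2.0;
}
```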

Full Pathname: /opt/wlm/bin/wlmd

Pros and Cons: + Greater control of CPU distribution

- Optional product. Does not come standard with the OS.

See the course U5447S – “HP-UX Resource Management with PRM & WLM” for a more complete discussion of WLM.


2–49. TEXT PAGE: Web Quality of Service — WebQoS

WebQoS is an example of the growing number of system performance management/enhancement products focused on specific server applications and environments. The modern paradigm for application server management requires looking past simple performance metrics and forces us to think a little outside the box. Do all requests received by a Web server warrant the same level of service? WebQoS allows the administrator to make decisions on the service level, based on several different criteria:

• admission control
• user differentiation
• activity differentiation
• application differentiation

A discussion of the specifics of this product is beyond the scope of this class.

Tool Source: HP

Typical metrics: Number of concurrent users and response times.

Purpose: Maximize successful customer interactions and peak throughput.

Pros and Cons: + Greater control of Web server resources tuned to specific client requests.
- Optional product. Does not come standard with the OS.


2–50. SLIDE: System Configuration and Utilization Information (Standard UNIX)

Student Notes

This slide shows the standard UNIX tools for displaying system configuration and utilization information on an HP-UX system. System configuration and utilization tools are those which display configurations of LVM disks, file systems, and kernel resources.

System Configuration and Utilization Information (Standard UNIX)

Resource  Description                                   Portability
mount     Local and remote file system mounts           Yes
df        Mounted file system space                     Yes
bdf       Local and remote mounted file system space    Some


2–51. TEXT PAGE: bdf, df

The bdf command displays the amount of free disk space available. If no file system is specified, the free space on all of the normally mounted file systems is printed. Free inode information can be displayed with the -i option. The df command displays the number of free 512-byte blocks and free inodes available for file systems by examining the counts kept in the superblock or superblocks. Blocks can be displayed in 1-KB units with the -k option.

Tool Source: df, standard UNIX (System V); bdf, standard UNIX (Berkeley 4.x)

Documentation: man pages

Interval: on demand

Data Source: File system superblocks

Type of Data: Disk space resources

Metrics: Disk space utilization

Logging: Standard output

Overhead: Minimal

Unique Feature: Shows how much disk space is being utilized.

Full Pathname: /usr/bin/bdf, /usr/bin/df

Pros and Cons: + Easy to use
- Minimal tuning statistics

Syntax

/usr/bin/bdf [-b] [-i] [-l] [-t type | [filesystem|file] ... ]
/usr/bin/df [-befgiklnv] [-t|-P] [-o specific_options] [-V] [special|directory]...

Examples — bdf Command

# bdf /usr
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol7     307200  279059   -9635  103% /usr

# bdf -i /
Filesystem          kbytes    used   avail %used  iused  ifree %iuse Mounted on
/dev/vg00/lvol3      40960   25093   14869   63%   3284   3960   45% /

# bdf -ib /home
Filesystem          kbytes    used   avail %used  iused  ifree %iuse Mounted on
/dev/vg00/lvol4      53248    3586   46546    7%    513  12407    4% /home
Swapping             53248       0   40546    0%                     /home/paging

# ll /home/paging
total 0
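The %used column makes bdf handy for quick capacity alerts. As a portable sketch, awk can flag any file system above a threshold; the here-document below is hard-coded sample output, whereas on HP-UX you would pipe bdf itself.

```shell
# Flag file systems whose %used exceeds a threshold (here 90%).
# The here-document stands in for real `bdf` output.
alerts=$(awk -v limit=90 'NR > 1 { pct = $5; sub(/%/, "", pct);
    if (pct + 0 > limit) print $6, "is", $5, "full" }' <<'EOF'
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol7     307200  279059   -9635  103% /usr
/dev/vg00/lvol3      40960   25093   14869   63% /
EOF
)
echo "$alerts"   # prints: /usr is 103% full
```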


Examples — df Command

# df
/home   (/dev/vg00/lvol4 ):    93062 blocks   12403 i-nodes
/opt    (/dev/vg00/lvol5 ):   177124 blocks   23598 i-nodes
/tmp    (/dev/vg00/lvol6 ):    90010 blocks   11982 i-nodes
/usr    (/dev/vg00/lvol7 ):    52732 blocks    7011 i-nodes
/var    (/dev/vg00/lvol8 ):   100122 blocks   13320 i-nodes
/stand  (/dev/vg00/lvol1 ):    23596 blocks    5358 i-nodes


2–52. TEXT PAGE: mount

The mount command is used to mount file systems on the system. Non-root users can also use mount to list mounted file systems: if mount is invoked without any arguments, it lists all of the mounted file systems from the file system mount table, /etc/mnttab.

Tool Source: standard UNIX (System V)

Documentation: man pages

Interval: on demand

Data Source: kernel mount table and /etc/mnttab file

Type of Data: file system

Metrics: file system type and mount options

Logging: the file /etc/mnttab and standard output

Overhead: minimal

Unique Feature: used to mount HFS, JFS, and NFS file systems.

Full Pathname: /sbin/mount

Pros and Cons: + displays valuable data regarding how file systems are mounted
- different options depending on the type of file system being mounted

Syntax

/usr/sbin/mount [-l] [-p|-v]

Examples

# mount -p
/dev/root        /      vxfs  log       0 0
/dev/vg00/lvol1  /stand hfs   defaults  0 0
/dev/vg00/lvol6  /usr   vxfs  delaylog  0 0
/dev/vg00/lvol5  /tmp   vxfs  delaylog  0 0
/dev/vg00/lvol4  /opt   vxfs  delaylog  0 0
/dev/dsk/c0t4d0  /disk  hfs   defaults  0 0
/dev/vg00/lvol7  /var   vxfs  delaylog  0 0

# mount -v
/dev/root on / type vxfs log on Thu Sep 11 12:15:08 1997
/dev/vg00/lvol1 on /stand type hfs defaults on Thu Sep 11 12:15:11 1997
/dev/vg00/lvol6 on /usr type vxfs delaylog on Thu Sep 11 12:17:06 1997
/dev/vg00/lvol5 on /tmp type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/vg00/lvol4 on /opt type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/dsk/c0t4d0 on /disk type hfs defaults on Thu Sep 11 12:17:08 1997
/dev/vg00/lvol7 on /var type vxfs delaylog on Thu Sep 11 12:17:23 1997
#
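Since mount -p prints one mount per line with the file system type in a fixed column, it is easy to filter by type (useful, for example, to find the HFS file systems that tunefs can operate on). A sketch, with a here-document standing in for the real command on HP-UX:

```shell
# List mount points of a given file system type (here hfs),
# parsing `mount -p` style output.
hfs_mounts=$(awk '$3 == "hfs" { print $2 }' <<'EOF'
/dev/root        /      vxfs  log       0 0
/dev/vg00/lvol1  /stand hfs   defaults  0 0
/dev/dsk/c0t4d0  /disk  hfs   defaults  0 0
EOF
)
echo "$hfs_mounts"
```

This prints /stand and /disk, one per line.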


2–53. SLIDE: System Configuration and Utilization Information (HP-Specific)

Student Notes

This slide shows the HP-specific commands for displaying system configuration and utilization information. All the commands on the slide come standard with the base OS; none are add-on products. These commands display the configuration and utilization of HP-specific subsystems. Many of these commands have corresponding commands on other UNIX systems that perform similar functions.

System Configuration and Utilization Information (HP-Specific)

Resource   Description                                    Portability
diskinfo   Size and model of local disk drives            No
dmesg      I/O tree and memory details                    Some
vgdisplay  Local volume group contents/attributes         No
ioscan     I/O tree and addressing                        No
kcweb      Query, set, or reset system configuration      Some
kmtune     Query, set, or reset system parameters         Some
sysdef     Sizes and values of kernel tables and parms    Some
swapinfo   Swap space utilization                         No
lvdisplay  Local logical volume contents/attributes       No
pvdisplay  Local physical volume contents/attributes      No


2–54. TEXT PAGE: diskinfo

The diskinfo command determines whether the character special file named by character_devicefile is associated with a SCSI, CS/80, or Subset/80 disk drive. If so, diskinfo summarizes the disk's characteristics. Both the size of the disk and bytes per sector represent formatted media.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: controller on disk

Type of Data: disk specific

Metrics: disk capacity, sector size

Logging: standard output

Overhead: minimal

Unique Feature: shows model number and manufacturer of disk

Full Pathname: /usr/sbin/diskinfo

Pros and Cons: + can determine size and manufacturer of disk without having to open system

- minimal tuning information

Syntax

/usr/sbin/diskinfo [-b|-v] character_devicefile

The diskinfo command displays information about the following characteristics of disk drives:

• vendor name, manufacturer of the drive (SCSI only)
• product identification number or ASCII name
• type, CS/80 or SCSI classification for the device
• size of disk, specified in bytes
• sector size, specified as bytes per sector

Example

# diskinfo /dev/rdsk/c0t6d0
SCSI describe of /dev/rdsk/c0t6d0:
             vendor: QUANTUM
         product id: PD425S
               type: direct access
               size: 416575 Kbytes
   bytes per sector: 512
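Because diskinfo reports capacity in kilobytes, a small amount of shell arithmetic converts it to something more readable. A sketch, with a here-document standing in for the real command output above:

```shell
# Pull the capacity (in KB) out of diskinfo-style output and convert to MB.
# (The here-document stands in for: diskinfo /dev/rdsk/c0t6d0)
size_kb=$(awk '/^ *size:/ { print $2 }' <<'EOF'
SCSI describe of /dev/rdsk/c0t6d0:
             vendor: QUANTUM
         product id: PD425S
               type: direct access
               size: 416575 Kbytes
   bytes per sector: 512
EOF
)
echo "$((size_kb / 1024)) MB"   # prints: 406 MB
```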


2–55. TEXT PAGE: dmesg

The dmesg command looks in a system buffer for recently printed diagnostic messages and prints them on the standard output. The messages are those printed by the system when unusual events occur (such as when system tables overflow or file systems become full).

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: kernel diagnostic buffer

Type of Data: system diagnostic messages

Metrics: kernel startup information

Logging: standard output device

Overhead: minimal

Unique Feature: displays kernel diagnostic messages

Full Pathname: /sbin/dmesg

Pros and Cons: + Allows kernel diagnostic messages to be recalled
- Diagnostic messages can be lost, since the kernel buffer is a fixed size

Syntax

/usr/sbin/dmesg [-]

If the - argument is specified, dmesg computes (incrementally) the new messages since the last time it was run and places these on the standard output. This is typically used with cron (see cron(1)) to produce the error log /var/adm/messages by running the command:

/usr/sbin/dmesg - >> /var/adm/messages

every 10 minutes.
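As an illustrative crontab entry implementing that ten-minute interval (the schedule values are just one way to express it):

```
# Append new kernel diagnostics to /var/adm/messages every 10 minutes
0,10,20,30,40,50 * * * * /usr/sbin/dmesg - >> /var/adm/messages
```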

Example

# dmesg
Oct 17 12:39
vuseg=1815000
inet_clts:ok  inet_cots:ok
    1 graph3
    2 bus_adapter
2/0/1 c720
2/0/1.0 tgt
2/0/1.0.0 stape
2/0/1.2 tgt
2/0/1.2.0 sdisk
2/0/1.3 tgt
2/0/1.3.0 stape
2/0/1.4 tgt
2/0/1.4.0 sdisk
2/0/1.7 tgt
2/0/1.7.0 sctl
2/0/2 lan2
2/0/3 hil


2/0/4 asio0
2/0/5 asio0
2/0/6 CentIf
2/0/7 c720
2/0/7.5 tgt
2/0/7.5.0 sdisk
2/0/7.6 tgt
2/0/7.6.0 sdisk
2/0/7.7 tgt
2/0/7.7.0 sctl
2/0/8 audio
    4 eisa
4/0/4 lan2
    8 processor
    9 memory
System Console is on the ITE
Networking memory for fragment reassembly is restricted to 36265984 bytes
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
Swap device table:  (start & size given in 512-byte blocks)
    entry 0 - major is 64, minor is 0x2; start = 0, size = 819200
Dump device table:  (start & size given in 1-Kbyte blocks)
    entry 0 - major is 31, minor is 0x26000; start = 68447, size = 393217
Starting the STREAMS daemons.
B2352B HP-UX (B.10.20) #1: Sun Jun 9 08:03:38 PDT 1996
Memory Information:
    physical page size = 4096 bytes, logical page size = 4096 bytes
    Physical: 393216 Kbytes, lockable: 302512 Kbytes, available: 349504 Kbytes
Using 1932 buffers containing 15360 Kbytes of memory.
SCSI: Request Timeout -- lbolt: 7543017, dev: cd000001
lbp->state: 0  lbp->offset: ffffffff  lbp->uPhysScript: 2a24000
From most recent interrupt:
ISTAT: 06, SIST0: 04, SIST1: 00, DSTAT: 80, DSPS: 00000006
lsp: 1febc00  bp->b_dev: cd000001
scb->io_id: 57b13
scb->cdb: 08 00 00 08 00 00
lbolt_at_timeout: 7544517, lbolt_at_start: 7543017
lsp->state: 30d
lsp->uPhysScript: 196e000  lsp->upScript: 196a000
lsp->upActivePtr: 196a000  lsp->uActiveAdjust: 0
lsp->upSavedPtr: 196a000   lsp->uSavedAdjust: 0
lsp->upPeakPtr: 196a000    lsp->uPeakAdjust: 0
lbp->owner: 1febc00  scratch_lsp: 0
Pre-DSP script dump [1b20020]:
    78051800 00000000 78030000 00000000 0e000002 02a24700 80000000 00000000
Script dump [1b20040]:
    9f0b0000 00000006 98080000 00000005 98080000 00000001 58000008 00000000


2–56. TEXT PAGE: ioscan

The ioscan command scans system hardware, usable I/O system devices, or kernel I/O system data structures as appropriate, and lists the results. For each hardware module on the system, ioscan displays the hardware path to the hardware module, the class of the hardware module, and a brief description. By default, ioscan scans the system and lists all reportable hardware found. The types of hardware reported include processors, memory, interface cards, and I/O devices. Scanning the hardware may cause drivers to be unbound and others bound in their place in order to match actual system hardware. Entities that cannot be scanned are not listed. On very large systems, ioscan operates much faster with the -k option, which forces ioscan to read kernel structures built at boot time rather than sending fresh inquiries to each hardware module.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: SCSI devices

Type of Data: status and Hardware address

Metrics: hardware status

Logging: standard output

Overhead: minimal

Unique Feature: polls SCSI bus to retrieve status of SCSI devices

Full Pathname: /usr/sbin/ioscan

Pros and Cons: + Displays hardware addresses and corresponding device filenames.
- Minimal performance data

Syntax

/usr/sbin/ioscan [-k|-u] [-d driver|-C class] [-I instance] [-H hw_path] \ [-f[-n]|-F[-n]] [devfile]

Examples

# ioscan -f
Class     I  H/W Path   Driver       S/W State  H/W Type   Description
===========================================================================
bc        0             root         CLAIMED    BUS_NEXUS
graphics  0  0          graph3       CLAIMED    INTERFACE  Graphics
ba        0  2          bus_adapter  CLAIMED    BUS_NEXUS  Core I/O Adapter
ext_bus   0  2/0/1      c720         CLAIMED    INTERFACE  Built-in SCSI
target    0  2/0/1.0    tgt          CLAIMED    DEVICE
disk      0  2/0/1.0.0  sflop        CLAIMED    DEVICE     TEAC FC-1 HF 07
target    1  2/0/1.1    tgt          CLAIMED    DEVICE
tape      0  2/0/1.1.0  stape        CLAIMED    DEVICE     HP HP35470A
target    2  2/0/1.2    tgt          CLAIMED    DEVICE
disk      1  2/0/1.2.0  sdisk        CLAIMED    DEVICE     TOSHIBA CD-ROM XM-3301TA
target    5  2/0/1.5    tgt          CLAIMED    DEVICE
disk      4  2/0/1.5.0  sdisk        CLAIMED    DEVICE     QUANTUM FIREBALL1050S
target    6  2/0/1.6    tgt          CLAIMED    DEVICE


disk      5  2/0/1.6.0  sdisk        CLAIMED    DEVICE     QUANTUM PD425S
target    7  2/0/1.7    tgt          CLAIMED    DEVICE
ctl       0  2/0/1.7.0  sctl         CLAIMED    DEVICE     Initiator
lan       0  2/0/2      lan2         CLAIMED    INTERFACE  Built-in LAN
hil       0  2/0/3      hil          CLAIMED    INTERFACE  Built-in HIL
tty       0  2/0/4      asio0        CLAIMED    INTERFACE  Built-in RS-232C
tty       1  2/0/5      asio0        CLAIMED    INTERFACE  Built-in RS-232C
ext_bus   1  2/0/6      CentIf       CLAIMED    INTERFACE  Built-in Parallel Interface
audio     0  2/0/8      audio        CLAIMED    INTERFACE  Built-in Audio
processor 0  8          processor    CLAIMED    PROCESSOR  Processor
memory    0  9          memory       CLAIMED    MEMORY     Memory

# ioscan -fC disk
Class     I  H/W Path   Driver       S/W State  H/W Type   Description
=========================================================================
disk      5  2/0/1.6.0  sdisk        CLAIMED    DEVICE     QUANTUM PD425S

# ioscan -fnC disk
Class     I  H/W Path   Driver       S/W State  H/W Type   Description
=========================================================================
disk      5  2/0/1.6.0  sdisk        CLAIMED    DEVICE     QUANTUM PD425S
                        /dev/dsk/c0t6d0  /dev/rdsk/c0t6d0
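Since the class appears in the first column of ioscan -f output, a quick per-class inventory can be produced with awk. A sketch, using a shortened here-document in place of the real command:

```shell
# Count hardware modules per class from `ioscan -f` style output.
# (The here-document is a shortened stand-in for the real command;
#  sorting makes the output order deterministic.)
class_counts=$(awk 'NR > 2 { count[$1]++ }
    END { for (c in count) print c, count[c] }' <<'EOF' | sort
Class     I  H/W Path   Driver     S/W State  H/W Type  Description
===========================================================================
disk      0  2/0/1.0.0  sflop      CLAIMED    DEVICE    TEAC FC-1 HF 07
disk      1  2/0/1.2.0  sdisk      CLAIMED    DEVICE    TOSHIBA CD-ROM
tape      0  2/0/1.1.0  stape      CLAIMED    DEVICE    HP HP35470A
EOF
)
echo "$class_counts"
```

For this sample the output is "disk 2" and "tape 1", one class per line.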


2–57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay

The vgdisplay command displays information about volume groups. If a specific vg_name is specified, information for just that volume group is displayed. The pvdisplay command displays information about specific physical volumes (disks) within an LVM volume group. The lvdisplay command displays information about specific logical volumes within an LVM volume group.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: LVM header structures and /etc/lvmtab

Type of Data: LVM configuration

Metrics: mirroring, striping, other I/O policies

Logging: standard output device

Overhead: minimal

Unique Feature: shows LVM configuration information

Full Pathname: /usr/sbin/vgdisplay, /usr/sbin/pvdisplay, /usr/sbin/lvdisplay

Pros and Cons: + Only commands for viewing LVM configurations
- Minimal tuning capabilities

Syntax

/sbin/vgdisplay [-v] [vg_name ...] /sbin/lvdisplay [-k] [-v] lv_path ... /sbin/pvdisplay [-v] [-b BlockList] pv_path ...

Examples

# vgdisplay
--- Volume groups ---
VG Name                     /dev/vg00
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      9
Max PV                      16
Cur PV                      2
Max PE per PV               1016
VGDA                        4
PE Size (Mbytes)            4
Total PE                    726
Alloc PE                    279


Free PE                     447
Total PVG                   0

# pvdisplay /dev/dsk/c0t5d0
--- Physical volumes ---
PV Name                     /dev/dsk/c0t5d0
VG Name                     /dev/vg00
PV Status                   available
Allocatable                 yes
VGDA                        2
Cur LV                      7
PE Size (Mbytes)            4
Total PE                    249
Free PE                     0
Allocated PE                249
Stale PE                    0
IO Timeout                  default

# lvdisplay /dev/vg00/lvol1
--- Logical volumes ---
LV Name                     /dev/vg00/lvol1
VG Name                     /dev/vg00
LV Permission               read/write
LV Status                   available/syncd
Mirror copies               0
Consistency Recovery        MWC
Schedule                    parallel
LV Size (Mbytes)            48
Current LE                  12
Allocated PE                12
Stripes                     0
Stripe Size (Kbytes)        0
Bad block                   off
Allocation                  strict/contiguous
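A useful bit of arithmetic behind the vgdisplay numbers: free space in a volume group is the free extent count multiplied by the physical extent size. Using the vg00 figures from the example above (PE Size 4 MB, 447 free extents):

```shell
# Free space in a volume group = Free PE x PE Size (Mbytes).
pe_size_mb=4
free_pe=447
echo "$((free_pe * pe_size_mb)) MB free in vg00"   # prints: 1788 MB free in vg00
```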


2–58. TEXT PAGE: swapinfo

The swapinfo command prints information about device and file-system paging space. This information includes reserved space as well as used swap space.

NOTE: The term swap refers to an obsolete implementation of virtual memory; HP-UX actually implements virtual memory by way of paging rather than swapping. This command and others retain names derived from "swap" for historical reasons.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: kernel swap tables

Type of Data: swap space

Metrics: swap used, swap reserved, swap space configurations

Logging: standard output device

Overhead: minimal

Unique Feature: Command can total all configured swap space into a one-line summary.

Displays pseudoswap information (if configured).

Full Pathname: /usr/sbin/swapinfo

Pros and Cons: + provides valuable swap space configuration information
- minimal documentation on pseudo-swap

Syntax

/usr/sbin/swapinfo [-mtadfnrMqw]

Examples

# swapinfo -t
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      159744   19868  139876   12%       0       -   1  /dev/vg00/lvol2
reserve       -   51220  -51220
memory    42112   15300   26812   36%
total    201856   86388  115468   43%       -       0   -
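The PCT USED column is simply used space over available space, rounded to a whole percent. Reproducing the arithmetic for the "total" row above (86388 KB used of 201856 KB available):

```shell
# Reproduce swapinfo's PCT USED figure for the "total" row:
# integer percentage, rounded to nearest whole percent.
avail_kb=201856
used_kb=86388
pct=$(( (used_kb * 100 + avail_kb / 2) / avail_kb ))
echo "${pct}% used"   # prints: 43% used
```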


2–59. TEXT PAGE: sysdef

The sysdef command analyzes the currently running system and reports on its tunable configuration parameters.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: /stand/vmunix and the currently running kernel

Type of Data: Tunable kernel parameters

Metrics: Current configuration of kernel parameters

Logging: Standard output device

Overhead: Minimal

Unique Feature: Shows current value and possible range of values

Full Pathname: /usr/sbin/sysdef

Pros and Cons: + Shows current setting of kernel parameters
- Reboot required to change most parameters

Syntax

/usr/sbin/sysdef [kernel [master]]

Example

# /usr/sbin/sysdef
NAME                 VALUE    BOOT  MIN-MAX        UNITS   FLAGS
acctresume           4        -     -100-100               -
acctsuspend          2        -     -100-100               -
allocate_fs_swapmap  0        -     -                      -
bufpages             2841     -     0-             Pages   -
create_fastlinks     0        -     -                      -
dbc_max_pct          50       -     -                      -
dbc_min_pct          5        -     -                      -
default_disk_ir      1        -     -                      -
dskless_node         0        -     0-1                    -
eisa_io_estimate     768      -     -                      -
eqmemsize            15       -     -                      -
file_pad             10       -     0-                     -
fs_async             0        -     0-1                    -
hpux_aes_override    0        -     -                      -
maxdsiz              16384    -     256-655360     Pages   -
maxfiles             60       -     30-2048                -
maxfiles_lim         1024     -     30-2048                -
maxssiz              2048     -     256-655360     Pages   -
maxswapchunks        256      -     1-16384                -
maxtsiz              16384    -     256-655360     Pages   -
maxuprc              75       -     3-                     -
maxvgs               10       -     -                      -
msgmap               2555904  -     3-                     -
nbuf                 4788     -     0-                     -
ncallout             292      -     6-                     -


ncdnode              150      -     -                      -
ndilbuffers          30       -     1-                     -
netisr_priority      -1       -     -1-127                 -
netmemmax            5378048  -     -                      -
nfile                800      -     14-                    -
nflocks              200      -     2-                     -
ninode               476      -     14-                    -
no_lvm_disks         0        -     -                      -
nproc                276      -     10-                    -
npty                 60       -     1-                     -
nstrpty              60       -     -                      -
nswapdev             10       -     1-25                   -
nswapfs              10       -     1-25                   -
public_shlibs        1        -     -                      -
remote_nfs_swap      0        -     -                      -
rtsched_numpri       32       -     -                      -
sema                 0        -     0-1                    -
semmap               4128768  -     4-                     -
shmem                0        -     0-1                    -
shmmni               200      -     3-1024                 -
streampipes          0        -     0-                     -
swapmem_on           1        -     -                      -
swchunk              2048     -     2048-16384     kBytes  -
timeslice            10       -     -1-2147483648  Ticks   -
unlockable_mem       801      -     0-             Pages   -

Name  - The name of the parameter
Value - The current value of the parameter
Boot  - The value of the parameter at boot time
Min   - The minimum allowed value of the parameter
Max   - The maximum allowed value of the parameter
Units - The units by which the parameter is measured
Flags - Further describe the parameter
        M  Parameter may be modified without rebooting

A comparable command, introduced at HP-UX 11.00, is kmtune(1m).


2–60. TEXT PAGE: kmtune, kcweb

The kmtune command is used to query, set, or reset system parameters. kmtune displays the value of all system parameters when used without any options or with the -S or -l option. kmtune reads the master files and the system description files of the kernel and kernel modules. On 11i v2, kmtune is a front end to, and will eventually be replaced entirely by, kctune. kctune is part of a new, larger utility called kcweb.

Tool Source: HP

Documentation: man pages

Interval: on demand

Data Source: /stand/vmunix and the currently running kernel

Type of Data: Tunable kernel parameters

Metrics: Current configuration of kernel parameters

Logging: Standard output device

Overhead: Minimal

Unique Feature: Works with dynamic and static kernel modules

Full Pathname: /usr/sbin/kmtune

Syntax

/usr/sbin/kmtune [-l] [[-q name] . . ] [-S system file]
/usr/sbin/kmtune [[-s {+|=}value] . . ] [[-r name] . . ] [-S system file]

Examples

# /usr/sbin/kmtune
Parameter            Value
===================================================================
NSTRBLKSCHED         2
NSTREVENT            50
NSTRPUSH             16
NSTRSCHED            0
. . .

# /usr/sbin/kmtune -l -q maxdsiz
Parameter:  maxdsiz
Value:      0x04000000
Default:    0x04000000
Minimum:    -
Module:     -
Version:    -  (11i only)
Dynamic:    -  (11i only)


2–61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX)

Student Notes

This slide shows the standard UNIX application profiling performance tools included with HP-UX. Application profiling tools provide in-depth details regarding the execution of a program, including the number of times each subroutine is called and the amount of time spent in each subroutine.

Application Profiling and Monitoring Tools (Standard UNIX)

Resource  Description                                                          Super User Access Required
prof      Application Profiler                                                 No
gprof     Enhanced Application Profiler                                        No
arm       Define and measure response time of transactions for an application  No


2–62. TEXT PAGE: prof, gprof

The prof and gprof tools are used to ascertain the library routines being called during the execution of a program. The prof utility profiles the execution of an application by displaying the names of the routines being called, the number of times the different routines were called, and how much time was spent in each routine. The gprof utility is an enhanced version of prof. It shows all the information available with prof, plus it displays a call graph tree, which details the call hierarchy of the routines. The call graph tree shows which parent routines called which child routines.

Tool Source: standard UNIX (System V)

Documentation: man pages

Interval: on demand

Data Source: kernel routines called by the application

Type of Data: function call flow

Metrics: time spent in each function, number of times function was called

Logging: binary file mon.out

Overhead: significant delays in the execution of the application

Unique Feature: shows the flow of the function calls

Full Pathname: /usr/bin/prof

Pros and Cons: + shows where an application is spending its time
- requires access to source code
- requires application to be recompiled

Syntax

prof [-tcan] [-ox] [-g] [-z] [-h] [-s] [-m mdata] [prog] gprof [options] [a.out [gmon.out...]]

Examples

cc -p prog.c -o program
./program
prof program

cc -G prog.c -o program
./program
gprof program


2–63. TEXT PAGE: Application Response Measurement (ARM) Library Routines

Description

The ARM library routines allow you to define and measure the response time of transactions in any application written in a programming language that can call a 'C' function. The ARM library is named "libarm" and is provided in two versions, an archive version and a shared library version. It is strongly recommended that you use the shared (sometimes referred to as dynamic) library version. In-depth discussion of this product is beyond the scope of this class.

NOTE: arm is a cross-platform tool and functionally replaces the ttd discussed in the next section. glance and gpm work equally well with either arm or ttd.

Documentation: man 3 arm

Interval: configurable

Platforms supported: HP-UX, IBM AIX, Sun Solaris, NCR

Pros and Cons: + Integrates with PerfView/MWA and other distributed management/monitoring tools.
- Requires source code modification

Syntax: The six function calls used by arm are:

arm_init Return a unique ID based on application and user.

arm_getid Return a unique ID based on a transaction name.

arm_start Mark the beginning of a specific transaction.

arm_update Provide information or show progress of a specific transaction.

arm_stop Mark the end of a specific transaction.

arm_end Mark the end of an application.


2–64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific)

Student Notes

This slide shows some HP-specific application profiling tools included with HP-UX. Currently, the Transaction Tracker (ttd) and caliper are available for monitoring application behavior and performance. In 10.20, there was a tool called puma, which came with all standard programming language compilers (such as C, Pascal, and Fortran). The puma tool allowed profiling data to be collected without modifying the application source code or, in many cases, recompiling the application. puma has been excluded from the more recent releases of HP-UX. The Transaction Tracker allows a programmer to time how long a program spends within a certain area of code. It requires that the source code be modified to include the starting point and the stopping point. The Transaction Tracker is included as part of the MeasureWare/OVPA product and is HP-UX specific; arm (discussed earlier) is the generic version of the Transaction Tracker. caliper is thread-aware, MP-aware, and features an easy command-line interface.

Application Profiling and Monitoring Tools (HP-Specific)

Tool      Super User Access Required   Description
caliper   No                           A runtime performance analyzer for programs compiled with C, C++ and Fortran 90 compilers on Itanium systems.
ttd       No                           Tracks how much time is spent between specific lines of code in a program.


2–65. TEXT PAGE: Transaction Tracker

Description

The Transaction Tracker is a set of function calls that allow a programmer to time the execution of a particular body of code (referred to as a "transaction"). The function calls are inserted into the source code to mark where a particular transaction begins and ends. Glance and gpm can then be used to monitor how many times the transaction is called, and how long it takes for the transaction to complete.

Tool Source: HP

Documentation: MeasureWare Users manual

Interval: Every time Transaction Tracker function call is invoked within program

Data Source: The ttd process

Type of Data: Application execution times

Metrics: Times to one hundredth of a second

Logging: Binary file /var/opt/perf/datafiles/logtrans

Overhead: Medium to large, depending on number of transactions being timed.

Unique Feature: Shows the amount of time spent in a particular body of code

Full Pathname: Function calls defined in /opt/perf/include/tt.h

Pros and Cons: + Integrated with glance and gpm; makes it easy to monitor how long transactions take.

- Cannot be used within shell programs; C programs only (or programs which can call C routines).

Syntax

The four function calls used by Transaction Tracker are:

tt_getid    Names the transaction and returns a unique identifier.
tt_start    Signals the start of a unique transaction.
tt_end      Signals the end of the transaction.
tt_abort    Ends the transaction without recording times for the transaction.
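In outline, an instrumented program follows the pattern below. This is C-style pseudocode only; the exact prototypes and argument lists are declared in /opt/perf/include/tt.h, and the simplified calls and names shown here (db_lookup, do_db_lookup) are illustrative:

```
/* Illustrative pseudocode -- see /opt/perf/include/tt.h for real prototypes */

tid = tt_getid("db_lookup", ...);   /* name the transaction once, keep the id */

tt_start(tid);                      /* mark the beginning of one timed instance */
status = do_db_lookup();            /* the body of code being timed */
if (status == OK)
    tt_end(tid);                    /* record elapsed time for this instance */
else
    tt_abort(tid);                  /* discard timing for the failed instance */
```

Once the program runs with ttd active, the transaction appears in the glance and gpm transaction screens under the name given to tt_getid.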


2–66. TEXT PAGE: caliper — HP Performance Analyzer

Description

HP Caliper is a general-purpose performance analysis tool for applications on Itanium®-based HP-UX systems. HP Caliper allows you to understand the performance of your application and to identify ways to improve its run-time performance. HP Caliper works with any Itanium-based binary and does not require your applications to have any special preparation to enable performance measurement. The two primary ways to use HP Caliper are:

• As a performance analysis tool.

• As a profile based optimization (PBO) tool invoked by HP compilers.

The latest version of HP Caliper is available on the HP Caliper home page. You can find it at the http://www.hp.com/go/hpcaliper/ site.

Overview

HP Caliper helps you dynamically measure and improve the performance of your native Itanium-based applications in three ways:

• Commands to measure the overall performance of your program.

• Commands to drill down to identify performance parameters of specific functions in your program.

• A simple way to optimize the performance of your program based on its specific execution profile.

HP Caliper does not require special compilation of the program being analyzed and does not require any special link options or libraries. HP Caliper selectively measures the processes, threads, and load modules of your application. An application's load modules are the main executable and all shared libraries it uses. HP Caliper uses a combination of dynamic instrumentation of code and the performance monitoring unit (PMU) in the Itanium processor. HP Caliper uses the least-intrusive method available to gather performance data.

Supported Target Programs

HP Caliper includes support for:

• Programs compiled for Itanium- and Itanium 2-based systems. HP Caliper does not measure programs compiled for PA-RISC processors.

• Code generated by native and cross HP aC++, C++ and Fortran compilers, including inlined functions and C++ exceptions.

• Programs compiled with optimization or debug information, or both. This includes support for both the +objdebug and +noobjdebug options.


• Both ILP32 (+DD32) and LP64 (+DD64) programs, both 32-bit and 64-bit ELF formats.

• Archive-, minshared- or shared-bound executables.

• Both single- and multi-threaded applications, including MxN threads.

• Applications that fork() or vfork() or exec() themselves or other executables.

• Shell scripts and the programs they spawn.

Features

HP Caliper is simple to run because it uses a single command for all measurements. You specify the type of measurement and the target program as command-line arguments. For example, to measure the total number of CPU cycles used by a program named myprog, just type:

caliper total_cpu myprog

HP Caliper features include:

• Multiple performance measurements, each of which can be customized through configuration files.

• All reports are available in text format, comma-delimited (CSV) format, and most reports are also available in HTML format for easier browsing.

• Performance data can be correlated to your source program by line number.

• Easy inclusion and exclusion of specific load modules, such as libc, when measuring performance.

• Both per-thread and aggregated thread reports for most measurements.

• Performance data reported by function, sorted to show hot spots.

• Support for multi-process selection capabilities.

• The ability to save performance data in files that you can use to aggregate data across multiple runs to generate reports without having to re-run HP Caliper.

• The ability to attach and detach to running processes for certain measurements.

• The ability to restrict PMU measurements to specific regions of your programs.

• Limited support for dynamically generated code.


2–67. SLIDE: Summary

Student Notes

To summarize this module, there are many performance tools for many different purposes. The objective of this module was to highlight the performance tools available with HP-UX, to categorize them by function, and to describe how each tool works. In general, you should become most familiar with these tools:

sar
vmstat
top
glance/gpm (if available)

These will tend to be your most commonly used tools; other tools tend to be useful in more specialized situations. Remember, never try to rely on just one tool to do everything. No tool will tell you everything, and every tool will mislead you somewhere down the line. No tool is perfect. That is why you need to be familiar with multiple tools.

Summary

• Different categories of performance tools
• Standard UNIX tools versus HP-specific tools
• Separately purchasable tools
• Kernel register-based tools versus midaemon-based tools


2–68. LAB: Performance Tools Lab

Student Notes The goal of this lab is to gain familiarity with performance tools. A secondary goal is to get familiar with the metrics reported by the tools, although they will be explored in depth during the next days.

Directions

Set up: Change directories to:

# cd /home/h4262/tools

Execute the setup script:

# ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN which will give the requested information.

Specific numbers are not the important goal of this lab. The goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion.

Lab

Before we continue with a more focused discussion of glance and gpm, let's spend some time exploring the generic UNIX and HP-UX-specific tools discussed so far.

As you answer the following questions, try to categorize each tool as to its type and scope.


1. How many processes are running on the system? Which tools can you use to determine this?

2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?

4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?

5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?
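As a starting point for questions 1 and 5, the standard UNIX tools alone go a long way. A minimal sketch using portable flags (glance, top, and sar can of course report the same information):

```shell
# Question 1: count the processes on the system (strip the ps header line)
ps -e | tail -n +2 | wc -l

# Question 5: uptime reports the load averages; the run queue length
# also appears in the "r" column of vmstat and in top's header
uptime
```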


6. How many system processes are running? What tools can you use to determine this?

NOTE A system process is defined as a process whose data space is the kernel's data space (such as swapper, vhand, statdaemon, unhashdaemon, and supsched). ps reports their size as zero.

There are three ways this can be determined. If you get stuck on this question, move on. Don't spend more than a few minutes trying to answer this question.
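One of the three approaches needs nothing but ps and awk, keying on the note above that ps reports a size of zero for system processes. The field position below assumes SZ is the tenth column of ps -el output; verify the column layout on your own system:

```shell
# Count processes whose SZ column (assumed to be field 10 of ps -el) is 0
ps -el | awk 'NR > 1 && $10 == 0 { count++ } END { print count + 0 }'
```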

7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?

8. What is the size of memory?

What is the size of free memory? What tools can you use to determine this?

9. What is the size of the swap area(s)?

What is the percentage of swap utilization? What tools can you use to determine this?


10. What is the size of the kernel’s incore inode table? How much of the inode table is utilized? What tools can you use to determine this?

11. Are there any CPU-bound processes running (processes using a "lot" of CPU)? If so, what is the name of the process? What steps did you take to determine this?

12. Are there any processes running which are using a "lot" of memory? (A "lot" is relative, i.e., a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?

13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk-bound processes? What files are open by this (these) process(es)?

NOTE: No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.


14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?

15. Is there any paging or swapping occurring? What tools can you use to determine this?

16. What is the system call rate? What tools can you use to determine this?

17. What is the buffer cache hit ratio? What tools can you use to determine this?

18. What is the tty I/O rate? What tools can you use to determine this?

19. Are there any traps (interrupts) occurring? What tools can you use to determine this?


20. What information can you collect about network traffic? What tools can you use to determine this?

21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?

22. What information can be gathered on Logical Volumes? What tools can you use to determine this?

23. What information can be gathered on Disk I/O? What tools can you use to determine this?

24. Shut down the simulation by entering:

# ./KILLIT


Module 3 GlancePlus

Objectives

Upon completion of this module, you will be able to do the following:

• Compare GlancePlus with other performance monitoring/management tools.

• Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).


3-1. SLIDE: This Is GlancePlus

Student Notes

GlancePlus is a performance monitoring diagnostic tool. GlancePlus software visually gives you the useful, accurate information you need to pinpoint potential or existing problems involving your system's CPU, memory, disk, or network utilization.

To help you monitor and interpret your system's performance data, GlancePlus software includes a rules-based adviser. Whenever threshold levels for measurements such as CPU utilization or disk I/O rates are exceeded, the adviser notifies you with on-screen alarms. The adviser also applies rules to key performance measurements and symptoms and then gives you information to help you uncover bottlenecks or other performance problems.

NOTE: GlancePlus is integrated into OpenView Windows at the menu bar level.

This is GlancePlus

Features

• Motif-based interface that offers exceptional ease-of-learning and ease-of-use.
• State-of-the-art, award-winning on-line Help system.
• Rules-based diagnostics that use customizable system performance rules to identify system performance problems and bottlenecks.
• Alarms that are triggered when customizable system performance thresholds are exceeded.
• Tailor information gathering and display to suit your needs.
• Integrated into OpenView environments.

Capabilities

• Get detailed views of CPU, disk, and memory resource activity.
• View disk I/O rates and queue lengths by disk device to determine if your disk loads are well balanced.
• Monitor virtual memory I/O and paging.
• Measure NFS activity.
• And much more ...


GlancePlus offers a viewpoint into many of the critical resources that need to be measured in the open system environment.

Benefits

• Save time and effort managing your system resources

• Better understand your computing environment

• Satisfy your end users’ system performance needs quickly

• Leverage from a standard interface across vendor platforms

The features in the product yield a performance monitoring diagnostic solution that offers many benefits to the user. GlancePlus offers a tool that will make your analysis activities easier and quicker to perform. This will save you time. The display of various types of information will also allow you to get a better understanding of your own environment. The same GUI on the Motif version is used on all the supported platforms, which provides a leverage point for a standard user interface across several UNIX platforms. Many times, just by cursory use of the product, people will discover certain things about their systems. You do not have to have a performance problem to use GlancePlus. This simple cursory use of the product has let many people gain a better understanding of their systems. This helps out when a problem does exist. Knowing what is normal can help identify what has become abnormal in your environment.


3-2. SLIDE: GlancePlus Pak Overview

Student Notes The view here is from the heights. For our purposes, we will focus our discussion on the capabilities of glance and gpm and the information and reports they can produce from a running HP-UX system. Also understand that GlancePlus may be used in conjunction with MeasureWare/OVPA to enhance and extend its capabilities. Many of you may have purchased glance in the GlancePlus Pak, which includes a license to run glance, gpm and to configure and run the MeasureWare/OVPA Agent (mwa) on your system. The GlancePlus and MeasureWare/OVPA Agent products can be purchased separately or combined in the GlancePlus Pak. The Pak also includes (as of C.03.58.00 June 2002 application release) some event monitoring and graphical configuration components.

GlancePlus Pak Overview

Performance data collection and alarming

Online performance monitoring and diagnostic

Performance analysis and correlation

Forecasting and capacity planning

Central alarm monitoring and event management

GlancePlus

(Diagram: a central management system runs the PerfView components (PerfView Planner, PerfView Analyzer, and PerfView Monitor); each managed node runs MeasureWare, which collects data on networks, systems, Internet applications, and databases.)


The components share a common measurement infrastructure; thus metrics, as well as applications, have similar alarming mechanisms.

Complete information on the configuration and use of MWA/OVPA and PerfView/OVPM are fully covered in the Hewlett-Packard Education Services' course:

PerfView MeasureWare (catalog number B5136).

GlancePlus Pak

GlancePlus
Interfaces include:
/opt/perf/bin/gpm
/opt/perf/bin/glance

MeasureWare/OVPA
Interfaces include:
/opt/perf/bin/extract
/opt/perf/bin/utility

PerfView/OVPM
Interfaces include:
/opt/perf/bin/pv


3-3. SLIDE: gpm and glance

Student Notes

GlancePlus provides dual user interfaces: the gpm GUI and the glance character mode.

The gpm GUI:
• See history of activity of the system with multiple window capability
• Monitor your system while doing other work
• Use alarms, symptoms and color to assist with monitoring

The glance character mode:
• Monitor performance remotely over a slow datacom line
• Usable when no high resolution monitor is available
• Creates less load on the system being monitored

gpm and glance


Notes on starting the user interfaces: gpm and glance

Starting the GUI:

# gpm [options]

-nosave      Do not save the current configuration at the next exit
-rpt         Specify one or more additional report windows
-sharedclr   Share color scheme with other applications
-nice        Set the gpm nice value
Xoptions     Use X-Toolkit options such as -display

Starting the character-based interface:

# glance [options]

-j interval  Preset the number of seconds between screen refreshes
-p dest      Specify the continuous print option destination
-lock        Allows glance to lock itself into memory
-nice        Set the glance nice value


3-4. SLIDE: glance — The Character Mode Interface

Student Notes With glance you can run on almost any terminal or workstation, over a serial interface and relatively slow data communication links, and with lower resource requirements. The default Process List screen is shown in the above screen capture, and provides general data on system resources and active processes. In addition, the user may “drill down” to more specific levels of detail in areas of CPU, memory, disk I/O, network, NFS system calls, swap, and system table screens. Specific details on a per-process level are also available through the individual process screens. For your convenience, the next two pages contain a hot key quick reference guide for the glance character mode interface.

glance — The Character Mode Interface


Glance Hot Key Quick Reference

Top Level Screen Hot Keys

Hot Key   Screen Displayed/Description
a         CPU By Processor
c         CPU Report
d         Disk Report
g         Process List
i         I/O By File System
l         Network By Interface
m         Memory Report
n         NFS By System
t         System Tables Report
u         I/O By Disk
v         I/O By Logical Volume
w         Swap Space
A         Application List
B         Global Waits
D         DCE Global Activity
G         Process Threads
H         Alarm History
I         Thread Resources
J         Thread Wait
K         DCE Process List
N         NFS Global Activity
P         PRM Group List
T         Transaction Tracker
Y         Global System Calls
Z         Global Threads
?         Commands Menu


Miscellaneous Screen Hot Keys

Hot Key   Screen Displayed/Description
b         Scroll page backward
f         Scroll page forward
h         Online HELP
j         Adjust refresh interval
o         Adjust process threshold
p         Print toggle (start|stop auto-printing)
e/q       Quit GlancePlus
r         Refresh the current screen
y         Renice a process
z         Reset statistics to zero
>         Display next logical screen
<         Display previous logical screen
!         Invoke a shell

Secondary Level Screen Hot Keys

Hot Key   Screen Displayed/Description
S         Select a NFS system/Disk/Application/Trans/Thread
s         Select a single process
F         Process Open Files
L         Process System Calls
M         Process Memory Regions
R         Process Resources
W         Process Wait States


3-5. SLIDE: Looking at a glance Screen

Student Notes Above is an example of an easy and common performance problem — a runaway looping process. Why is the global CPU utilization < 100%, although the sum of the individual process CPU utilizations > 100 %? Hint: Is this a UP or MP system? Also note that / (slashes) are used in glance reports to separate current metric values from cumulative averages. NOTE: For the record there were two CPUs on this system.

Looking at a glance Screen


On a three-way multiprocessor system with two processes in the same application looping, each process can use nearly 100% of a CPU. Over a 10-second interval, each uses nearly 10 seconds of CPU time, so the application used nearly 20 seconds of CPU time in 10 seconds of elapsed time. Process CPU utilization is 100% for each of the two looping processes, but global CPU utilization would be 66%.

On HP-UX 11.0, processes can have multiple threads, each of which can consume CPU time independently of the others. On a four-way MP system, with one process that has three threads looping, the process as a whole uses 300% of the CPU. The application and global CPU utilization would report the CPU utilization at 75%.


3-6. SLIDE: gpm — The Graphical User Interface

Student Notes gpm presents the same metrics as character-mode glance in graphical form. Significant global metrics, as well as bottleneck adviser symptom status and alarms are shown in the main window. The process list, as well as other reports, is available via menu selections. The process list is very customizable (and customizations are preserved) with filters, sorting, highlights, chosen metrics, and column rearrangement. The online User’s Guide is very useful. The ? button on every window is a shortcut into the on-item help, which is useful especially for metric definitions.

gpm — The Graphical User Interface


This is another screen shot of the gpm interface.

Note the icon reflecting an adviser alarm.


3-7. SLIDE: Process Information

Student Notes The Process Information screen in gpm presents the user with detailed information on each active process (including CPU utilization, disk I/O data, memory usage, wait state reasons, open() file information, and so on). This screen also allows the user to select a specific process and "drill down" to greater detail via the Reports selection menu.

Resource Diagnostic Monitoring

GlancePlus provides an abundant set of performance metrics to help analyze the current system. Careful thought and consideration have been given to ensure that the proper metrics are displayed. The product with its Motif GUI offers a way to efficiently display performance information, without overloading the customer with screen after screen of detailed data.

Process Information

Process Information
• Detailed data on each active process:
  – CPU data
  – Disk I/O data
  – Memory use
  – Wait reasons
  – Open files

Process Features
• Access via the Main Reports selection Process List
• Each process has:
  – Process Resources
  – Open Files


Customizable GUI

GlancePlus uses the power of Motif and its industry-leading approach to display technology to provide the user with a powerful graphical user interface that can be customized to fit your needs. Fonts, color, window size and more are configuration options. Additional configuration choices are available in "list" windows to allow easy manipulation of tabular column data for display and sort uses.

The gpm Process List and GlancePlus - Main screen provide a pull-down menu to access the numerous, detailed Report screens. These reports allow a logical approach to the extensive amount of system resource and process-specific data.

▼ Resource History Window
▼ CPU Info
▼ Memory Info
▼ Disk Info
▼ Network Info
▼ System Info
▼ Global Info
▼ Swap Space
▼ Wait States
▼ Transaction Tracking
▼ Application List
▼ PRM Group List
▼ Process List
▼ Thread List

Next Level contains additional graphs and tables


3-8. SLIDE: Adviser Components

Student Notes GlancePlus supports performance alarms and a rules-based adviser to help automate the interpretation of performance data. The alarm rules can be customized by the user to reflect local system characteristics. Note: Both interfaces will report alarms, and the same syntax is used for alarms in glance and gpm. Alarms are configured through the /var/opt/perf/advisor.syntax file.

Adviser Components

Adviser Windows
– Symptom History
– Symptom Status/Snapshot
– Alarm History
– Adviser Syntax

Button Label Colors
– Alarm Button for Alarm Statements
– Graph Buttons for Symptom Statements

Icon Border Color (in OpenView)
– Changes to Red or Yellow on Alarms


3-9. SLIDE: adviser Bottleneck Syntax Example

Student Notes The bottleneck alarms are a little complex. The CPU bottleneck symptom definition and corresponding alarm is shown. Just because a resource is fully utilized doesn’t mean that it is a bottleneck. It is only a bottleneck if there is activity that is hindered waiting for that resource. Therefore, utilization alone is not a good bottleneck indicator. Both utilization and queue lengths are combined to define the symptom probability. Some of the key metrics for performance analysis are the ones we use in the default syntax to define bottleneck alarms.

adviser Bottleneck Syntax Example

# The following symptoms are used by the default Alarm Window
# Bottleneck alarms. They are re-evaluated every interval and
# the probabilities are summed. These summed probabilities are
# checked by the bottleneck alarms. The buttons on the gpm
# main window will turn yellow when a probability exceeds 50%
# for an interval, and red when a probability exceeds 90% for
# an interval. You may edit these rules to suit your environment:

symptom CPU_Bottleneck type=CPU
rule GBL_CPU_TOTAL_UTIL > 75 prob 25
rule GBL_CPU_TOTAL_UTIL > 85 prob 25
rule GBL_CPU_TOTAL_UTIL > 90 prob 25
rule GBL_PRI_QUEUE > 3 prob 25

alarm CPU_Bottleneck > 50 for 2 minutes
start
  if CPU_Bottleneck > 90 then
    red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  else
    yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
repeat every 10 minutes
  if CPU_Bottleneck > 90 then
    red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  else
    yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
end
reset alert "End of CPU Bottleneck Alert"
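In the simpler case, an alarm can also be written directly on a metric without first defining a symptom. The threshold, duration, and message text below are illustrative sketches following the same adviser syntax, not part of the shipped defaults:

```
# Hypothetical standalone alarm written directly on a metric
alarm GBL_CPU_TOTAL_UTIL > 95 for 5 minutes
start
  yellow alert "CPU utilization has exceeded 95%"
repeat every 10 minutes
  yellow alert "CPU utilization is still above 95%"
end
reset alert "CPU utilization has dropped below 95%"
```

As the comment block in the default syntax notes, rules like this can be edited in /var/opt/perf/advisor.syntax to suit your environment.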


3-10. SLIDE: The parm File

Student Notes

By now you are starting to see the range and scope of the performance metric data that glance and gpm display. While this is invaluable when it comes to understanding the behavior of a single process, many times what we really need is to evaluate and baseline the performance of an entire application suite. This could be achieved by adding up the individual metrics of all processes within the application suite, but this could be a daunting task for all but the simplest of applications.

Through the use of the configuration file /var/opt/perf/parm, glance and gpm can help to collect all the metrics from the individual processes within an application suite and present the information in a concise manner for your review.

One challenge is in the definition of what constitutes an application. To address this issue, the parm file has several different methods for describing which processes belong to which application definition. Application member processes can be defined by their UID, the front-store file from which they were fork()'d, the priority at which they execute, their GID, or any combination of the above. This provides a very versatile framework for application profiling.

The parm File

parm file application definitions are used by both GlancePlus and MeasureWare. A .parm in a user's $HOME directory will override the system parm file.

The application = keyword and its associated parameters define the logical groupings used to define each application on the machine.

Examples:

application=Real Time
priority=0-127

application=Prog Dev Group 1
file=vi,xdb,abb,ld,lint
user=bill,debbie

application=Prog Dev Group 2
file=vi,xdb,abb,ld,lint
user=ted,rebecc,test*

application=Compilers
file=cc,ccom,pc,pascomp

The general form of a definition is:

application =
user =
file =
priority =
group =


NOTE: glance and gpm share the same application definitions (via the parm configuration file) as mwa.

The order in which applications are defined is very important. Once a process meets the definition of an application, its data is contributed to that application's metrics and to no other. Take care to avoid ambiguity in your application definitions.
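Because matching is first-fit, put the more specific definition first. A minimal sketch (hypothetical application names and users, following the parm keywords shown on the slide):

```
# First-fit matching: the narrower definition must come first, or its
# processes would be claimed by the broader definition below it.
application = DBA_Oracle
file = oracle
user = dba1,dba2

application = Oracle
file = oracle
```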

# /var/opt/perf/parm for host system "garat"
id = garat

# Parameters for what data classes scopeux will log:
log global application process dev=disk,lvm transaction

# Parameters to control maximum size of scopeux logfiles:
size global=10, application=5, process=2, device=1, transaction=1.5

# Thresholds which determine what process data scopeux will log:
threshold cpu = 1, disk = 1, nonew, nokilled

# Web server:
application = WWW
user = www or file = httpd

# Untrustworthy users:
application = HighRisk
user = fred,barney,root


3-11. SLIDE: GlancePlus Data Flow

Student Notes

Without going into a lot of detail, note that both interfaces share a common instrumentation source and common application definitions. Instrumentation comes partly from interfaces also accessed by standard UNIX utilities such as vmstat, and partly from special HP-UX kernel instrumentation (KI) trace-based instrumentation. There is no generally available API to these interfaces; they are written specifically for use by GlancePlus and MeasureWare/OVPA.

GlancePlus Data Flow

Data flows from the HP-UX kernel (via the kernel instrumentation, KI) to the midaemon, which publishes metrics through shared memory. Both glance (terminal display and adviser output) and gpm (Motif display) read from that shared memory, and both consult the parm file (application definitions) and the adviser definitions.


Significant Directories

/opt/perf                 Product files from installation media
/opt/perf/bin             Executables
/opt/perf/ReleaseNotes    Release Notes
/opt/perf/examples        Supplementary configuration examples
/opt/perf/paperdocs       Electronic versions of documentation
/var/opt/perf             Product and configuration files created during and after installation

Always check the ReleaseNotes for version-specific information. (New for C.02.30 and later releases: example configuration files.) Config files are copied from /opt/perf/newconfig if they do not already exist under /var/opt/perf. Compare the new default parm file with the one on your system if you are updating from a previous release. The directory /var/opt/perf contains the status and data files.


3-12. SLIDE: Key GlancePlus Usage Tips

Student Notes

Key GlancePlus Usage Tips

• Use it for "What's going on right now."
• The gpm online help is very useful, especially on-item help.
• Drill down from higher-level reports to more detailed resource reports.
• Understand what the adviser is telling you.
• Sort, filter, and choose metrics in gpm, especially in the Process List.
• In character-mode glance use:
  – the ? screen to navigate
  – h for help
  – the o screen for setting thresholds and process list sorting
• Edit the adviser alarms to be right for you.
• Adjust the update interval to control CPU overhead.
• Process details, including thread lists, wait states, memory regions, open files, and system call reports, can be used to impress your programming staff! 8^)


3-13. SLIDE: Global, Application, and Process Data

Student Notes

It is important to understand the interrelationships among metric classes.

Global, Application, and Process Data

• Global metrics reflect system-wide activity (sum of all applications).

• Process metrics reflect specific per-process (including thread) activity.

• Application metrics sum activity for a set of processes. They keep track of activity for all processes, however short-lived, even if they are not reported individually.

• Glance updates all metric values at the same time. MeasureWare summarizes Global, Application, and other class data over 5-minute intervals and summarizes Process data over 1-minute intervals.

• Multiprocessor effects: Global and Application CPU percentages reflect normalization over the number of processors (percentage of availability for entire system). Process and thread-level CPU percentages are not normalized by the number of processors.
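A quick worked example of the normalization rule above, with assumed numbers: one process spinning on a single CPU of a 4-way system shows 100% at the process level but only 25% globally.

```shell
cpus=4
proc_cpu_pct=100                         # process-level CPU % (per-CPU, not normalized)
global_cpu_pct=$((proc_cpu_pct / cpus))  # global % is normalized over all CPUs
echo "process: ${proc_cpu_pct}%  global: ${global_cpu_pct}%"
```

This prints process: 100%  global: 25%, which is why a CPU-bound single-threaded job can look harmless in the global view of a large multiprocessor.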


3-14. SLIDE: Can't Solve What's Not a Problem

Student Notes

One of the hardest skills is determining what to measure and how to interpret its significance. After all, if the user's response time is satisfactory, then oftentimes there is "no problem" even if an individual metric is higher than normal.

Can’t Solve What’s Not a Problem!

• A looping process by itself is not a problem.
• Know what's "normal" for your environment.
• Keep historical performance data for reference.
• Measure response times.
• Use the tools to find out what is affecting performance.
• Isolate bottlenecks and address them when there is a problem.
• When tuning, make only one change at a time and then measure its effect.
• Document everything you do!
• Optimize your time resource: don't fix what isn't broken; sometimes more hardware is the cheapest answer; set yourself up to react quicker next time.


3-15. SLIDE: Metrics: "No Answers without Data"

Student Notes

CPU utilization and disk I/O rates compare well across different summarization intervals, whereas CPU times and I/O counts always grow as the collection interval grows.

Examples of breakdowns: the global disk I/O rate is a sum of the BYDSK_ metrics; each class in turn breaks activity down between reads and writes, and between file system, raw, and system access. For disk bottlenecks, it is often useful to correlate between the DSK, FS, and LV classes.

Memory utilization is frequently nearly 100% with a dynamic buffer cache. If page-outs occur, or in raw disk access environments, shrink the buffer cache to avoid paging.

Programmers frequently don't know that they can view specific system-call metrics, as well as memory region and open file information, on a per-process basis.

Metrics: “No Answers without Data”

• Rate and utilization metrics are more useful than counts and times, because they are independent of the collection interval.

• Cumulative metrics measure over the total duration of collection.

• Most metrics are broken down into subsets by type. Work from the top down.

• Blocked states reflect individual process or thread wait reasons. Global queue metrics are derived from process blocked states.

• CPU is a "symmetric" resource: the scheduler balances load across the processors, whereas disk and network interface activity depends on where the data is located.

• Memory utilization is not as important as paging activity and buffer cache sizing.
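The first bullet above can be illustrated with assumed numbers: raw counts grow with the collection interval, while the derived rate stays comparable.

```shell
# Hypothetical disk I/O counts collected over a 5-minute and a 1-minute interval.
ios_5min=3000; secs_5min=300
ios_1min=600;  secs_1min=60
echo "counts: $ios_5min vs $ios_1min (not comparable)"
echo "rates:  $((ios_5min / secs_5min))/s vs $((ios_1min / secs_1min))/s (comparable)"
```

Both intervals work out to the same 10 I/Os per second, even though the counts differ by a factor of five.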


3-16. SLIDE: Summary

Student Notes

Remember that performance tuning is an art, and the following two rules apply to most engagements:

Rule #1: When answering a question about computer system performance, the initial answer is always, "It depends."

Rule #2: Performance tuning always involves a trade-off.

Suggested reading: HP-UX Tuning and Performance by Robert F. Sauers and Peter S. Weygant, Hewlett-Packard Professional Books, Prentice Hall (ISBN 0-13-102716-6).

Summary

• Don’t try to understand all the capabilities and extensions to the tools, just the ones of most use to you.

• Start with developing an understanding of what is “normal” on your systems.

• Refine and develop alarms customized for your environment.

• Work from examples in documentation, gpm online help, config files, and example directories.


3-17. SLIDE: HP GlancePlus Guided Tour

Student Notes

To take the guided tour of GlancePlus, run the gpm GUI and select Help on the menu bar, then select the Guided Tour option. This introduces the product using captured "windows" of the actual product, with annotations that point out the important features of certain screens and windows.

Quick Tip: gpm provides an excellent online help system. Click the right mouse button for the On-Item Help feature. For help in glance, press the h key.

HP GlancePlus Guided Tour

Topics:
• Main Window
• CPU Bottlenecks
• Memory Bottlenecks
• Configuration Information
• Alarms and Symptoms


3-18. LAB: gpm and glance Walk-Through

Directions

The following lab is intended to familiarize the student with gpm and glance. To achieve this, the lab "walks the student through" a number of windows and tasks in both the ASCII version (glance) and the X Windows version (gpm).

The Graphical Version of GlancePlus

1. Log in. If you have not already done so, please log into the system with the user name and password provided by your instructor.

2. Start GlancePlus. From a terminal window, invoke GlancePlus by entering gpm.

# gpm

In a few seconds gpm will come up. The first thing you will see is a license notification informing you that you are starting a trial version of GlancePlus, along with ordering and technical support information. On the gpm Main screen, you will see four graphs for CPU, Memory, Disk, and Networking. By default, the graphs are in the resource history format. This means that for each interval (configurable) there will be a data point on the graph, up to the maximum number of intervals (also configurable).

3. Interval Customizations. Click on Configure in the menu bar, and select Measurement. Set the sample interval to 10 seconds and the number of graph points to 50. This will allow you to see up to 500 seconds of system history. Click on OK.

NOTE: This setting will be saved for you in your home directory in a file called $HOME/.gpmhp-system_name, so each GlancePlus user's customizations are saved individually.

Start a program from another window:

# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window. Below each graph within the GlancePlus Main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph.

To view the advisor symptoms from the main window, select:

Adviser -> Edit Adviser Syntax

This displays the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.


View CPU details:

Click the CPU button.

To view a detailed report regarding the CPU, select:

Reports -> CPU Report

Select:

Reports -> CPU by Processor

This is a useful report, even on a single processor system.

5. On Line Help. One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ? .

Click on the column heading, NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including the SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms. A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, and users can define their own.

An alarm is simply a notification that a symptom has been detected. From the main window, select:

Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window. Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details. Close all windows except for the main window. Select:

Reports -> Process List

This shows the “interesting” processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

Configure -> Choose Metrics

This will display an astonishing number of metrics, which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related


metrics available in GlancePlus. Note that the familiar ? button is also available from this window.

Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations. Most display windows can be customized to sort on any metric, and to arrange the metrics in any user-defined order. To define the sort fields, select

Configure -> Sort Fields

The sort order is determined by the order of the columns: placing a particular metric in column one makes it the first sort field. If multiple entries have the same value in that field, the second column determines the order between those entries; if further sorting is needed, the third column is used, and so forth.

To sort on cumulative CPU percentage, click on the column heading CPU % Cum. The cursor becomes a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is saved automatically, so it will still be in effect the next time processes are viewed.

In a similar fashion, the order of the columns can be arranged. To define the column order, select

Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor becomes a crosshair. Scroll the window to the location where the column is to be inserted, and click on the column at that position. Arrange the first four columns in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is saved automatically, so it will still be in effect the next time processes are viewed.

9. More Customizations. It is possible to modify the definition of interesting processes by selecting:

Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR them). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed. Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

Change Enable Filter to ON
Change Filter Relation to >=


Change Filter Value to 3.0
Change Enable Highlight to ON
Change Highlight Relation to >=
Change Highlight Value to 3.0
Change Highlight Color to any LOUD color

Reset the logic condition back to OR, then click OK. Verify that the filter took effect.

10. Administrative Capabilities. There are two administrative capabilities with GlancePlus. If working as root, processes in the Process List screen can be killed or reniced.

In the Process List window, select the proc8 process. To access the administrative tools, select:

Admin -> Renice

Use the slider to set the new nice value for this process to +19, then click OK. Note the impact on this process. Now, select the proc8 process again. Select:

Admin -> Kill

Click OK, and note that the process is no longer present.
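The same administration can be done from the shell. A sketch assuming POSIX-style renice(1) and ps(1) options (HP-UX option spellings may differ slightly, and a sleep stands in here for the proc8 lab process):

```shell
sleep 30 &                     # stand-in for the proc8 lab process
pid=$!
renice -n 19 -p "$pid"         # equivalent of Admin -> Renice with the slider at +19
ps -o nice= -p "$pid"          # confirm the new nice value
kill "$pid"                    # equivalent of Admin -> Kill
```

As in gpm, raising a process's nice value needs no special privilege, but lowering it (or killing someone else's process) requires root.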

11. Process Details. Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process.

Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and System Calls reports.

After surveying the information available through this window, close it and return to the Main window. There are many other features available in GlancePlus, with close to 1000 metrics. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Restoring the icon reopens all previously open windows.

12. Exit GlancePlus. From the Main window, select:

File -> Exit GlancePlus


13. Glance, the ASCII version. From a terminal window, which has not been resized, type glance.

NOTE: Never run glance or gpm in the background.

If you are accessing the ASCII version of glance from an X terminal window, start an hpterm window to enable the full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider; however, making it longer is frequently of no use.

# hpterm &

In the new window:

# glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen. Type g to go to the Main Process Screen. This lists all interesting processes on the system.

Retrieve online help related to this window by typing h, which brings up a help menu. Select:

Current Screen Metrics

Use the cursor keys to select

CPU Util

NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the “Page Down” key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e. From the main Help menu, select:

Screen Summaries

Use the cursor keys to select Global Bars

From this help description, explain what R, S, U, N, and A mean in the CPU Util Bar. Exit the online help Global Bar description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e. At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.


15. Modify Interesting Process Definition. From the main Process List window, (select g). View the interesting processes. What makes these processes interesting? Type o and select 1 (one) to view the process threshold screen.

Cursor down to the Sort Key field, and set the processes to sort by CPU usage. Before confirming that the other options are correct, note that any CPU usage (greater than zero) or any disk I/O will cause a process to be considered interesting. Run the KILLIT command to stop all lab loads.

16. Glance Reports. This is the free form part of the lab. Spend the rest of your lab time going through the various Glance screens and GlancePlus windows. Use the table below to produce the different performance reports.

Feel free to use this time to ask the instructor "How Do I . . .?" types of questions.

Glance                                          GlancePlus (gpm)
COMMAND   FUNCTION                              "REPORT"
*a        All CPUs Performance Stats            CPU by Processor
b         Back one screen
*c        CPU Utilization Stats                 CPU Report
*d        Disk I/O Stats                        Disk Report
e         Exit
f         Forward one screen
*g        Global Process Stats                  Process List
h         Help
*i        I/O by Filesystem                     I/O by Filesystem
j         Change update interval
*l        LAN Stats                             Network by LAN
*m        Memory Stats                          Memory Report
*n        NFS Stats                             NFS Report
o         Change Threshold Options
p         Print current screen
q         Quit
r         Redraw screen
*s        Single process information            Process List, double-click process
*t        OS Table Utilization                  System Table Report
*u        Disk Queue Length                     Disk Report, double-click disk
*v        Logical Volume Mgr Stats              I/O by Logical Volume
*w        Swap Stats                            Swap Detail
y         Renice process                        Administrative Capabilities
z         Zero all Stats
!         Shell escape
?         Help with options
<CR>      Update screen data


Module 4 Process Management

Objectives

Upon completion of this module, you will be able to do the following:

• Describe the components of a process.

• Describe how a process executes, and identify its process states.

• Describe the CPU scheduler.

• Describe a context switch and the circumstances under which context switching occurs.

• Describe, in general, the HP-UX priority queues.


4–1. SLIDE: The HP-UX Operating System

Student Notes

The main purpose of an operating system is to provide an environment in which processes can execute. This includes scheduling processes for time on the CPU, managing the memory assigned to processes, allowing processes to read data from disk, and much more. When processes execute within the HP-UX operating system, they can be in one of two modes: user mode or kernel (system) mode.

User Mode and Kernel Mode

User mode refers to instructions that do not require the assistance of the kernel program in order to execute. These include numeric calculations, string manipulations, looping constructs, and many others. In general, it is good when a process can spend the majority of its time in “user mode”, because it implies the CPU is executing instructions that are related to the process, as opposed to instructions related to the kernel. Kernel mode refers to time spent in the kernel executing instructions on behalf of the process. Processes access the kernel through system calls, often referred to as the System Call Interface. Examples include performing I/O, creating new processes, and expanding data space.

The HP-UX Operating System

The slide depicts the layered structure: at the user level, processes enter the kernel through the System Call Interface (the gateway to the kernel level). The kernel level contains the file subsystem (with its buffer cache), the character and block I/O subsystem with its device drivers, and the process control subsystem (interprocess communication, the scheduler, and memory management). Below the kernel, the Hardware Control Interface drives the hardware devices at the hardware level.


Kernel mode is also used for “background” activities, performed by the kernel on behalf of processes. Examples include page faulting the program's text or data in from disk, initializing and growing a process's data space, paging a portion of the process to swap space, performing file system reads and writes, and many other things. In general, when a process spends too much time in kernel mode, it is considered bad for performance. This is because too much time (overhead) is being spent to manage the environment in which the process executes, and not enough time on executing the actual process itself (which is user mode).

Performance Tools

Almost all performance tools that track CPU utilization distinguish between time the CPU spends in user mode and time spent in kernel mode. On a healthy system with plenty of memory, a typical ratio of user-mode to kernel-mode time is 4:1; that is, a process spends 75-80% of its execution in user mode and 20-25% in kernel mode. Another general rule of thumb: kernel-mode CPU time should not exceed 50%. When it does, it generally means too much time is being spent managing the system (for example, memory and swap space management, and context switching), and not enough executing process code.
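You can get a rough feel for the user/kernel split of a single command with the shell's times builtin (the exact output format varies slightly by shell). A CPU-bound loop like this should accumulate mostly user time:

```shell
# Run a small CPU-bound loop in a child shell, then print the CPU time the
# shell and its children consumed, split into user time and system time.
sh -c 'i=0; while [ "$i" -lt 50000 ]; do i=$((i+1)); done; times'
```

Replacing the loop with heavy I/O (for example, reading a large file) shifts the balance toward system time, for the reasons described above.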


4–2. SLIDE: Virtual Address Process Space (PA-RISC)

Student Notes

Each process views itself as starting at address 0 and ending at the maximum address addressable by 32 or 64 bits. This address space is known as the process's virtual address space: a logical addressing scheme used internally by the process to reference its instructions and data variables. Physical memory addresses cannot be used, because a program does not know where in physical memory it will be loaded; in fact, it could be loaded at different memory locations each time it executes.

The Four Quadrants (32-bit)

Each process segments its virtual address space into four quadrants, with each quadrant containing 1 GB of address space. The first quadrant is reserved for the program's instructions (also known as text). Though an address range of 1 GB is reserved for text, very rarely does the program need all these addresses. Most of the time, only a fraction (often less than 10%) of this space is needed to address the program's text.

Process Virtual Address Space (PA-RISC)

The slide shows both layouts side by side: the 32-bit address space with text, data, and two shared-object quadrants (1 GB per quadrant), and the 64-bit address space with the same components shifted among its quadrants (4 TB per quadrant).


The second quadrant holds the program's private data variables. Again, 1 GB of address space is reserved for data variables, and in general only a fraction of this space is used. Since this quadrant is limited to 1 GB of address space, a maximum global data size of approximately 900 MB is imposed. (In HP-UX, changes were made to allow the global data to use addresses in other quadrants for private data, thereby increasing its maximum size to 3.9 GB.) The third and fourth quadrants are usually used to address shared memory segments, shared text segments, shared memory-mapped files, and other shared structures, such as the System Call Interface.

64-Bit HP-UX 11.00 Update

With the introduction of HP-UX 11.00 and its 64-bit operating system, the virtual address space changes dramatically. A 32-bit process running under the 64-bit kernel is given the same space allocations as under a 32-bit kernel. For a 64-bit process, the addressable space increases to 16 TB: each quadrant is limited to 4 TB, and the capability exists to increase this address space in future releases if necessary. Notice also that the locations of the various components of the process have been shifted among the quadrants.
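The totals follow directly from the quadrant sizes quoted above:

```shell
# 32-bit: 4 quadrants of 1 GB each; 64-bit (HP-UX 11.00): 4 quadrants of 4 TB each.
echo "32-bit virtual address space: $((4 * 1)) GB"
echo "64-bit virtual address space: $((4 * 4)) TB"
```

This prints 4 GB for the 32-bit case and 16 TB for the 64-bit case, matching the figures in the notes.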


4–3. SLIDE: Virtual Address Process Space (IA-64)

Student Notes

There is no 32-bit kernel running on the IA-64 processor. The virtual address space is always 16 EB in size, although it may not all be used or allocated while a particular process is running. The space is divided into eight equal-sized octants; each octant is 2 EB in size. When executing a PA-RISC 32-bit process, the first four octants are set up just like the PA-RISC 32-bit virtual address space, using only 1 GB of each octant to simulate the four original quadrants. The last octant holds the kernel and all of its related structures.

64-Bit Processes

With a 64-bit process, the virtual address space changes dramatically. The first two octants become the equivalent of the first PA-RISC quadrant and hold shared objects. The third octant holds the text. The fourth and fifth octants are reserved for process-private data, and the sixth and seventh octants contain more shared objects. Only the last octant is laid out exactly the same for both 32-bit and 64-bit processes.

Process Virtual Address Space (IA-64)

The slide shows both layouts (1 GB used per octant in 32-bit compatibility mode, 2 EB per octant otherwise): the 32-bit view mirrors the PA-RISC quadrants (text, data, and shared objects), while the 64-bit view spreads shared objects, text, and data across the first seven octants. In both views the last octant holds the kernel.


4–4. SLIDE: Physical Process Components

Student Notes

Each process executing in memory has an entry in the kernel's process (proc) table. The proc table entry references the locations of the program's four main components: text, data, stack, and uarea. The text segment contains the program's executable code. The data segment contains the program's global data structures and variables. The stack area contains the program's local data structures and variables. The uarea is an extension of the proc table entry; in a multithreaded process, each thread has its own uarea. Other components that may or may not be associated with a process are shared libraries, shared memory segments, and memory-mapped files.

The text and initialized global data segments of the process are taken from the program file on disk. Copying the entire text and data into memory at startup would generate long startup latency, so HP-UX avoids this by demand paging the program's text and data as needed: the uninitialized global data segments and the stack area are zero-filled, and no pages of the program are loaded at startup.

Physical Process Components

The slide shows a proc table entry in the kernel's OS tables pointing to the process components in memory: text, data, stack, and uarea, plus optional memory-mapped files, shared library text, and shared memory segments.


Using this demand paging approach, the program is loaded into memory in smaller pieces (pages) on an as-needed basis. On HP-UX 10.X, one page equals 4 KB. On HP-UX 11.00, the page size is variable (meaning the program could page in units larger than 4 KB).
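On any POSIX system you can check the page size with getconf; 4096 bytes (4 KB) is still the common answer, matching the fixed HP-UX 10.X page size mentioned above:

```shell
getconf PAGE_SIZE    # bytes per page; 4096 on most systems
```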


4–5. SLIDE: The Life Cycle of a Process

Student Notes

The life cycle of a process is generalized in the slide above. When a process is born (or starts), its text must be paged in from the file system on disk (on demand) in order to be executed. (Remember, the operating system only pages in a text page when it determines that the process needs that particular page in order to execute.) In addition, space must be reserved on the swap partition for the process in the event it needs to page portions of its data area out to swap.

Once the swap space is reserved and the process is initialized, the process can begin executing on the CPU. As the process executes, it often performs actions that require it to wait. These actions include reading data from the disk or the network, waiting for a user to enter a response at a terminal window, or waiting on a shared resource (like a semaphore). Once the item the process is waiting on becomes available, the process puts itself in the CPU run queue so it can begin executing again.

This is the standard cycle a process goes through: WAIT for a resource, enter the CPU run queue when the resource is available, and execute on the CPU. Waiting on a resource is symbolized in the slide by the octagon (or stop sign), entering the CPU run queue by the triangle, and execution on the CPU by the CPU in the rectangle.

The Life Cycle of a Process

(Slide diagram: a process starts, pages its text in from the file system, and reserves swap space; it then cycles between waiting on a resource (the stop sign), sitting in the CPU run queue (the triangle), and executing on the CPU, until it ends.)


An advantage of the glance performance tool is that it displays on a per process basis (or system-wide) the various reasons why a process is blocked or waiting on the CPU.


4–6. SLIDE: Process States

Student Notes

The process table entry contains the process state. This state information is logically divided into several categories used for scheduling, identification, memory management, synchronization, and resource accounting.

There are five major process states:

SRUN    The process is running or is runnable, in kernel mode or user mode, in memory or on the swap device.

SSLEEP  The process is waiting for an event, in memory or on the swap device.

SIDL    The process is being set up via fork.

SZOMB   The process has released all system resources except for the process table entry. This is the final process state.

SSTOP   The process has been stopped by job control or by process tracing and is waiting to continue.
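These states can be observed from the shell. The sketch below uses the portable ps -o format option; the state field name and the exact state letters (R, S, Z, T, and so on) vary slightly between UNIX flavors, so treat them as an assumption about the local ps:

```shell
# Put a child to sleep, then display the state of this shell and the child.
# Typical letters: R = running/runnable (SRUN), S = sleeping (SSLEEP),
# Z = zombie (SZOMB), T = stopped (SSTOP).
sleep 30 &
ps -o pid,state,comm -p "$$,$!"
kill $!        # clean up the sleeping child
```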

Process States

(Slide diagram: the process state transitions. When fork completes, a process moves from IDLE (SIDL) to RUNNABLE (SRUN), in memory or on the swap device. A context switch moves a runnable process onto the CPU in kernel-mode or user-mode SRUN, and the process can move between kernel and user mode. Waiting on an event moves it to SLEEP (SSLEEP), in memory or on the swap device, and a wakeup when the event completes makes it RUNNABLE again. A debugger or job control stop moves it to STOP (SSTOP), and exit moves it to ZOMBIE (SZOMB).)


Most processes, except the currently executing process, are placed in one of three queues within the process table: a run queue, a sleep queue, or a deactivation queue. Processes in a runnable state (ready for CPU) are placed on a run queue, processes that are blocked awaiting an event are located on a sleep queue, and processes that are temporarily out of the scheduling mix are placed on a deactivation queue. Processes are typically deactivated only during a system memory management crisis.

Processes either terminate voluntarily through an exit system call or involuntarily as a result of a signal. In either case, process termination causes a status code to be returned to the parent of the terminating process. This termination status is returned to the parent process using a version of the wait() system call.

Within the kernel, a process terminates by calling the exit() routine. The exit() routine completes the following tasks: it cancels any pending timers, releases virtual memory resources, closes open file descriptors, and handles stopped or traced child processes. Next, the process is taken off the list of active processes and is placed on a list of zombie processes, which is finally changed to the no-process state. The exit() routine also records the termination status in the proc structure, bundles up the process's accumulated resource usage for accounting purposes, and notifies the deceased process's parent. If a process in the SZOMB state is found, the wait() system call copies the termination status from the deceased process and then reclaims the associated process structure. The process table entry is taken off the zombie list and returned to the freeproc list.

As of HP-UX 10.10, the concept of a thread was introduced into the kernel. Processes became an environment in which one or more threads could execute, and each thread became separately visible to, and manageable by, the kernel. With this change, a process is in one of the following states:

SINUSE  The process structure is being used to define one or more threads.

SIDL    The process is being set up via fork.

SZOMB   The process has released all system resources except for the process table entry. This is the final process state.

The threads took on the previous states of the process:

TSRUN   The thread is running or is runnable, in kernel mode or user mode, in memory or on the swap device.

TSSLEEP The thread is waiting for an event, in memory or on the swap device.

TSIDL   The thread is being set up via fork.

TSZOMB  The thread has released all system resources except for the thread table entry. This is the final thread state.


TSSTOP The thread has been stopped by job control or by process tracing and is waiting to continue.

The generic UNIX tools have no awareness of threads, so they continue to report process states and all other metrics from the viewpoint of the process. Only the HP-specific tools (such as glance, gpm, PerfView/OVPM, and MeasureWare/OVPA) can look at individual threads and report their metrics separately from the process. Of course, the vast majority of processes are single-threaded; in those cases, there is no practical difference between the reports of the various tools.
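The termination-status handoff described above (exit() leaving a zombie, wait() reaping it and copying the status out) can be seen from the shell, where the shell itself acts as the parent:

```shell
# Start a child that exits with status 42, then reap it with wait.
# Until the parent waits, the child's process table entry lingers in
# SZOMB; wait copies the termination status out and frees the entry.
sh -c 'exit 42' &
child=$!
wait "$child"
echo "child $child exited with status $?"    # status is 42
```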


4–7. SLIDE: CPU Scheduler

Student Notes

Once the required data is available in memory, the process waits for the CPU scheduler to assign it CPU time. CPU scheduling forms the basis for a multitasking, multiuser operating system: by switching the CPU between processes while some wait for other events, such as I/O, the operating system can function more productively.

HP-UX uses a round-robin scheduling mechanism. The CPU lets each process run for a preset maximum amount of time, called a quantum or timeslice (default: 1/10th of a second), until the process completes or is preempted to let another process run. Of course, a process can always voluntarily surrender the CPU before its timeslice expires when it realizes that it cannot continue. The CPU saves the status of the first process in a context and switches to the next process.

When a process is switched out because its timeslice expired, it drops to the bottom of the run queue to wait for its next turn. If it is preempted by a stronger priority process, it is placed back at the front of the run queue. If it voluntarily gives up the CPU, it goes onto one of the sleep queues until the resource it is waiting for becomes available; when that resource becomes available, the process moves to the end of the run queue.

CPU Scheduler

The CPU scheduler handles:

• Context switches

• Interrupts

(Slide diagram: processes A through D, with priorities 156, 220, 172, and 186, wait in memory while the kernel's CPU scheduler selects among them for the CPU.)


As a multitasking system, HP-UX requires some way of changing from process to process. It does this by interrupting the CPU to shift control to the kernel. The clock interrupt handler is the system software that processes clock interrupts. It performs several functions related to CPU usage, including gathering system and accounting statistics and signaling a context switch. System performance is affected by how rapidly and efficiently these activities occur.

Terms

CPU scheduler: Schedules processes for CPU usage.

System clock: Maintains the system timing.

Clock interrupt handler: Executes the clock interrupt code and gathers system and accounting statistics.

Context switching: Interrupts the currently running process and saves information about the process so that it can later resume running, after the interrupt, as if it had never stopped.


4–8. SLIDE: Context Switching

Student Notes

A context switch is the mechanism by which the kernel stops the execution of one process and begins execution of another. A context switch occurs under the circumstances shown on the slide.

There are two types of context switches: forced and voluntary. A forced context switch occurs when a process is made to give up the CPU before it is ready, for example when its timeslice expires or when a stronger priority process becomes runnable. A voluntary context switch occurs when the process itself gives up the CPU without using its full timeslice. This happens when the process exits, puts itself to sleep (waiting on a resource), or puts itself into a stopped state (debugging). The glance tool distinguishes between forced and voluntary context switches on a per-process basis.
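On HP-UX these per-process counters come from glance, as noted above. Purely for illustration, Linux exposes the same forced/voluntary split through /proc (this path does not exist on HP-UX; it is shown only to make the two categories concrete):

```shell
# voluntary_ctxt_switches:    the process gave up the CPU itself
#                             (slept, blocked, exited)
# nonvoluntary_ctxt_switches: the scheduler took the CPU away
#                             (timeslice expired, preemption)
grep ctxt_switches /proc/self/status
```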

Context Switching

A context switch occurs when:

• A timeslice expires, i.e. a thread accumulates 10 clock ticks (Forced)

• A preemption occurs, i.e. a stronger priority thread is runnable (Forced)
  - If the stronger thread is real-time, preemption is immediate.
  - If the stronger thread is not real-time, preemption occurs at the next convenient time.

• A thread becomes non-computable (Voluntary), i.e.:
  - it goes to sleep
  - it is stopped
  - it exits


4–9. SLIDE: Priority Queues

Student Notes

Every process has a priority associated with it at creation time. These priorities determine the order in which processes execute on the CPU: processes with stronger priorities always execute before processes with weaker priorities. In UNIX, stronger priorities are represented by smaller numbers and weaker priorities by larger numbers.

HP-UX uses adjustable priorities to schedule its time slicing for general timeshare processes generated by all users (priorities 128-255). By that we mean a process's priority can be adjusted, up or down, by the kernel, according to how "favored" the process is. In general, the more a process executes, the less favorably it is treated by the kernel. However, since HP-UX also supports real-time processing, it must include priority-based scheduling for those processes (priorities 0-127). As of HP-UX 10.X, support is also provided for POSIX real-time processes (priorities -32 through -1). The /usr/include/sys/param.h file contains some extra information on the priorities used in the system.

Each processor in an HP system has its own run queue. Each run queue is further broken down into multiple priority queues, to make it easier for that processor to select the most deserving process to run.

Priority Queues

(Slide diagram: the priority number line. POSIX real-time (rtsched) priorities run from -32 to -1 and HP-UX real-time (rtprio) priorities from 0 to 127, each queue one priority wide. Timeshare priorities run from 128 to 255 in queues four priorities wide: system-level priorities 128-177 and user-level priorities 178-255. The slide also marks the landmark priorities PSWP (128), PZERO (153), and PUSER (178), and which ranges represent signalable versus nonsignalable sleep priorities.)


Real-Time Process Priorities

Real-time priority queues are one wide, i.e. each queue represents a single priority value. The strongest priority real-time process preempts all others of weaker priority and runs until it sleeps, exits, is preempted by a stronger real-time process, or is timesliced by an equal-priority real-time process. Equal priority real-time processes run in a round-robin fashion. A process can be made to run with a real-time priority by using the rtprio(1) or rtsched(1) command. The rtsched command can also be used to disable timeslicing for a particular process by assigning it a different scheduling policy. Because a real-time process executes at the expense of all timeshare processes, make sure that you consider the impact on your users before invoking the command. A CPU-bound, real-time process will halt all other use of the system. One POSIX real-time process, ttisr, runs on HP-UX at priority -32.

Time Share Process Priorities

Timeshare priority queues are four wide, i.e. each priority queue represents four adjacent priority values. For example, the first timeshare priority queue is used by processes with priorities of 128, 129, 130, and 131. Timeshare processes are grouped into system and user processes. Priorities 128-177 are reserved for runnable system processes and for sleeping processes (both system and user), and priorities 178-255 are for runnable user processes. Each timeshare process is assigned a nice value, which the kernel uses as a factor when calculating a new priority for the process, i.e. when deciding how to "adjust" the process's priority. Nice values have no effect on real-time processes.
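A process's current priority and nice value can be inspected with ps. The pri and nice format keywords below are common to HP-UX and most other ps implementations, though the numeric priority scale displayed is system-specific:

```shell
# Show the priority and nice value of the current shell.
# On HP-UX, timeshare user priorities fall in the 178-255 range
# and the default nice value is 20; other systems use other scales.
ps -o pid,pri,nice,comm -p $$
```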


4–10. SLIDE: Nice Values

Student Notes Time shared processes are all initially assigned the priority of the parent when they are spawned. The user can make modifications to how much the kernel “favors” a process with the nice value. Timeshare processes lose priority as they execute, and regain priority as they wait their turns. The rate at which a process loses priority is linear, but the rate at which it regains priority is exponential. A process's nice value is used as a factor in calculating how fast a process regains priority. The nice value is the only control a user has to give greater or less favor to a time share process. The default nice value is 20. Therefore, to make a process run at a weaker priority, it should be assigned a higher nice value (maximum value 39). The superuser can assign a lower nice value to a process (minimum value 0), effectively giving it a stronger priority.
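One way to watch the nice value take effect is to start nice under nice: with no command, most nice implementations print the niceness they inherited. On HP-UX the default niceness is 20, so nice -n 5 would yield 25; the GNU coreutils nice assumed here counts from a default of 0 and prints 5 instead, but the mechanism is the same:

```shell
# The outer nice weakens the niceness by 5; the inner nice, run with
# no command, prints the niceness it inherited (5 when the base is 0).
nice -n 5 nice
```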

Nice Values

(Slide diagram: priority over time for two processes, ProcA with nice=20 and ProcB with nice=39, on a priority axis spanning 177 to 255. Both lose priority while running and regain it while sleeping; ProcB, with the weaker nice value, regains priority more slowly, so it spends more intervals sleeping while ProcA runs.)


4–11. SLIDE: Parent-Child Process Relationship

Student Notes

One item to keep in mind related to process management is the relationship between parent and child processes. Every process started from a terminal window on the system has a parent process that spawned it. The parent process does not terminate once a child is spawned. Instead, it goes to sleep waiting for the child to finish executing. If a child process does not exit properly (for example, if it spawns a new process rather than exiting back to its parent), then the system could end up with many processes sleeping in memory and using proc table entries unnecessarily.

The example in the slide shows a ksh shell that spawns a sam process. Within sam, the system administrator shells out and uses su to become a regular user. Once in the login shell, the user starts glance. From within glance, the user shells out and then decides to switch to a csh shell. This string of events causes eight different processes to be started. If the user decides to return to sam by typing sam, would the previous sam process be reactivated, or would a new sam process be spawned? (Answer: a new sam process is spawned.)
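The parent-child relationships described above can be verified with ps by following PPID values; a minimal sketch:

```shell
# Spawn a child and confirm that its parent PID is this shell.
sleep 30 &
child=$!
echo "this shell: $$"
ps -o pid,ppid,comm -p "$child"    # the PPID column matches $$
kill "$child"
```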

Parent-Child Process Relationship

(Slide diagram: the process tree in memory, tracked by the kernel's OS tables. The original ksh spawns sam; sam shells out to sh, which runs su; su starts a login ksh, which runs glance; glance shells out to sh, from which csh is started.)


4–12. SLIDE: glance — Process List

Student Notes

The next four slides illustrate how the management of processes can be monitored through glance. Topics just covered (like kernel versus user CPU time, process components, process wait states, nice values, and process priorities) can all be viewed through glance.

The first global bar graph, which displays on every glance screen, is CPU Util. It shows how CPU time is being distributed:

• S = System or kernel time

• N = Niced user time (processes whose nice value has been set greater than 20, i.e. 21-39)

• U = User time (processes with the default nice value of 20)

• A = User time for processes whose nice value has been set less than 20 (0-19); in other words, "anti-niced" processes

• R = Real-time (processes with priorities 127 and stronger)

glance – Process List

B3692A GlancePlus B.10.12    14:52:27  e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util |                                                     22%   29%   51%
Disk Util |                                                      1%    7%   13%
Mem  Util |                                                     91%   91%   91%
Swap Util |                                                     25%   24%   35%
--------------------------------------------------------------------------------
PROCESS LIST                                                 Users=  11
                                 User     CPU Util    Cum     Disk        Thread
Process Name     PID   PPID  Pri Name    ( 100 max)   CPU    IO Rate  RSS  Count
--------------------------------------------------------------------------------
netscape       16013  12988  154 sohrab   12.9/14.0   64.9  0.0/ 0.6 14.7mb    1
supsched          18      0  100 root      2.9/ 2.1  942.6  0.0/ 0.0   16kb    1
lmx.srv         1219   1121  154 root      1.6/ 0.9  389.4  0.5/ 0.0  2.7mb    1
glance         15726  15396  156 root      0.6/ 0.9    2.0  0.0/ 0.2  4.0mb    1
statdaemon         3      0  128 root      0.6/ 0.7  302.1  0.0/ 0.0   16kb    1
midaemon        1051   1050   50 root      0.4/ 0.4  201.4  0.0/ 0.0  1.3mb    2
ttisr              7      0  -32 root      0.4/ 0.3  121.0  0.0/ 0.0   16kb    1
dtterm         15559  15558  154 roc       0.4/ 0.4    1.6  0.0/ 0.0  6.2mb    1
rep_server      1098   1084  154 root      0.2/ 0.1   23.7  0.0/ 0.0  2.0mb    1
syncer           325      1  154 root      0.2/ 0.0   20.2  0.1/ 0.0  1.0mb    1
xload          13569  13531  154 al        0.2/ 0.0    2.4  0.0/ 0.0  2.6mb    1
                                                                   Page 1 of 13



The Process List screen (g key), as shown on the slide, can be used to see process priorities. The order in which the processes are displayed can be configured (o key) to display by CPU usage, memory usage, or disk I/O activity. In HP-UX version 10.X, the thread count column was the blocked on column. The blocked on information can still be obtained by looking at the individual processes’ resource summary screens.


4–13. SLIDE: glance — Individual Process

Student Notes From the Process List screen, an individual process can be selected for further analysis (s key). The above slide shows some of the additional details available when analyzing a process further. Items of interest from the Individual Process screen include the process's nice value, the number of Forced versus Voluntary context switches, the current Wait reason, and the Parent PID.

glance – Individual Process

B3692A GlancePlus B.10.12    15:17:52  e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util |                                                     22%   29%   51%
Disk Util |                                                      1%    7%   13%
Mem  Util |                                                     91%   91%   91%
Swap Util |                                                     25%   24%   35%
--------------------------------------------------------------------------------
Resource Usage for PID: 16013, netscape   PPID: 12988   euid: 520   User: sohrab
--------------------------------------------------------------------------------
CPU Usage (sec) :  3.38   Log Reads :   166   Wait Reason    : SLEEP
User/Nice/RT CPU:  2.43   Log Writes:    75   Total RSS/VSS  : 22.4mb/ 28.3mb
System CPU      :  0.73   Phy Reads :     4   Traps / Vfaults: 414/ 8
Interrupt CPU   :  0.14   Phy Writes:    61   Faults Mem/Disk: 0/ 0
Cont Switch CPU :  0.08   FS Reads  :     4   Deactivations  : 0
Scheduler       :  HPUX   FS Writes :    29   Forks & Vforks : 0
Priority        :   154   VM Reads  :     0   Signals Recd   : 339
Nice Value      :    24   VM Writes :     0   Mesg Sent/Recd : 775/ 1358
Dispatches      :  1307   Sys Reads :     0   Other Log Rd/Wt: 3924/ 957
Forced CSwitch  :   460   Sys Writes:    32   Other Phy Rd/Wt: 0/ 0
VoluntaryCSwitch:   814   Raw Reads :     0   Proc Start Time
Running CPU     :     0   Raw Writes:     0   Fri Feb  6 15:14:45 1998
CPU Switches    :     0   Bytes Xfer: 410kb



4–14. SLIDE: glance — Process Memory Regions

Student Notes From the Individual Process screen, the memory regions (i.e. process components) corresponding to that process can be viewed (M key). The above slide shows the memory regions for the currently selected process. Items of interest from the Memory Region screen include the location of the process's Text, Data, Stack, and U-Area, along with its Shared/Private flag, its Resident Set Size and Virtual Set Size, and its reference count. If the process is associated with Memory Map files (MEMMAP), Shared Libraries (LIBTXT), or Shared Memory Segments (SHMEM), these will be displayed. In HP-UX version 11.X, glance no longer displays the addresses of each memory region. However, gpm still does.

glance – Process Memory Regions

B3692A GlancePlus B.10.12    10:17:41  e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util |                                                     22%   29%   51%
Disk Util |                                                      1%    7%   13%
Mem  Util |                                                     91%   91%   91%
Swap Util |                                                     25%   24%   35%
--------------------------------------------------------------------------------
Memory Regions for PID: 16013, netscape   PPID: 14061   euid: 520   User: sohrab

Type           RefCt    RSS    VSS  Locked  File Name
--------------------------------------------------------------------------------
NULLDR/Shared     64    4kb    4kb     0kb  <nulldref>
TEXT  /Shared      3  4.3mb  9.5mb     0kb  /opt/…/netscape-bin
DATA  /Priv        1  5.8mb  8.6mb     0kb  /opt/…/netscape-bin
MEMMAP/Priv        1    4kb   20kb     0kb  /opt/…/netscape-bin
MEMMAP/Priv        1   36kb   36kb     0kb  /opt/…/netscape-bin
MEMMAP/Priv        1   12kb   12kb     0kb  <memmap>
STACK /Priv        1   28kb   28kb     0kb  <stack>
UAREA /Priv        1   16kb   16kb     0kb  <uarea>
LIBTXT/Shared     85   56kb   60kb     0kb  /usr/lib/dld/sl

Text  RSS/VSS: 4.3mb/9.5mb   Data  RSS/VSS: 5.8mb/8.6mb   Stack RSS/VSS: 28kb/ 28kb
Shmem RSS/VSS:   0kb/  0kb   Other RSS/VSS: 4.1mb/5.7mb



4–15. SLIDE: glance — Process Wait States

Student Notes From the Process List screen, the process wait states can be viewed (W key). The above slide shows the categories of wait states and where/what the selected process has waited on. Items of interest from the Process Wait State screen include the percentage of time the process has spent in each of the possible wait state categories.

glance – Process Wait States

B3692A GlancePlus B.10.12    10:23:03  e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util |                                                     22%   29%   51%
Disk Util |                                                      1%    7%   13%
Mem  Util |                                                     91%   91%   91%
Swap Util |                                                     25%   24%   35%
--------------------------------------------------------------------------------
Wait States for PID: 14205, netscape   PPID: 14061   euid: 520   User: sohrab

Event             %    Blocked On      %
--------------------------------------------------------------------------------
IPC        :  0.0   Cache      :  0.0   CPU Util   : 13.7
Job Control:  0.0   CDROM IO   :  0.0   Wait Reason: SLEEP
Message    :  0.0   Disk IO    :  0.0
Pipe       :  0.0   Graphics   :  0.0
RPC        :  0.0   Inode      :  0.0
Semaphore  :  0.0   IO         :  0.0
Sleep      : 77.2   LAN        :  0.0
Socket     :  0.0   NFS        :  0.0
Stream     :  0.0   Priority   :  9.1
Terminal   :  0.0   System     :  0.0
Other      :  0.0   Virtual Mem:  0.0

C - cum/interval toggle    % - pct/absolute toggle                   Page 1 of 1



4–16. LAB: Process Management

Directions

The following lab is designed to manage a group of processes. This includes observing the parent-child relationship and modifying process nice values (and thus indirectly priorities) with the nice and renice commands.

Modifying Process Priorities

This portion of the lab uses glance to monitor and modify nice values of competing processes.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Start seven long processes in the background.

# ./long & ./long & ./long & ./long & ./long & ./long & ./long &

3. Start a glance session. Answer the following questions.

How much CPU time is each long process receiving? _______ sec, _______ %

How are the processes being context switched (forced or voluntary)? _______________

How many times over the interval is each process being dispatched? ____________

What is the ratio of system CPU time to user CPU time? __________

What are the processes being blocked on? _________________

What are the nice values for the processes? _________

4. Select one of the processes and favor it by giving it a more favorable nice value.

What is the PID of the process being favored? __________

To change the process's nice value, enter:

# renice -n -5 <PID of selected process>

Watch that process's percentage of the CPU over several display intervals with glance or top.

What effect did it have on the process? _____________________________
____________________________________________________________________


5. Select another long process and set its nice value to 30 (add 10 to the default of 20):

# renice -n 10 <PID of another selected process>

What effect did that have on that process? ___________________________________
______________________________________________________________________

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

# kill $(ps -el | grep long | cut -c18-22)


Module 5 CPU Management

Objectives

Upon completion of this module, you will be able to do the following:

• Describe the components of the processor module.

• Describe how the TLB and CPU cache are used.

• List four CPU related metrics.

• Identify how to monitor CPU activity.

• Discuss how best to use the performance tools to diagnose CPU problems.

• Specify appropriate corrections for CPU bottlenecks.


5–1. SLIDE: Processor Module

Student Notes

A typical HP processor module consists of a central processing unit (CPU), a cache, a translation lookaside buffer (TLB), and a coprocessor. These components are connected via internal processor busses, with the entire processor module being connected to the system bus.

The cache is made up of very high-speed memory chips and can be accessed in one CPU cycle. Its contents are instructions and data that recently have been, or are anticipated to soon be, used by the CPU. Cache size varies between processors, and the size of the cache can have a big effect on system performance.

The translation lookaside buffer (TLB) is used to translate virtual addresses into physical addresses. It is a high-speed cache whose entries consist of pairs of recently accessed virtual addresses and their associated physical addresses, along with access rights and an access ID. The TLB is a subset of a system-wide translation table (page directory) that is held in memory. TLB size also affects system performance, and different HP 9000 processors have different TLB sizes.

Processor Module

(Slide diagram: a processor module containing the CPU, TLB, cache, and coprocessor, connected to the system bus.)


The address translations kept in the TLB enable us to locate the appropriate data and instructions in the memory. The memory is accessed via the physical address. Without the translation in the TLB, we would not be able to find the information in the memory. Note these other points regarding the TLB:

• Each process has a unique virtual address space.

• Each TLB entry refers to a page of memory, not a single location. In all 64-bit architectures used by HP, pages are fundamentally 4KB in size, but can be any multiple of 4K under various circumstances – to reduce the number of entries needed in the TLB.


5–2. SLIDE: Symmetric Multiprocessing

Student Notes Symmetric Multiprocessing (SMP) refers to systems containing two or more processor units. SMP is implemented on all Hewlett-Packard workstations and servers capable of supporting more than one CPU. Each processor on an SMP system has exactly the same characteristics, including the same processing unit, the same CPU cache design, and the same size translation lookaside buffer (TLB).
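The number of processors an SMP system has online can be queried from the shell. The getconf variable name below is the one used on Linux and several other platforms; on HP-UX the same information is available through tools such as ioscan or machinfo, so treat the exact variable name as an assumption:

```shell
# Count of processors currently online
# (1 on a uniprocessor, more on an SMP system).
getconf _NPROCESSORS_ONLN
```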

Symmetric Multiprocessing

(Slide diagram: two identical processor modules, each with its own CPU, TLB, cache, and coprocessor, sharing the system bus.)


5–3. SLIDE: Cell Module

Student Notes

A more recent design of HP systems is based on the "cell" architecture. In a cell, there are multiple processors, some memory, and some I/O buses. Each cell can act as an independent SMP system, or as part of a collection of cells forming a larger SMP system.

Each processor in a cell has the same access speed (or latency) to the memory within that cell. However, if one of those processors has to access a location in the memory of a different cell, the latency is greater. Each processor within the cell does have its own cache memory and TLB. Each processor has equal access to the I/O buses that are part of the same cell; it may also have access (with somewhat greater delays) to the I/O of other cells in the same system.

Cell Module

[Slide: a cell internal bus connecting four processors, memory, and the cell's I/O buses]


5–4. SLIDE: Multi-Cell Processing

Student Notes The best example HP currently has of an SMP system using the cell architecture is the Superdome. The slide shows four cells, each with four processors, some memory, and some I/O buses (a full Superdome scales to 16 cells).

Each cell can be configured (using Node Partitioning, or NPars) as a separate, individual system capable of booting its own operating system, functionally apart from the other cells. The only way the operating system on that cell could communicate with software running on any other cell would be through a network interface. Alternatively, multiple cells can be configured to act as a unit: they pool their resources, boot a single operating system, and seamlessly act as one SMP system.

This architecture gives the customer and the system administrator tremendous flexibility in how to set up their hardware, and the configuration can be changed relatively easily as needs change.

On a wider range of systems, you may be using Virtual Partitioning (VPars). These are similar to NPars, but are not limited to cell boundaries and are handled entirely by software. A system can use both NPars and VPars at the same time, and processors can be moved from one VPar to another under software control.

Multi-Cell Processing

[Slide: four cells, each containing four processors (P), memory, and I/O buses, joined by a high-speed memory interconnect]


Finally, on an even wider range of systems, we have the concept of processor sets (psets). Multiple psets can exist within the same partition (either NPar or VPar). Each pset is set aside for use by a particular application or group of applications. Using software, psets can be created and removed, and processors can be moved from one pset to another.


5–5. SLIDE: CPU Processor

Student Notes The CPU is ultimately responsible for your system speed. The kernel loads the process text for the CPU to execute. The processor module has many registers that assist in the execution of instructions; defining all of these registers is beyond the scope of this course. The primary objective of this module is to focus on CPU clock speed, the size of the CPU cache, and the effects of the TLB on overall system performance.

Each HP 9000 server and workstation has such a chip at its heart. The latest PA-RISC chips are the 64-bit PA-8xxx series. HP has also introduced systems using the 64-bit Itanium (IA-64) chip. A selection of current systems is listed on the following pages; note the differences not only in clock speed, but also in cache size.

The following tables list the specifics of several HP-UX servers and workstations. It is very difficult to keep a list of this nature up to date in training materials; it is included merely to demonstrate the wide variety of system characteristics present in the HP computing products family.

CPU Processor

[Slide: a CPU with its TLB, cache, and coprocessor, plus its register sets: shadow registers, space registers, general registers, control registers, the process status word, instruction address queues, special function unit registers, and coprocessor registers]


Business Servers

Model       (Chip)        No. of CPUs     Clock Speed  Max. RAM (GB)  Cache               I/O Slots
rp3410-2    (PA-8800)       2             800 MHz         6           1.5MB(L1) 32MB(L2)  2 PCI (64-bit)
rp3440-4    (PA-8800)       4             1 GHz          24           1.5MB(L1) 32MB(L2)  4 PCI (64-bit)
rp4440-8    (PA-8800)       8             1 GHz          64           1.5MB(L1) 32MB(L2)  6 PCI
rp7420-16   (PA-8800)      16 (2 cells)   1 GHz          64           1.5MB(L1) 32MB(L2)  15 PCI
rp8420-32   (PA-8800)      32 (4 cells)   1 GHz         128           1.5MB(L1) 32MB(L2)  16 PCI
Superdome   (PA-8800)     128 (16 cells)  1 GHz        1024           1.5MB(L1) 32MB(L2)  192 PCI
rx1600      (Itanium 2)     2             1 GHz          16           1.5MB(L3)           0/1/1 PCI *
rx2600      (Itanium 2)     2             1.5 GHz        24           6MB(L3)             0/4/0 PCI *
rx4640      (Itanium 2)     4             1.5 GHz        64           6MB(L3)             0/4/2 PCI *
rx5670      (Itanium 2)     4             1.5 GHz        96           6MB(L3)             0/6/3 PCI *
rx7620      (Itanium 2)     8 (2 cells)   1.5 GHz        64           6MB(L3)             15 PCI (128-bit)
rx8620      (Itanium 2)    16 (4 cells)   1.5 GHz       128           6MB(L3)             16+16 PCI (128-bit)
Superdome   (Itanium 2)    64 (16 cells)  1.5 GHz       512           6MB(L3)             0/128/64 PCI *


Workstations

Model     (Chip)       No. of CPUs  Clock Speed  Max. RAM (GB)  Cache (KB)  I/O Slots
B2600     (PA-8600)     1           500 MHz       4             512/1024    2/2/0 PCI *
B3700     (PA-8700)     1           750 MHz       8             768/1536    2/3/1 PCI *
C3750     (PA-8700+)    1           875 MHz       8             768/1536    2/3/1 PCI *
J6750     (PA-8700+)    2           875 MHz      16             768/1536    0/0/3 PCI *
zx2000    (Itanium 2)   1           1.4 GHz       8             1536(L3)    5 PCI, 1 AGP
zx6000    (Itanium 2)   2           1.5 GHz      24             6144(L3)    3 PCI, 1 AGP

* 2/3/1 means 2 32-bit PCI slots, 3 64-bit PCI slots, and 1 128-bit PCI slot.

All Itanium 2 processors include 32KB of L1 cache and 256KB of L2 cache.

To determine the specifics of your system, refer on-line to http://www.hp.com/go/enterprise, select "Products Index", and scroll down to select your system platform name [e.g. J-Class (HP 9000)]. This will display the "Product Information" screen for the selected hardware.


5–6. SLIDE: CPU Cache

Student Notes The CPU loads instructions from memory and runs multiple instructions per cycle. To minimize the time the CPU spends waiting for instructions and data, it uses a cache: a very high-speed memory, accessible in one CPU cycle, whose contents are a subset of the contents of main memory. As the CPU requires instructions and data, they are loaded into the cache.

The size of the cache has a large bearing on how busy the CPU is kept. The larger the cache, the more likely it is to contain the instructions and data to be executed.

Most current processors support multi-level caches. The Level 1 cache (L1) is the fastest, operating at the same speed as the CPU, and is relatively small. The Level 2 cache (L2) operates at one-half the speed of the CPU and is somewhat larger. IA-64 processors add a Level 3 cache (L3) that is larger and slower still.

CPU Cache

[Slide: a CPU (with TLB, cache, and coprocessor) attached via the system bus to memory; process text in memory is copied into the cache, from which the instruction to execute is fetched]


5–7. SLIDE: TLB Cache

Student Notes All 32-bit programs view their address space as starting at address 0 and ending at address 4 GB. All addresses referenced by the program are relative to this address space, which is referred to as the program's virtual address space. A program's physical address is the location in physical memory where the program is loaded at execution time.

When the CPU executes a program, it is presented with the virtual address containing the instruction to be executed. To fetch this instruction from physical memory, the CPU must convert the virtual address (VA) into the corresponding physical address (PA). To do this, the CPU checks the TLB. If the VA-to-PA translation is present, the CPU knows the PA of the instruction in memory. If it is not present, the CPU must fetch the translation from the PDIR (Page DIRectory) table in memory; this PDIR fetch is relatively expensive from a performance standpoint.

Once the PA is known, the CPU checks the instruction cache on the CPU for the PA. If the PA is present, it loads the instruction straight from the instruction cache. If not, it must fetch the instruction from memory, which again is relatively expensive (performance-wise).

TLB Cache

[Slide: the CPU's TLB maps virtual addresses (VA, 0-4 GB) to physical addresses (PA); the VA/PA page directory (PDIR) resides in memory, reached over the system bus, alongside the process text and the instruction to execute]


On both PA-RISC and IA-64 processors, the size of the TLB ranges from roughly 96 to 160 entries, and each entry points to a (variable-sized) page of memory.


5–8. SLIDE: TLB, Cache, and Memory

Student Notes The slide shows the permutations of hits and misses on the TLB, the cache, and memory, along with the consequence of each.

The best situation is when the VA has an entry in the TLB and the corresponding PA has an entry in the CPU cache. This allows the instruction or data to be presented to the CPU in one clock cycle.

The next-best scenario is a hit on the TLB but a miss on the CPU cache. A representative cost to fetch a PA from memory into the CPU cache is 50 clock cycles.

Another scenario is a miss on the TLB but a hit on the CPU cache. A TLB miss requires the PDIR table in memory to be searched, and an appropriate entry to be loaded into the TLB. This takes a variable number of cycles; on one model the average was 131 clock cycles. A miss on the TLB is therefore more expensive than a miss on the CPU cache.

A miss on both the TLB and the CPU cache translates into 131 + 50 = 181 clock cycles, on average, to access the instruction or data the CPU needs. The same access would have taken 1 clock cycle had the VA been in the TLB and the PA been in the CPU cache.

TLB, Cache, and Memory

TLB     Cache   Memory   Consequence
Hit     Hit     Hit      1 CPU cycle fetch
Hit     Miss    Hit      Data/instruction memory fetch
Miss    X       Hit      PDIR memory fetch
Miss    X       Miss     Page fault

X = Don't Care
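Putting the example numbers together, the average cost of a memory reference can be estimated by weighting each miss penalty by its frequency. This is only a sketch: the 1-, 50-, and 131-cycle costs come from the notes above, while the 98% TLB and 95% cache hit rates are hypothetical figures chosen to illustrate the calculation.

```shell
# Expected cycles per reference = base cost + miss penalties weighted
# by miss probability. Hit rates here are hypothetical.
awk 'BEGIN {
    tlb_hit   = 0.98                     # hypothetical TLB hit rate
    cache_hit = 0.95                     # hypothetical cache hit rate

    cycles  = 1                          # every access costs one cycle
    cycles += (1 - cache_hit) * 50       # cache miss: memory fetch
    cycles += (1 - tlb_hit)   * 131      # TLB miss: PDIR lookup

    printf "average cycles per reference: %.2f\n", cycles
}'
```

Even with these high hit rates, the misses dominate: the average comes out to roughly 6 cycles per reference instead of 1.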


The worst scenario, performance-wise, is when the instruction or data is not loaded in memory at all. In this case, a page fault occurs to retrieve the information from disk. Assuming a 1-GHz clock, a 10-ms disk access time, and an idle disk drive, this corresponds to 10,000,000 clock cycles to access the data or instruction.
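The 10,000,000-cycle figure follows directly from the clock rate and the disk access time; a quick check of the arithmetic:

```shell
# Cycles lost to one page fault = clock rate (cycles/sec) * disk time (sec).
# Using the assumptions from the notes: 1 GHz clock, 10 ms disk access.
clock_hz=1000000000      # 1 GHz = 10^9 cycles per second
disk_ms=10               # one disk access takes 10 milliseconds

cycles=$((clock_hz / 1000 * disk_ms))
echo "cycles lost per page fault: $cycles"
```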


5–9. SLIDE: HP-UX — Performance Optimized Page Sizes

Student Notes HP-UX 11.00 is the first release of the operating system with general support for performance optimized page sizes (POPS), also known as variable page sizes. Partial support for variable memory page sizes has existed since HP-UX 10.20.

HP-UX 11.00 allows customers to configure executables to use specific performance optimized page sizes, based on the program's text and data sizes. Page sizes can be selected from a range of 4 KB to 4 GB. The use of performance optimized page sizing can significantly increase the performance of applications that have very large data or instruction sets.

NOTE: Performance-optimized page sizing works on PA-8000-based and IA-64-based systems.

Fixed Page Sizes (Prior to 11.00)

Prior to HP-UX 11.00, all page sizes were fixed at 4 KB. As a program executed, each 4 KB page would be mapped into physical memory, and a TLB entry would be created to map the virtual address corresponding to that page to the physical memory address.

HP-UX 11.00 — Performance Optimized Page Sizes (POPS)

[Slide: compares address translation under HP-UX 10.x (fixed 4 KB page size) and HP-UX 11.00 (variable page sizes, 4 KB to 64 MB). With fixed 4 KB pages, mapping a file from the filesystem into memory requires one TLB entry per 4 KB page; with variable page sizes, the same mapping is covered by far fewer, larger TLB entries.]


Selected models had a few "Block" TLB entries, which could map multiple pages with a single entry, provided the pages were contiguous in both the virtual and physical address spaces. These entries were reserved for mapping the kernel, the I/O pages, and other segments locked into memory.

At some point the TLB would become full, and some virtual-to-physical address mappings would be stored only in the PDIR table in memory, not in the TLB on the CPU. When such a virtual address needed to be translated, time had to be spent looking up the address in the PDIR table in memory. This handling of a TLB miss was expensive in terms of performance.

Performance Optimized Page Sizes (11.00 and Beyond)

With the release of HP-UX 11.00, support for variable page sizes is available. With POPS, a larger portion of the process's virtual address space can be referenced within a single page, or within a few large pages. Therefore, a larger portion of the process can be referenced with far fewer TLB entries. The following table shows the page sizes available in the PA-RISC and IA-64 architectures:

PA-RISC    IA-64
4K         4K
-          8K
16K        16K
64K        64K
256K       256K
1M         1M
4M         4M
16M        16M
64M        64M
256M       256M
1G         -
-          4G
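The benefit of the larger sizes is easy to quantify: the number of TLB entries needed to map a segment is the segment size divided by the page size. A sketch of that calculation, using a hypothetical 64 MB data segment:

```shell
# TLB entries needed = segment size / page size (both in KB here).
# The 64 MB segment is a hypothetical example.
seg_kb=$((64 * 1024))                    # 64 MB expressed in KB

for page_kb in 4 256 4096 65536; do      # 4 KB, 256 KB, 4 MB, 64 MB pages
    echo "page ${page_kb} KB: $((seg_kb / page_kb)) TLB entries"
done
```

With 4 KB pages the segment needs 16,384 translations, far more than any TLB holds; with a single 64 MB page it needs exactly one.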

Affecting Page Sizes

There are two methods of affecting the page size of a process.

The first is through tunable kernel parameters. vps_pagesize determines the default page size used in the absence of any other information; the size is given in 1 KB units, and the setting is typically 4 (i.e. 4 KB). vps_ceiling determines how far the kernel can "promote" the page size of a process if it notices the process taking a very large number of TLB misses; its default setting is 16 (again in 1 KB units, i.e. 16 KB).

The second method is available to the system administrator: the chatr command can provide the kernel with a hint of which page sizes would work best for a given executable. Following is an example of this command.

chatr -pi 16 -pd 256 /opt/app/bin/app

The above command would hint to the kernel that this process would best execute with 16K pages for the instructions (text) and 256K pages for the data. This hint would be stored in the


header of the executable file and be visible to the kernel whenever the program was invoked. The kernel would do its best to see that the hint is followed. However, if memory pressure exists, the kernel may not be able to honor the request and may end up “demoting” the size of the page to be able to manage it in memory. There is a third tunable parameter, vps_chatr_ceiling, that determines the maximum value a chatr command can assign to an executable file.


5–10. SLIDE: CPU — Metrics to Monitor Systemwide

Student Notes The load on the CPU can be monitored in a number of different ways. There are multiple tools and multiple metrics that monitor CPU performance.

User CPU Utilization

This is the percentage of time the CPU spent running in user mode. This corresponds to executing code within user processes, as opposed to code within the kernel. It is better to see user CPU utilization higher than system CPU utilization (preferably two to three times higher).

Nice/Anti-Nice Utilization

This is the percentage of time the CPU spent running user processes with nice values of 21-39 (Nice) or 0-19 (Anti-Nice). This is typically included in USER CPU utilization, but some tools, like glance, track this separately to see how much CPU time is being spent on weaker or stronger priority processes.

CPU — Metrics to Monitor Systemwide

• User CPU utilization
• Nice/Anti-Nice utilization
• Real time processes
• System CPU utilization
• System call rate
• Context switch rate
• Idle CPU utilization
• CPU run queues (load averages)


Real Time Processes

This is the amount of time spent executing real time processes that are running on the system. Real time processes get the CPU immediately when they are ready to execute, and can have a big impact on the performance of time-shared processes.

System CPU Utilization

This is the percentage of time the CPU spent running in system (or kernel) mode. This corresponds to executing code within the kernel. Some kernel time is necessary just to do minimum management of the system, but excessive time spent managing the system is bad for performance. System CPU utilization is generally considered excessive when it exceeds user CPU utilization.

System Call Rate

The system call rate is the rate at which system calls are being generated by the user processes. Every system call causes a switch to occur between user mode and system (or kernel) mode. A high system call rate typically corresponds to a high system CPU utilization. If the system call rate is high, it is recommended to investigate which system calls are being generated, the frequency of each system call, and the average duration of each system call.

Context Switch Rate

This is the number of times the CPU switched processes (on average) per second. This is typically included in system CPU utilization, but some tools, like glance, track this separately.

Idle CPU

This is the percentage of time the CPU spent doing nothing (i.e. executing no user or kernel code). It is good to see some idle CPU time, even a lot of it. A never-idle CPU means the CPU run queue is never exhausted (or emptied), so processes always have to wait before reaching the CPU. The length of the line (the CPU run queue) grows as idle CPU time approaches 0.

CPU Run Queues/Load Average

Both these terms reference the same thing. This is the number of processes in the CPU run queue. For best performance, the average load in the CPU run queue should not exceed three.
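The run-queue guideline above can be checked mechanically against sar -q style output. A sketch (not an official HP tool), with made-up sample data embedded so it is self-contained:

```shell
# Flag intervals whose run queue length (2nd column of sar -q output)
# exceeds the rule-of-thumb threshold of 3. Sample data is made up.
samples="08:33:29 8 100 0 0
08:33:34 2 100 0 0
08:33:39 8 100 0 0"

echo "$samples" | awk '$2 > 3 { print $1, "runq-sz =", $2, "(over threshold)" }'
```

In practice you would pipe real `sar -q` output into the awk filter instead of the embedded sample.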


5–11. SLIDE: CPU — Metrics to Monitor per Process

Student Notes Individual processes vary greatly in terms of the load they place on the CPU. Metrics to monitor on an individual process include the following.

Process Priority

This is the priority of the process. If the priority is 127 or less, it is a real time process. If the priority is 128-177, it is either a system process or a user process that is sleeping. If the priority is 178-255, it is a user process executing in USER mode.
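The priority bands just described can be expressed as a small helper. This is only an illustrative sketch of the classification rules from the text, not an HP-UX utility:

```shell
# Classify an HP-UX priority number into the bands described above:
# 0-127 real-time, 128-177 system (or sleeping user), 178-255 user.
classify() {
    if   [ "$1" -le 127 ]; then echo "real-time"
    elif [ "$1" -le 177 ]; then echo "system (or sleeping user process)"
    else                        echo "user time-share"
    fi
}

classify 100
classify 154
classify 200
```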

Process Nice Value

This is the nice value associated with the process. This only applies to time-share processes. This value determines how fast the process regains priority while it is waiting for the CPU. Small nice values (0-19) should be given to more important processes allowing them to regain priority quickly. Large nice values (21-39) should be given to less important processes, causing them to regain priority slowly.

CPU — Metrics to Monitor per Process

• Process priority
• Process nice value
• Amount of CPU user time
• Amount of CPU system time


User CPU Time vs. System CPU Time

This is the percentage of time the individual process spent in user mode (i.e. having the CPU execute user code) and system mode (i.e. having CPU execute kernel code). This is helpful in determining where the CPU spends its time when executing the process: user code or kernel code. It is generally desirable to see more time in user code.


5–12. SLIDE: Activities that Utilize the CPU

Student Notes Examples of activities that place a load on the CPU include the following.

System Activities

System activities are those activities which execute in kernel mode. Examples of system activities include system processes and user processes executing system calls.

• Process startup

• Process scheduling

• File system and raw I/O

• Memory management

• Handling of system calls

Activities that Utilize the CPU

• Process management
• File system I/O
• Memory management activities
• System calls
• Applications (for example, CAD-CAM and database processes)
• Batch jobs


User Activities

User activities are those activities that execute in user mode.

• CAD/CAM applications

• Database processing

• Client/server applications

• Compute-bound applications

• Background jobs (i.e. batch jobs)


5–13. SLIDE: glance — CPU Report

Student Notes The glance CPU report (c key) provides details on where the CPU is spending its time from a global perspective.

• User mode: This is time spent by the CPU in user mode for all processes on the system. This includes processes with a nice value of 20 (user), processes with nice values between 21-39 (nice), processes with nice values between 0-19 (negative nice), and real-time priority processes.

• System mode: This is time spent by the CPU in system mode for all processes on the system. It includes time spent handling general system calls (system), and time spent handling interrupts, context switches, traps, and Vfaults (virtual faults).

• Load Average: This is the number of jobs in the CPU run queue averaged over three time intervals. It includes the average length of the run queue over the last 1 minute, the last 5 minutes, and the last 15 minutes. The CPU load average data is viewable on page 2 of this glance report. Also on page two are the System Call Rate, the Interrupt Rate, and the Context Switch Rate.

glance — CPU Report

B3692A GlancePlus B.10.12   05:00:42   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  25%  20%  47%
Disk Util  |  12%   6%  23%
Mem  Util  |  85%  83%  85%
Swap Util  |  18%  18%  18%
--------------------------------------------------------------------------------
CPU REPORT                                                      Users=    4
State            Current   Average     High      Time   Cum Time
--------------------------------------------------------------------------------
User                18.9       6.0     32.3      0.96       3.61
Nice                 0.0       2.4      5.7      0.00       1.47
Negative Nice        0.4       0.8     16.2      0.02       0.51
RealTime             0.4       0.4      0.7      0.02       0.22
System               3.3       7.0     16.2      0.17       4.21
Interrupt            1.8       1.7      2.7      0.09       1.02
ContextSwitch        0.6       0.7      1.4      0.03       0.40
Traps                0.0       0.0      0.0      0.00       0.00
Vfaults              0.0       0.7      3.6      0.00       0.45
Idle                74.6      80.2     91.2      3.79      48.18

Top CPU user: PID 2097, dthelpview 19.5% cpu util
Active CPUs: 1                                              Page 1 of 2


5–14. SLIDE: glance — CPU by Processor

Student Notes The glance CPU-by-processor report (a key) provides details on a per CPU basis.

CPU Utilization: This is the CPU utilization for the specific processor. If two or more processors exist on the system, the Global CPU Util bar graph shows an average CPU utilization. That is, a CPU that is 100% utilized and a second CPU that is 0% utilized will display 50% CPU utilization. This report displays utilization on a per processor basis.

Load Average: This is the number of processes, on average, in the CPU run queue over the last 1 minute, 5 minutes, and 15 minutes. This report displays CPU run queue information on a per processor basis. Page two of this display shows the Utilization broken down into User mode, Nice, Negative Nice, Realtime, System, Interrupts, Context Switches, Trap and Virtual Faults on a per-processor basis.

glance — CPU by Processor

B3692A GlancePlus B.10.12   05:13:18   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  25%  20%  47%
Disk Util  |  12%   6%  23%
Mem  Util  |  85%  83%  85%
Swap Util  |  18%  18%  18%
--------------------------------------------------------------------------------
CPU BY PROCESSOR                                                Users=    4
CPU  State    Util   LoadAvg(1/5/15 min)   CSwitch   Last Pid
--------------------------------------------------------------------------------
 0   Enable   25.4   0.6/ 0.4/ 0.3           72187       1061
                                                            Page 1 of 2

CPU  Util   User  Nice  NNice  RealTm  Sys  Intrpt  CSwitch  Trap  Vfault
--------------------------------------------------------------------------------
 0   25.4   20.7   0.0    0.0     0.0  4.7     0.0      0.0   0.0     0.0
                                                            Page 2 of 2


5–15. SLIDE: glance — Individual Process

Student Notes The glance individual process report (s key followed by the PID) displays CPU usage for an individual process, and the distribution of CPU time when executing the process (user, system, interrupt, context switch). Ideally, a process should spend more time in User/Nice/RT mode than in any of the other three modes. Also displayed on a per-process basis is the Priority and Nice values for the selected process. In addition, the total number of forced context switches (time slice expiration or process preemptions) and voluntary context switches (process putting itself to sleep) are displayed.

glance — Individual Process

B3692A GlancePlus B.10.12   15:17:52   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  22%  29%  51%
Disk Util  |   1%   7%  13%
Mem  Util  |  91%  91%  91%
Swap Util  |  25%  24%  35%
--------------------------------------------------------------------------------
Resource Usage for PID: 16013, netscape   PPID: 12988   euid: 520   User: sohrab
--------------------------------------------------------------------------------
CPU Usage (sec) :  3.38   Log Reads : 166     Wait Reason    : SLEEP
User/Nice/RT CPU:  2.43   Log Writes:  75     Total RSS/VSS  : 22.4mb/ 28.3mb
System CPU      :  0.73   Phy Reads :   4     Traps / Vfaults: 414/ 8
Interrupt CPU   :  0.14   Phy Writes:  61     Faults Mem/Disk: 0/ 0
Cont Switch CPU :  0.08   FS Reads  :   4     Deactivations  : 0
Scheduler       :  HPUX   FS Writes :  29     Forks & Vforks : 0
Priority        :   154   VM Reads  :   0     Signals Recd   : 339
Nice Value      :    24   VM Writes :   0     Mesg Sent/Recd : 775/ 1358
Dispatches      :  1307   Sys Reads :   0     Other Log Rd/Wt: 3924/ 957
Forced CSwitch  :   460   Sys Writes:  32     Other Phy Rd/Wt: 0/ 0
VoluntaryCSwitch:   814   Raw Reads :   0     Proc Start Time
Running CPU     :     0   Raw Writes:   0     Fri Feb  6 15:14:45 1998
CPU Switches    :     0   Bytes Xfer: 410kb


5–16. SLIDE: glance — Global System Calls

Student Notes The glance global system calls report (Y key) displays all the system calls that have been executed system-wide. When system CPU utilization is high, this report can be used to identify on which system calls the CPU is spending most of its time.

glance — Global System Calls

B3692A GlancePlus B.10.12   05:17:52   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  25%  20%  47%
Disk Util  |  12%   6%  23%
Mem  Util  |  85%  83%  85%
Swap Util  |  18%  18%  18%
--------------------------------------------------------------------------------
GLOBAL SYSTEM CALLS                                             Users=    4
System Call Name     ID     Count    Rate    CPU Time    Cum CPU
--------------------------------------------------------------------------------
syscall-0             0        16     3.1     0.05921    2.19037
fork                  2         0     0.0     0.00000    0.01398
read                  3       105    20.5     0.00210    0.07625
write                 4        47     9.2     0.00208    0.13624
open                  5        16     3.1     0.00143    0.03146
close                 6        16     3.1     0.00040    0.00848
wait                  7         1     0.1     0.00011    0.00031
time                 13        46     9.0     0.00023    0.00446
chmod                15         0     0.0     0.00000    0.00009
ioctl                54       503    57.8     0.00900    0.79813
poll                269       277    48.5     0.00983    1.83466

Cumulative Interval: 87 secs                                Page 1 of 7


5–17. SLIDE: glance — System Calls by Process

Student Notes While examining an individual process, the system calls generated by that particular process can be viewed using the L key. When the system time utilization is high for an individual process, this report can be used to view the specific system calls the process is performing, how many times the system calls are being invoked, and (most importantly) how much time is being spent by the CPU to execute the system calls. The read() and write() system calls often take the most time, as they require physical I/O to the disk drives.

glance — System Calls by Process

B3692A GlancePlus B.10.12   05:39:20   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  22%  29%  51%
Disk Util  |   1%   7%  13%
Mem  Util  |  91%  91%  91%
Swap Util  |  25%  24%  35%
--------------------------------------------------------------------------------
System Calls for PID: 1822, netscape   PPID: 1775   euid: 503   User: roc

                                    Elapsed                     Elapsed
System Call Name   ID  Count  Rate     Time  Cum Ct  CumRate    CumTime
--------------------------------------------------------------------------------
read                3    477   93.5  0.16884    742     49.1    0.24275
write               4    219   42.9  0.02831    352     23.3    0.06787
open                5     63   12.3  0.01396     99      6.5    0.02491
close               6      9    1.7  0.00046     20      1.3    0.00104
time               13     34    6.6  0.00031     89      5.8    0.00083
brk                17     27    5.2  0.00171     45      2.9    0.00264
lseek              19     69   13.5  0.00150    135      8.9    0.00304
stat               38      4    0.7  0.00131     13      0.8    0.00415
ioctl              54    636  124.7  0.01463   1167     77.2    0.02813
utssys             57      0    0.0  0.00000      3      0.1    0.00013

Cumulative Interval: 15 secs                                Page 1 of 3


5–18. SLIDE: sar Command

Student Notes The sar command can be used to display global statistics on several important CPU operations. Using the -u option, information can be displayed on the time the system spent in user mode, in system mode, waiting for (disk) I/O, and idle. Waiting for (disk) I/O is not reported by any other tool; other tools simply lump it in with idle time. An example of sar output with the -u option is shown below:

# sar -u 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio   %idle
08:32:29      64      36       0       0
08:32:34      61      39       0       0
08:32:39      61      39       0       0
08:32:44      61      39       0       0

Average       61      39       0       0
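The guideline from earlier in this module (user time preferably two to three times system time) can be checked against the Average line of such output. A sketch using the sample figures above:

```shell
# Compute the usr/sys ratio from a sar -u "Average" line.
# Fields: label, %usr, %sys, %wio, %idle.
avg="Average 61 39 0 0"

echo "$avg" | awk '{ ratio = $2 / $3; printf "usr/sys ratio: %.2f (%s)\n", ratio, (ratio >= 2 ? "OK" : "system time high") }'
```

Here 61/39 is about 1.56, below the 2x guideline, so the system time on this machine would deserve a closer look.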

sar Command

$ sar option <Interval size> <Number of intervals>

Options:

-u CPU Utilization (usr, sys, wio, idle)

-q Queue lengths/utilization (run, swap)

-M Above information in per-processor format

-c System calls


Using the -q option, information can be displayed on the length and utilization of the run queue and the swap queue. We are most interested at this time in the run queue. An example of the sar output with the -q option is shown below:

# sar -q 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24  runq-sz  %runocc  swpq-sz  %swpocc
08:33:29        8      100        0        0
08:33:34        8      100        0        0
08:33:39        8      100        0        0
08:33:44        8      100        0        0
Average         8      100        0        0

The -M option is always used in conjunction with -u and/or -q. It causes the metrics to be broken down by processor, so you can see how each processor is being utilized. The -c option shows the total number of system calls being executed per second and singles out four specific system calls for further detail. They are the read(), write(), fork(), and exec() system calls. Also reported on this display is the average number of characters transferred in or out each second. An example of this output follows:

# sar -c 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24  scalls/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
08:33:29       332        3        9    0.00    0.00    38630     2657
08:33:34       435        4       24    0.00    0.00    30310     2662
08:33:39       270        3       14    0.00    0.00     6758        0
08:33:44       524       20       15    0.20    0.20    73523        0
Average        390        7       15    0.05    0.05    37187     1331


5–19. SLIDE: timex Command

Student Notes The timex command can be used to benchmark how long the execution of a particular process takes in seconds. The command measures:

• real time: the amount of elapsed time from when the program started to when the program completed (sometimes referred to as the "wall clock" time).

• user time: the amount of time spent by the program executing in user mode.

• sys time: the amount of time spent by the program executing in kernel mode.

The example on the slide shows a total of 25.65 seconds elapsed from when the program prime_med started to when it completed. The execution spent 20.71 seconds executing in user mode and 3.43 seconds executing in kernel mode. The difference between user + system and real time is attributed to time the process spent not running on the CPU. The process may not get CPU time either because it was waiting on some resource (like disk or CPU) or because it was in a sleep state waiting for an event (like a child process waiting to finish executing).

timex Command

$ timex prime_med

real       25.65
user       20.71
sys         3.43
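The real/user/sys breakdown can be reproduced on any POSIX system. The sketch below is a rough timex analogue (an illustration, not HP-UX code) that uses the child CPU accounting from os.times():

```python
import os
import time

def timex(argv):
    """Rough timex(1) analogue: wall-clock time versus the CPU time a
    child command spends in user and kernel mode."""
    before = os.times()
    start = time.monotonic()
    os.spawnvp(os.P_WAIT, argv[0], argv)   # run the command and wait
    real = time.monotonic() - start
    after = os.times()
    user = after.children_user - before.children_user
    sys_ = after.children_system - before.children_system
    return real, user, sys_
```

The difference real - (user + sys) is the time the child spent off the CPU, waiting or sleeping, exactly as described above.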


5–20. SLIDE: Tuning a CPU-Bound System — Hardware Solutions

Student Notes Practically speaking, the easiest performance gains are usually achieved by adding more and faster hardware. This could mean upgrading to a faster processor, upgrading to a processor with more cache, adding another processor, or buying another system and off-loading some applications to it. Upgrading to a faster processor may be possible with a simple module swap, but more likely it would involve upgrading your entire system to a newer model. Some systems support two or three different processor modules, and yours may not have the fastest available; if so, you may be able to upgrade the system's processors to faster versions without touching the rest of the system. Nowadays, it is unlikely that you will be able to upgrade the cache memory or TLB to larger sizes. Each processor chip comes with a predetermined amount of cache and a fixed-size TLB; only by moving to a different processor chip (and thus a different model) can you change the cache and TLB sizes. If your system is not yet at its full complement of processors, adding more processors may relieve your workload. If you have a cell-based architecture, you may be able to add more processors to each cell, or even add more cells.

Tuning a CPU-Bound System — Hardware Solutions
• Upgrade to a faster processor
• Upgrade the system with a larger data/instruction cache
• Add a processor to a multiprocessor system
• Spread applications to multiple systems


Some servers come with extra processors installed, but not enabled. These systems have a feature called ICOD (Instant Capacity On Demand). By simply contacting HP, these disabled processors can be enabled, giving you more processing power with a minimum of delay. If, at a later date, those processors are no longer needed, they can be disabled in a similar fashion. Finally, if you have a system which is heavily loaded and another which is lightly loaded, it may be possible to transfer some of the tasks from the busy system to the one which is less busy. The disadvantage of these solutions is that most of them cost money.


5–21. SLIDE: Tuning a CPU-Bound System — Software Solutions

Student Notes If the easiest performance gains come from upgrading the hardware, then the greatest gains are likely to come from improving the software. A system with the fastest and most current hardware can still run slowly if the software is not configured properly. One way to improve the performance of specific processes is to raise their priority. You can do this by lowering a process's nice value (anti-nice), by making the process a real-time process, or by raising the nice value of less important processes. Be careful when promoting a process to real time: if the process is not well-behaved, it can take over your entire system. By well-behaved, we mean that it is not compute bound and it is free of serious bugs. Running batch jobs at non-peak hours has been a standard performance solution for many years on many systems. Other software performance improvements can be realized by using PRM (Process Resource Manager), WLM (Workload Manager), or the mpctl() system call.

Tuning a CPU-Bound System — Software Solutions
• Nice less important processes
• Anti-nice more important processes
• Consider using rtprio or rtsched on most important processes
• Run batch jobs during non-peak hours
• Consider using PRM/WLM
• Consider using the processor affinity call mpctl()
• Optimize/recompile application
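The first bullet, demoting less important work, can be sketched portably. The helper below (a hypothetical name, not an HP-UX tool) forks a child that raises its own nice value before doing its work; a positive increment lowers the child's CPU priority:

```python
import os

def run_niced(increment, fn):
    """Run fn() in a forked child whose nice value has been raised by
    increment (a positive increment lowers the child's CPU priority).
    Returns (child_nice_value, fn_result) as reported over a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: demote ourselves, then work
        os.close(r)
        new_nice = os.nice(increment)
        os.write(w, f"{new_nice}:{fn()}".encode())
        os._exit(0)
    os.close(w)                       # parent: collect the child's report
    report = os.read(r, 4096).decode()
    os.close(r)
    os.waitpid(pid, 0)
    nice_value, result = report.split(":", 1)
    return int(nice_value), result
```

An unprivileged user can only raise a nice value; lowering one (anti-nice) requires superuser privilege, which is why renice to a negative value must be done as root.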


5–22. SLIDE: CPU Utilization and MP Systems

Student Notes The sar command can be utilized to report CPU utilization for the overall system on a per-processor basis (when the -u and -M options are specified). In addition the -q option will report average run queue length while occupied, and percent of time occupied. Both of these metrics can assist in the evaluation of CPU loading and should be considered before making processor affinity calls. top can also show you how your CPU resource is being distributed over the system. It automatically breaks down the load and utilization percentages on a per-processor basis when invoked. Remember, when you are running a system that supports Partitions (NPars or VPars), these tools only show you what is happening within a partition, as each partition has booted its own copy of the operating system and is acting as an independent system.


Is each processor pulling its weight?

The sar -uqM command string can help you monitor the CPU loading on the individual processors in an MP system.

CPU Utilization and MP Systems


5–23. SLIDE: Processor Affinity

Student Notes The mpctl() system call provides a means for determining how many processors are installed in the system (or partition), how many processors are in this pset, and assigning processes or threads to run on specific processors (also known as processor affinity) or within specific psets, and much, much more. Refer to the man page for mpctl() on your system. Much of the functionality of this capability is highly dependent on the underlying hardware. An application that uses this system call should not be expected to be portable across architectures or implementations. Processor sets are supported by the pset() system call. If your version of the operating system supports psets, refer to the man page for pset() for full details.

Processor Affinity


The mpctl() system call assigns the calling process to a specific processor.


5-24. LAB: CPU Utilization, System Calls, and Context Switches

Directions

General Setup

Create a working data file in a separate file system (on a separate disk, if possible). If another disk is available:

# vgdisplay -v | grep Name        (note which disks are already in use by LVM)
# ioscan -fnC disk                (note any disks not mentioned above; select one)
# pvcreate -f <raw disk device file>
# vgextend vg00 <block disk device file>

In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs <block disk device file>
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file <75% of main memory in bytes>

The lab programs are under /home/h4262/cpu/lab0

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise, results are unpredictable. If the executables are missing, generate them by typing:

# make all

CPU Utilization: System Call Overhead

The dd command lets us set the size of each read and write operation. Varying the block size changes the number of system calls needed to transfer the same amount of data, which exposes the overhead of the system call interface. The first command loads the entire file into the buffer cache.

# timex dd if=/stand/vmunix of=/dev/null bs=64k

Now we take our measurements.

# timex dd if=/stand/vmunix of=/dev/null bs=64k

real ____  user __________  system ____________


# timex dd if=/stand/vmunix of=/dev/null bs=2k

real ____  user __________  system ____________

# timex dd if=/stand/vmunix of=/dev/null bs=64

real ____  user __________  system ____________
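The effect this lab measures, the same data costing more system calls at smaller block sizes, can be sketched portably (the helper below is illustrative, not part of the lab kit):

```python
def count_reads(path, bs):
    """Number of read() calls needed to consume a file bs bytes at a
    time (buffering=0 maps each .read(bs) to one read(2) system call
    on a regular file)."""
    reads = 0
    with open(path, "rb", buffering=0) as f:
        while f.read(bs):
            reads += 1
    return reads
```

For a 1 MiB file, a 64 KiB block size needs 16 reads while a 64-byte block size needs 16384: the same data with 1024 times the system call traffic, which shows up as system time.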

System Calls and Context Switches

This lab shows you the maximum system call and context switch rates that your system can sustain. Three programs are supplied:

• syscall       loads the system with system calls of one type
• filestress    (shell script) generates file system-related system calls
• cs            loads the system with context switches

1. What is the system call rate when your system is "idle"? ________________

2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e.:

   # sar -c 10 4

3. Terminate the filestress process by entering the following commands:

# kill $(ps -el | grep find | cut -c24-28)
# kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?

_____________________________________________________________________

Kill the syscall program before proceeding.

# kill $(ps -el | grep syscall | cut -c18-22)

5. Using cs, compare the number of context switches on an idle system and a loaded system.

   Idle ________    Loaded ______________

6. Kill the cs program, remove the /vxfs/file, and dismount the /vxfs filesystem.

# kill $(ps -el | grep cs | cut -c18-22)
# rm -f /vxfs/file
# umount /vxfs


5–25. LAB: Identifying CPU Bottlenecks

Directions The following labs are designed to show the symptoms of a CPU bottleneck.

Lab 1

1. Change directory to /home/h4262/cpu/lab1

# cd /home/h4262/cpu/lab1

2. Start the processes running in the background.

# ./RUN

3. Start a glance session and answer the following questions.

What is the CPU utilization? _______

What are the nice values of the processes receiving the most CPU time? _______

What is the average number of jobs in the CPU run queue? ______

4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU hogs? Memory hogs? Disk I/O hogs, etc.? Identify processes that you think are in pairs.

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______

6. Compare your results to the baseline established in the lab exercise in module 1, step 5.

7. End the CPU load by executing the KILLIT script.

# ./KILLIT


Lab 2

1. Change directory to /home/h4262/cpu/lab2.

# cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

# ./RUN

3. In one terminal window, start glance.

In a second terminal window, run:

# sar -u 5 200

Answer the following questions:

What does glance report for CPU utilization? _______

What does sar report for CPU utilization? ________

What is the priority of the process receiving the most CPU time? _______

How much time is the process spending in the sigpause system call? ______

How is the process being context switched (forced or voluntary)? ______

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______

5. End the CPU load by executing the KILLIT script.

# ./KILLIT


Module 6 Memory Management

Objectives

Upon completion of this module, you will be able to do the following:

• Describe how the HP-UX operating system performs memory management.

• Describe the main performance issues that involve memory management.

• Describe the UNIX buffer cache.

• Describe the sync process.

• Identify the symptoms of a memory bottleneck.

• Identify global and process memory metrics.

• Use performance tools to diagnose memory problems.

• Specify appropriate corrections for memory bottlenecks.

• Describe the function of the serialize command.


6–1. SLIDE: Memory Management

Student Notes Memory management refers to the subsystem within the kernel that is responsible for managing the main memory (also known as RAM) of the computer. When managing main memory, the kernel allocates memory pages (default size is 4 KB) to processes as they need space. When main memory runs low on free space, the kernel will try to free up some pages in memory by copying those pages out to swap space on disk. The swap space can be thought of as an extension of main memory (like an overflow area) that is used when main memory becomes full. Processes paged out to the swap area cannot be referenced again until they are paged back in to main memory. The term virtual memory refers to how much memory the kernel perceives as being available for allocation to processes. When the kernel allocates space to a process, it must track that page for the life of the process. Virtual memory includes main memory and swap space, as pages allocated to processes may be moved to swap space.

Example

In the slide, there are three different processes being tracked: a one-page process, a two-page process, and a three-page process. The one-page process started in main memory and was

Memory Management

(Diagram: main memory plus swap space together make up virtual memory)


subsequently paged to swap space. The two-page process is entirely resident in main memory. And the three-page process has been partially paged to swap space (two of three pages are on swap). From a virtual memory standpoint, the three processes are taking up six pages of memory: three pages in main memory and three pages on swap. The preceding example is pretty simple. Reality is a little more complex. Processes actually consist of two basic types of pages, text and data. The data pages have "write" capabilities, and thus their contents must be preserved when they are moved out of memory (to swap space). The text pages cannot be modified by the executing program. They are initially read in from the file system. If the memory manager wants to release the space that a text page occupies, it does not have to copy the page out to swap, or even back to the file system; because text pages are unmodified, they can simply be discarded and re-read from the program file later if needed.


6–2. SLIDE: Memory Management — Paging

Student Notes The vhand daemon is responsible for keeping a minimum amount of memory free on the system at all times. The vhand daemon does this by monitoring free pages and trying to keep their number above a threshold to ensure sufficient memory for efficient demand paging. The vhand daemon utilizes a "two-handed" clock algorithm, as seen on the slide. The first hand (also known as the "reference" hand or "age" hand) clears reference bits on a group of pages in an active part of memory. If the bits are still clear by the time the second hand (also known as the "free" hand or "steal" hand) reaches them, the pages are paged out. The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency with which vhand runs. In essence, the distance between the hands determines how aggressively vhand behaves. It behaves more aggressively as the memory pressure increases.
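The clear-then-steal logic can be shown as a toy, single-revolution simulation (the function and its simplified model are illustrative assumptions, not kernel code):

```python
def vhand_sweep(n_pages, touched_between_hands):
    """One revolution of the two-handed clock over n_pages pages.

    The age hand clears each page's reference bit; a page listed in
    touched_between_hands is referenced again (bit set back to 1) before
    the trailing free hand arrives, so it survives.  Returns freed pages.
    """
    freed = []
    for page in range(n_pages):
        bit = 0                              # age hand clears the bit
        if page in touched_between_hands:
            bit = 1                          # referenced again in the gap
        if bit == 0:
            freed.append(page)               # free hand steals the page
    return freed
```

Pages touched again between the two hands keep their reference bits and survive; the rest are stolen. Shrinking the gap between the hands gives processes less time to re-reference their pages, which is why a smaller gap means more aggressive page stealing.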

Memory Management — Paging

(Diagram: the vhand process sweeps a bitmap of memory page reference bits with a leading reference hand and a trailing free hand)

1 = page is being referenced
0 = page is NOT being referenced
F = memory page freed by the vhand process


6–3. SLIDE: Paging and Process Deactivation

Student Notes The system uses a combination of paging and deactivation to manage the amount of free memory. A minimum amount of free memory is needed to allow the demand paging system to work properly. No paging occurs until free memory falls below a threshold called LOTSFREE. Upon falling below LOTSFREE, paging occurs at a minimum level, becoming more aggressive as the number of free pages decreases. If the demand for memory continues, then paging continues. However, if the demand for memory subsides, then there is a possibility that the amount of free memory will stabilize below the LOTSFREE threshold. If free memory falls below a second threshold called DESFREE, then there is no possibility of stabilization (until free memory goes back above DESFREE), and the paging rate becomes much more aggressive compared to the initial paging rate. Finally, if free memory falls below MINFREE, then process deactivation begins. A process is chosen by the kernel to be deactivated, and it is placed on the deactivation queue. Because the process is deactivated (and therefore its pages are not being referenced), vhand will be able to page all of its pages (including the uarea) out to the swap partition. The process will be

Paging and Process Deactivation

(Diagram: the paging scanning rate rises as free memory pages fall toward 0 MB of non-kernel memory)

• Below LOTSFREE: paging begins, with the possibility of stabilization.
• Below DESFREE: paging continues at the maximum rate, with no possibility of stabilization.
• Below MINFREE: process deactivation begins to occur.


reactivated automatically once free memory rises above MINFREE. When a process is reactivated, only the uarea is immediately paged in. Other pages are faulted in as needed. Below are the default formulae for LOTSFREE, DESFREE, and MINFREE (NKM = Non-Kernel Memory):

              <= 32 MB          >= 32 MB, <= 2 GB     > 2 GB
LOTSFREE      1/8 of NKM        1/16 of NKM           64 MB
DESFREE       1/16 of NKM       1/64 of NKM           12 MB
MINFREE       1/2 of DESFREE    1/4 of DESFREE        5 MB

NOTE: The values of LOTSFREE, DESFREE, and MINFREE were made tunable kernel parameters in HP-UX 11.00. Prior to the 11.00 release, these values were fixed and could not be changed. It is recommended by HP, however, that the parameters not be tuned manually.
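The default formulae translate directly into code. This sketch is illustrative (not HP-UX source); since the printed ranges overlap at exactly 32 MB, it applies the first column there. All values are in MB:

```python
def paging_thresholds(nkm_mb):
    """Default LOTSFREE, DESFREE, and MINFREE (in MB) for a given
    amount of non-kernel memory (NKM), following the table above."""
    if nkm_mb <= 32:
        lotsfree = nkm_mb / 8
        desfree = nkm_mb / 16
        minfree = desfree / 2
    elif nkm_mb <= 2048:
        lotsfree = nkm_mb / 16
        desfree = nkm_mb / 64
        minfree = desfree / 4
    else:
        lotsfree, desfree, minfree = 64, 12, 5
    return lotsfree, desfree, minfree
```

For example, a system with 1 GB of non-kernel memory starts paging below 64 MB free, pages aggressively below 16 MB, and begins deactivating processes below 4 MB.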


6–4. SLIDE: The Buffer Cache

Student Notes The buffer cache exists to speed up file system I/O. The system tries to minimize disk access by going to disk as infrequently as possible, because disk access is often a bottleneck on most systems. Therefore, the most recently- or commonly-accessed files from disk persist in the portion of memory called the buffer cache. It is called dynamic because the size of the buffer cache grows or shrinks dynamically, depending on competing requests for system memory. Its minimum size is governed by the tunable parameter dbc_min_pct, and it cannot grow larger than the size specified in dbc_max_pct. These two parameters are expressed as percentages of total physical memory on the system. Let's say dbc_min_pct is set to 10, while dbc_max_pct is 50. This means that initially 10% of physical memory is allocated to the buffer cache. As the system needs more space to buffer files read in from disk, the buffer cache will allocate more memory, and this will continue until it occupies 50% of memory, its maximum size. Later, when the system requires more memory for another use, say processes, the buffer cache could shrink an appropriate amount, but will never be less than the 10% minimum value. Therefore, a larger buffer cache is able to hold more files and will minimize their access time but will leave less memory available for other uses.

The Buffer Cache

• Pool of memory designed to retain the most commonly accessed files from disk

• Used only for file system I/O (not raw I/O)

• Size of buffer cache controlled by dbc_min_pct and dbc_max_pct



NOTE: The buffer cache is dynamic in nature only when two other tunable parameters, bufpages and nbuf, are both set to their default values of 0.

Another example: if dbc_min_pct and dbc_max_pct are both set to the same value, say 20, the kernel will always use exactly that percentage of physical memory for the buffer cache.
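The percentage bounds translate into sizes directly; a minimal sketch, assuming the dynamic case above (bufpages and nbuf both 0):

```python
def buffer_cache_bounds(phys_mem_mb, dbc_min_pct=10, dbc_max_pct=50):
    """(min, max) memory in MB that the dynamic buffer cache may
    occupy, given the two tunable percentages of physical memory."""
    return (phys_mem_mb * dbc_min_pct / 100.0,
            phys_mem_mb * dbc_max_pct / 100.0)
```

With 2048 MB of physical memory and the defaults of 10 and 50 used in the example, the cache floats between roughly 205 MB and 1024 MB, so up to half of memory can be unavailable to processes under heavy file system load.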


6–5. SLIDE: The syncer Daemon

Student Notes For disk writes, data flows from the buffer cache to disk. How does it get to the buffer cache? The kernel writes data to it. The syncer process takes care of flushing data in the buffer cache to the files on the disk. When a user edits a file, makes changes to that file, and saves the changes, those changes do not go to disk right away. The kernel writes the data to the buffer cache, and some time later (within 60 seconds) the data finally arrives at the disk. This time period is chosen as a balance between ensuring that the file system is fairly up-to-date in case of a crash and efficiently performing disk I/O. There are many applications that do not rely on the operating system's built-in processes to flush data to disk, but instead take over that operation themselves. In other words, they create their own buffers and manage the flushing at appropriate intervals. A common example is a database application that needs to guarantee the completion of a transaction within a specified time interval.

The syncer Daemon

• All entries stay in the buffer cache for a minimum of 30 seconds before being flushed.

• The syncer daemon runs once every 6 seconds and flushes 20% of the buffer cache to disk.
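The two bullets imply a simple turnover bound; the arithmetic below is an illustration of the slide's numbers, not a kernel algorithm:

```python
def full_flush_time(interval_s=6, fraction_per_pass=0.2):
    """Seconds for the syncer to cover the entire buffer cache when it
    flushes fraction_per_pass of the cache every interval_s seconds."""
    passes = round(1 / fraction_per_pass)
    return passes * interval_s
```

Five 6-second passes cover the whole cache, which matches the 30-second minimum residency stated above: once a dirty buffer is old enough, the next pass over its portion of the cache writes it to disk.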



6–6. SLIDE: IPC Memory Allocation

Student Notes UNIX implements interprocess communications using different mechanisms. Three mechanisms that require additional system memory are semaphores, shared memory, and message queues.

• Semaphores are used to synchronize access to shared resources between competing processes.

• Shared memory segments are resources capable of holding (in memory) large amounts of data that can be shared between processes.

• Message queues hold strings of information (messages) that can be transferred between processes. Two types of processes that utilize message queues are networking and database processes.

Shared memory provides a mechanism to reduce interprocess communication costs significantly. Two processes that want to share data map the same portion of shared memory into their address spaces. Changes made to the shared memory are seen immediately by all attached processes and do not require kernel services. So from a kernel perspective, other than initially setting up the shared memory, there is very little cost in using shared memory.
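HP-UX uses System V segments (created with shmget() and mapped with shmat(), as listed by ipcs below). As a portable illustration of the same zero-copy idea, here is a sketch using Python's multiprocessing.shared_memory, a POSIX-style analogue rather than the SysV API:

```python
from multiprocessing import shared_memory

# "Writer" side: create a named segment and place data in it.
seg = shared_memory.SharedMemory(create=True, size=16)
seg.buf[:5] = b"hello"

# "Reader" side (normally another process): attach by name; the data is
# visible immediately, with no copy through the kernel.
view = shared_memory.SharedMemory(name=seg.name)
data = bytes(view.buf[:5])

view.close()
seg.close()
seg.unlink()   # analogous to ipcrm: remove the segment when finished
```

As with SysV segments, a named segment outlives its creator until it is explicitly removed, which is why orphaned segments must sometimes be cleaned up by hand.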

IPC Memory Allocation


# ipcs -mob
IPC status from /dev/kmem as of Sat Feb 14 06:53:27 1998
T    ID     KEY         MODE         OWNER   GROUP   NATTCH   SEGSZ
Shared Memory:
m     5  0x06347849  --rw-rw-rw-     root    root         0   77384
m     7  0x000c0568  --rw-------     root    root         2  131516


On the slide, each process has a shared memory segment that references one and the same shared memory area. The more processes that allocate shared memory segments, the higher the memory usage. The shared memory segments in physical memory can be viewed with the ipcs -mob command or a reporting tool like glance. From time to time, they might have to be cleaned up or removed manually if an application terminates ungracefully. This is done by the superuser with the ipcrm command. A worthwhile baseline measurement for a system administrator is to run the ipcs -mob command during a quiet period. It is also eye-opening to repeat this command when the system is at its busiest.


6–7. SLIDE: Memory Metrics to Monitor — Systemwide

Student Notes The utilization of memory can be monitored in a number of different ways. There are multiple tools and multiple metrics that monitor memory usage. The first metrics you want to look at are those that will tell you whether vhand is active.

Pages Scanned by vhand

This is the number of pages the vhand process has scanned (i.e., had their reference bits cleared by the reference hand) when looking for pages to free in memory. This tells you that vhand is actively scanning pages in an attempt to free them up. There is some memory pressure.

Pages Freed by vhand

This is the number of pages the vhand process has freed (i.e., the reference bit was still clear when the free hand reached it). The ratio between pages scanned and pages freed indicates how successful the vhand process is when looking for memory pages to free.

Memory Metrics to Monitor — Systemwide

• Is vhand active?
  - Pages scanned by vhand (SR)
  - Pages freed by vhand (FR)
  - Pages paged out
• Is swapper active?
  - Processes deactivated (SO)
  - Amount of free memory relative to:
    - lotsfree
    - desfree
    - minfree
• Size of dynamic buffer cache
• Size of IPC shared memory segments


Amount of Paging

This indicates the level of disk activity to the swap partition. If a consistent amount of paging to swap space is occurring, then performance is impacted (most likely significantly). Next, check to see if the swapper is active.

Process Deactivations

This indicates that processes are being deactivated, meaning free memory has fallen below the MINFREE threshold. There is severe memory pressure.

Amount of Free Memory

This indicates the severity of the free memory situation. If free memory has fallen below LOTSFREE, then we know some paging has taken place. vhand is active. If it is below DESFREE, then the situation is more severe, and much more paging is occurring. vhand is aggressively active. Finally, if free memory is below MINFREE, then a high level of paging and process deactivation is occurring. vhand and swapper are both active. To determine what the values are for lotsfree, desfree, and minfree, use the following commands:

# echo "lotsfree/D" | adb -k /stand/vmunix /dev/mem
# echo "desfree/D" | adb -k /stand/vmunix /dev/mem
# echo "minfree/D" | adb -k /stand/vmunix /dev/mem

The settings for these three values in the kernel will then be displayed in 4K pages. You can then compare them to the current size of the free page list. These values will not change, unless you change the size of Non-Kernel Memory. (Remember the formulas shown earlier?)

Size of Dynamic Buffer Cache

This is the amount of memory being consumed by the buffer cache. If memory is full and the buffer cache is large, it will most likely cause paging, since the buffer cache typically shrinks slower than the rate at which new memory is needed. Heavy disk I/O demands may prevent the buffer cache from shrinking at all.

Size of IPC Shared Memory Segments

This is the amount of memory used for interprocess communications. Of special interest will be the number and sizes of shared memory segments, as these can be quite large, especially if graphical applications or a database management system is being used.


6–8. SLIDE: Memory Metrics to Monitor — per Process

Student Notes

Individual processes vary greatly in terms of the amount of memory they use. Metrics to monitor memory utilization on a per-process basis include the following:

Size of RSS/VSS

The Resident Set Size (RSS) for a process is the portion of the process (in KB) that is currently resident in physical memory. Since the entire process does not have to be resident in memory in order to execute, this shows how much of the process is actually in memory. The Virtual Set Size (VSS) for a process is the total size of the process (in KB); it indicates how much memory the process would consume if it were loaded in its entirety. Very rarely is the entire process resident in memory; if it were, the RSS value would equal the VSS value.
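The RSS/VSS relationship can be made concrete with a small calculation. The process names and sizes below are hypothetical figures chosen only to illustrate the ratio; glance or ps would supply real numbers.

```shell
# For each (hypothetical) process, compute what fraction of its virtual set
# is actually resident: pct = 100 * RSS / VSS. Columns: name, RSS KB, VSS KB.
awk '{ printf "%-10s rss=%6d kb  vss=%6d kb  resident=%5.1f%%\n",
               $1, $2, $3, 100 * $2 / $3 }' <<'EOF'
netscape 15052 30104
glance    4096  5734
dtterm    6348  9420
EOF
```

A process showing a resident percentage near 100% has its whole image in memory; a low percentage means most of it has never been touched or has been paged out.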

Size of Text, Data, and Stack Segments

These are the RSS and VSS sizes for the three main components of a process. Since every process has a single text, data, and stack segment, these values should be monitored, especially for large processes. The data segment is the most likely to be large.

Memory Metrics to Monitor — per Process

• Size of RSS/VSS

• Size of text, data, and stack segments

• Number of shared memory segments

• Amount of time blocked on virtual memory


Each of these three segments has a maximum size to which it can grow, limited by tunable kernel parameters: maxtsiz, maxdsiz, and maxssiz for a 32-bit process, and maxtsiz_64bit, maxdsiz_64bit, and maxssiz_64bit for a 64-bit process. If a process tries to grow one of these segments beyond its maximum size, the process terminates (and in some cases dumps core).
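kmtune and kctune report these limits in bytes (sometimes as hex constants), so a quick conversion helps when comparing them against the segment sizes glance reports in KB or MB. The value below, 268435456 bytes (0x10000000), is the maxdsiz setting used in the lab at the end of this module.

```shell
# Convert a byte-valued kernel tunable to MB for easy comparison with
# per-process data segment sizes. 0x10000000 bytes = 268435456 bytes.
awk 'BEGIN { printf "maxdsiz = %d bytes = %d MB\n",
             268435456, 268435456 / (1024 * 1024) }'
```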

Number and Size of Shared Memory Segments

These are the shared memory segments to which a process is attached. The maximum size of a shared memory segment is limited by the kernel parameter, shmmax. The number of shared memory segments a process can attach to is limited by the kernel parameter, shmseg.

Amount of Time Spent Blocked on Virtual Memory

This is the amount of time the process was prevented from executing because it was waiting (or blocked) on a text or data page to be paged in.


6–9. SLIDE: Memory Monitoring vmstat Output

Student Notes

A useful command to view virtual memory statistics is vmstat. The slide shows vmstat's output being updated every 5 seconds. When viewing vmstat's output, always keep an eye on the po (pages paged out) column; ideally, you want this to be zero, indicating no paging out is occurring. The fr (pages freed per second) and sr (pages scanned by the clock algorithm per second) columns show the actual behavior of the vhand algorithm.

Output Headings

procs

r    In run queue
b    Blocked for resources (I/O, paging, and so on)
w    Runnable or short sleeper (less than 20 seconds) but deactivated

Memory Monitoring vmstat Output

#=> vmstat -n 5
VM
memory                page                           faults
   avm  free   re  at  pi  po  fr  de   sr    in    sy   cs
  9140  3824    3   4   0   0   0   0    0   675   824  140
CPU
cpu          procs
 us  sy  id   r    b  w
  9   5  86   1  100  0
  9017  3500   41  49  11   0   0   0    0  1257  2823  329
 24  17  60   0  100  0
 10292  2255   65  20  41   0   0   0    0  1419  3795  481
 67  24   9   5  102  0
 10227   976   89  19  85   0   0   0    0  1698  4771  641
 67  33   0   7  103  0
 10958   400   81  12  91  48  26   0  194  1791  5847  697
 67  31   3   8  110  0
 10759   454   33   3  98  51  24   0  268  1598  4313  598
 62  20  18   6  111  0
 13448   404   21   0  65  74  39   0  282  1021  3175  354
 32  15  53   0  118  0


memory

avm   Active virtual pages (belonging to processes that ran during the last 20 seconds)
free  Size of the free list (in 4K pages)
re    Page reclaims per second
at    Address translation faults per second (page faults)
pi    Pages paged in per second
po    Pages paged out per second
fr    Pages freed per second
de    Anticipated short-term memory shortfall
sr    Pages scanned by the clock algorithm per second

faults

in    Non-clock device interrupts per second
sy    System calls per second
cs    CPU context switches per second

CPU

us    Percentage of time the CPU spent in user mode
sy    Percentage of time the CPU spent in system mode
id    Percentage of time the CPU was idle

With the -S option:

si    Processes reactivated per second
so    Processes deactivated per second
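The rule of thumb above (watch po, ideally zero) is easy to apply with a filter. The sample rows below are illustrative vmstat VM lines (columns: avm free re at pi po fr de sr), not live data; in practice you would pipe real `vmstat 5` output through something like this.

```shell
# Average the po column (pages paged out per second, field 6 of the VM line)
# over a set of vmstat samples. A sustained nonzero average means the system
# is paging out and memory pressure deserves attention.
awk '{ sum += $6; n++ } END { printf "average po = %.1f pages/sec\n", sum / n }' <<'EOF'
 9140 3824  3  4  0  0  0 0   0
10958  400 81 12 91 48 26 0 194
13448  404 21  0 65 74 39 0 282
EOF
```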


6–10. SLIDE: Memory Monitoring glance — Memory Report

Student Notes

glance has extensive memory monitoring abilities. Like vmstat, it can give paging statistics, in addition to showing whether any processes are being deactivated. Remember, deactivation is an indication of severe memory shortage. There is other valuable information on this report, such as the statistics at the bottom showing the current dynamic buffer cache size, the current amount of free memory, and the total physical memory in the system.

Memory Monitoring glance — Memory Report

B3692A GlancePlus B.10.12   17:33:59  e2403roc  9000/856    Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  22%   29%   51%
Disk Util  |   1%    7%   13%
Mem  Util  |  91%   91%   91%
Swap Util  |  25%   24%   35%
--------------------------------------------------------------------------------
MEMORY REPORT                                                        Users=   19
Event            Current  Cumulative  Current Rate  Cum Rate  High Rate
--------------------------------------------------------------------------------
Page Faults           78         287           7.5      24.3      139.3
Paging Requests        3          21           0.2       1.7       12.0
KB Paged In         52kb       336kb           5.0      28.4      189.3
KB Paged Out         0kb         0kb           0.0       0.0        0.0
Reactivations          0           0           0.0       0.0        0.0
Deactivations          0           0           0.0       0.0        0.0
KB Reactivated       0kb         0kb           0.0       0.0        0.0
KB Deactivated       0kb         0kb           0.0       0.0        0.0
VM Reads               3           6           0.2       0.5        2.0
VM Writes              0           0           0.0       0.0        0.0

Total VM : 78.9mb   Sys Mem  : 10.6mb   User Mem: 78.0mb   Phys Mem: 128.0mb
Active VM: 23.4mb   Buf Cache: 19.1mb   Free Mem: 20.3mb

Page 1 of 1



6–11. SLIDE: Memory Monitoring glance — Process List

Student Notes

The glance Process List report can be used to monitor process statistics, including how much memory processes are currently consuming. The highlighted column, RSS (Resident Set Size), shows memory being used on a per-process basis. Very simply put, this helps to identify the "memory hogs" on the system. For example, the process called netscape has an RSS of 14.7 MB, while statdaemon's is minimal. Other large processes include glance, xload, and dtterm. What do all these processes have in common? They are all GUI (graphical user interface) programs running as windows in a graphical window environment. Moral: programs that open their own windows are relatively memory-intensive and should be minimized. Users should be encouraged not to leave several windows open on their screens if they do not have a continuing need for them.

Memory Monitoring glance — Process List

B3692A GlancePlus B.10.12   14:52:27  e2403roc  9000/856    Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  22%   29%   51%
Disk Util  |   1%    7%   13%
Mem  Util  |  91%   91%   91%
Swap Util  |  25%   24%   35%
--------------------------------------------------------------------------------
PROCESS LIST                                                         Users=   11
                                     User      CPU Util         Cum  Disk           Thd
Process Name   PID   PPID  Pri       Name     (100 max)   CPU   IO Rate    RSS   Cnt
--------------------------------------------------------------------------------
netscape     16013  12988  154     sohrab     12.9/14.0   64.9  0.0/ 0.6  14.7mb   1
supsched        18      0  100       root      2.9/ 2.1  942.6  0.0/ 0.0    16kb   1
lmx.srv       1219   1121  154       root      1.6/ 0.9  389.4  0.5/ 0.0   2.7mb   1
glance       15726  15396  156       root      0.6/ 0.9    2.0  0.0/ 0.2   4.0mb   1
statdaemon       3      0  128       root      0.6/ 0.7  302.1  0.0/ 0.0    16kb   1
midaemon      1051   1050   50       root      0.4/ 0.4  201.4  0.0/ 0.0   1.3mb   2
ttisr            7      0  -32       root      0.4/ 0.3  121.0  0.0/ 0.0    16kb   1
dtterm       15559  15558  154        roc      0.4/ 0.4    1.6  0.0/ 0.0   6.2mb   1
rep_server    1098   1084  154       root      0.2/ 0.1   23.7  0.0/ 0.0   2.0mb   1
syncer         325      1  154       root      0.2/ 0.0   20.2  0.1/ 0.0   1.0mb   1
xload        13569  13531  154         al      0.2/ 0.0    2.4  0.0/ 0.0   2.6mb   1

Page 1 of 13



6–12. SLIDE: Memory Monitoring glance — Individual Process

Student Notes

The glance Individual Process report displays memory usage for an individual process, including the RSS and VSS sizes for the process. Also displayed on a per-process basis are the VM reads and VM writes being performed by the process. These indicate how much paging from/to the swap device the individual process is performing. If performance is poor for an individual process, this is a good field to check.

Memory Monitoring glance — Individual Process


B3692A GlancePlus C.03.70.00  15:52:03  r206c42  9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  15%   15%   15%
Disk Util  |   1%    0%    2%
Mem  Util  |  96%   96%   96%
Swap Util  |  15%   15%   15%
--------------------------------------------------------------------------------
Resources PID: 28030, glance   PPID: 27993   euid: 0   User: root
--------------------------------------------------------------------------------
CPU Usage (util):  0.1    Log Reads : 1      Wait Reason    : STRMS
User/Nice/RT CPU:  0.1    Log Writes: 0      Total RSS/VSS  : 3.6mb/ 5.6mb
System CPU      :  0.0    Phy Reads : 0      Traps / Vfaults: 1/ 10
Interrupt CPU   :  0.0    Phy Writes: 0      Faults Mem/Disk: 6/ 0
Cont Switch CPU :  0.0    FS Reads  : 0      Deactivations  : 0
Scheduler       : HPUX    FS Writes : 0      Forks & Vforks : 0
Priority        : 154     VM Reads  : 0      Signals Recd   : 0
Nice Value      : 10      VM Writes : 0      Mesg Sent/Recd : 0/ 0
Dispatches      : 6       Sys Reads : 0      Other Log Rd/Wt: 38/ 172
Forced CSwitch  : 0       Sys Writes: 0      Other Phy Rd/Wt: 0/ 0
VoluntaryCSwitch: 4       Raw Reads : 0      Proc Start Time
Running CPU     : 0       Raw Writes: 0      Tue Mar 16 15:49:14 2004
CPU Switches    : 0       Bytes Xfer: 0kb

C - cum/interval toggle % - pct/absolute toggle Page 1 of 1


6–13. SLIDE: Memory Monitoring glance — System Tables

Student Notes

The glance System Tables report displays the size of kernel tables in memory, and the current utilization of these tables. It is important not to set the size of these tables too large, as the tables are memory resident (and the bigger the table, the more memory it consumes). Yet it is even more important that enough resources be allocated so that the kernel does not have to wait for a resource to become free (or even error out) when a particular resource is requested. The Available column displays the total size of the particular table, and the Used column shows how many entries within the table are currently being used. In general, the Used value should not be close to the Available value; if it is, the kernel is close to running out of that particular resource. The High column shows the high-water mark for the resource since glance has been running. Also of interest in this report are the buffer cache statistics, especially the Buffer Cache line, which shows the current size of the buffer cache.

Memory Monitoring glance — System Tables


B3692A GlancePlus C.03.70.00  15:58:40  r206c42  9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |  15%   15%   15%
Disk Util  |   0%    0%    4%
Mem  Util  |  96%   96%   96%
Swap Util  |  15%   21%   45%
--------------------------------------------------------------------------------

SYSTEM TABLES REPORT                                                 Users=    1

System Table           Available  Requested     Used     High
--------------------------------------------------------------------------------
Inode Cache (ninode)        2884         na      645      645
Shared Memory             12.5gb              11.1mb
Message Buffers            800kb         na      0kb      0kb
Buffer Cache             314.4mb         na  314.4mb       na
Buffer Cache Min          32.0mb
Buffer Cache Max         320.0mb
DNLC Cache                  8004

Model : 9000/800/A400-6X   Phys Memory : 640.0mb   Network Interfaces : 2
OS Name : HP-UX            Number CPUs : 1         Number Swap Areas  : 2
OS Release: B.11.11        Number Disks: 2         Avail Volume Groups: 1
OS Kernel Type: 64 bits    Mem Region Max Page Size: 1024mb

Page 2 of 2


NOTE: There are two pages to this report. Shown here is the second page of this report. More system tables are shown on the first page.


6–14. SLIDE: Tuning a Memory-Bound System — Hardware Solutions

Student Notes

An obvious hardware solution to a memory bottleneck is to add more physical memory. While this solution requires an outlay of money, it may pay for itself quickly by saving the system administrator hours of time looking for ways to reduce memory consumption. If adding more memory is not an option, a second hardware suggestion is to look at the use of X terminals on the system. An X terminal typically consumes a large portion of memory: 3-4 MB for light application usage, and as much as 10-20+ MB for heavy application usage. These figures do not take into account any additional RAM that the system will use for window managers or any other X-related overhead.

Tuning a Memory-Bound System — Hardware Solutions

• Add more physical memory

• Reduce usage of X terminals


6–15. SLIDE: Tuning a Memory-Bound System — Software Solutions

Student Notes

Quite often, users run X Windows programs to enhance the look of their desktops. Examples include an X-eyes program, a bouncing ball program, or fancy screen savers. All of these graphical programs consume system resources, including memory.

The biggest consumer of memory will most likely be the buffer cache. We saw earlier that if the buffer cache is dynamic, it will grow to its maximum size as long as memory is available. The problem arises when a process needs additional memory while free memory is below LOTSFREE: the buffer cache is slow to shrink (if it shrinks at all), causing paging to occur among the processes. To prevent this situation, the tunable parameter dbc_max_pct should be tuned to limit the maximum size to which the buffer cache can grow. A recommendation for dbc_max_pct is 25 or less.

Programs with memory leaks allocate memory and then stop using it, without returning it to the system for use elsewhere. These programs may require you to shut them down periodically to release the memory; they may even require you to reboot the system occasionally to reclaim the memory. A number of third-party tools, such as Purify, can help you locate memory leaks in applications.

Tuning a Memory-Bound System — Software Solutions

• Look for unnecessary processes
  – Extra windows
  – Screen savers
  – Long strings of child processes

• Reduce dbc_max_pct (max size of dynamic buffer cache).
• Identify programs with memory leaks.
• Check for unreferenced shared memory segments.
• Use the serialize command to reduce process thrashing.
• Use PRM to prioritize memory allocation.


Unreferenced shared memory segments can also be a problem: an application sets one up and then forgets to deallocate it when the application exits. Here is a possible procedure for locating abandoned shared memory segments.

First, look for any shared memory segments that have no processes attached to them:

# ipcs -ma

Note which shared memory segments have a "0" in the NATTCH column. If they are owned by root, let them stay. Otherwise, write down their ID numbers and their CPID numbers.

Second, one at a time, find out whether each creating process still exists:

# ps -el | grep <CPID number>

If it does, the segment is probably just quiescent. But if not, the segment is probably abandoned. Finally, remove the segment:

# ipcrm -m <ID number>
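The first step of the procedure can be automated with awk. Because column layout differs between releases, this sketch locates the NATTCH, OWNER, ID, and CPID columns from the header line rather than hard-coding positions. The `ipcs -ma` output in the here-document is a trimmed, made-up sample for demonstration; on a live system you would pipe the real `ipcs -ma` output through the same filter.

```shell
# Flag shared-memory segments with no attached processes (NATTCH == 0) that
# are not owned by root -- candidates for the ps/ipcrm checks described above.
awk '
/NATTCH/ { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names
$1 == "m" && col["NATTCH"] && $(col["NATTCH"]) == 0 && $(col["OWNER"]) != "root" {
    print "candidate segment ID " $(col["ID"]) ", creator PID " $(col["CPID"])
}' <<'EOF'
T         ID     KEY        MODE       OWNER   GROUP  NATTCH  CPID
m        201 0x4c10abcd --rw-rw-rw-   dbuser     dba       0  2314
m        202 0x00000000 --rw-------     root     sys       0   812
m        305 0x5e22ef01 --rw-rw-rw-   appusr   users       3  4120
EOF
```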

The serialize command will be discussed later in this chapter. You may wish to use PRM to control your memory resource and its allocation.


6–16. SLIDE: PA-RISC Access Control

Student Notes

Since we are discussing system memory and performance, there is one other topic we should consider: hardware-based memory page access control. The processor architecture has several features for ensuring that a process thread cannot access areas of physical memory that are not part of its process space. An in-depth discussion of page access control is presented in the HP-UX training course "Inside HP-UX" (course number H5081S), and we won't attempt to recreate it here. There is one particular aspect of this hardware feature that we will spend some time discussing, though: protection IDs.

Every discrete region of virtual memory assigned to a process (text space, private data space, shared memory space, shared library data space, etc.) is assigned a unique ID "key", called an access key. Any process attempting to access that memory space must have a copy of a matching ID "key", called a protection key. To speed things up, the most frequently or most likely used protection keys are kept in processor registers. (These registers are part of a process thread's "context" and are preserved across switches and interrupts.) The hardware performs the protection check as part of the actual memory access instruction.

PA-RISC Access Control

(Slide diagram: access ID keys are stored in the kernel tables, with the most frequently used keys resident in control registers.)


Now here is the catch: there is only room in the control registers for a limited number of frequently used protection keys. The rest are stored in kernel space in memory management tables, which are accessed when a protection ID fault occurs. The fault handler will search for and find these other "keys" when they are needed, but at the cost of CPU cycles! To better understand the dynamics of this process, consider the following analogy.

The Key Ring

I have many keys to many locks around my home and office. It is not practical to carry all of my keys around with me all the time due to their bulk and weight. To solve this problem I have two key rings. One is small and has only those keys that I need on a daily basis: my car key, house key, desk key, and garage key. The other key ring is large and bulky, with dozens of other miscellaneous keys: my workshop, tool boxes, garden shed, lawnmower (wish I could lose that one!), boat ignition, etc.

This method is a blessing and a curse. When I need to start the car or unlock the front door, the key I need is readily available in my pocket and I can quickly gain access. When I actually have time to go fishing, it is always a hassle to go find my utility key ring and remember to take the boat key with me. (Once I actually hauled the boat all the way to the lake, several miles away from my home, only to realize that I had not remembered the boat key!) To somewhat address this problem, I now move the boat key to my everyday key ring during the summer months (replacing the snow-blower key) and reverse the procedure in the fall.

The HP-UX kernel performs a similar process every time a protection ID fault occurs: the fault handler moves the key it had to search for into the register context of the thread (replacing the least recently used key). PA-RISC 1.x has room for 4 keys in the register context, while PA-RISC 2.x has room for 8 keys. IA-64 has room for at least 16 keys.
Depending on how frequently a process moves from one memory region to another the number of protection ID faults will vary. With the larger number of Protection Registers in the later processors, Protection Register thrashing has become much less a problem than it has been in the past. It should also be noted that shared library regions on 11.x were modified to use a type of "skeleton" key, i.e., a key that always matches so that attempted access to them will never result in a protection ID fault.


6–17. SLIDE: The serialize Command

Student Notes

The serialize(1) command can help if a system has a number of large processes and is experiencing memory pressure. The serialize command allows these big processes to run one after another, instead of all at the same time. By running the processes sequentially rather than in parallel, the CPU can spend more time executing the process code (i.e., user mode) and less time managing the competing processes (i.e., kernel mode).

Thrashing

On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy moving pages in and out that the system spends too much time paging and not enough time running processes. When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, meaning it is doing more overhead work than productive work.

The serialize Command

(Slide diagram: 500 MB of available memory holds the kernel, OS tables, and processes I, J, K, and L, with swap space on disk. Each process is CPU bound, large (400 MB), and runs at timeshare priority.)


How serialize Helps Reduce Thrashing

All processes marked via the serialize command will run serially with other processes marked the same way. The serialize command addresses the problem caused when a group of large processes all try to make forward progress at once, degrading throughput. In such a case, each process constantly faults in its working set, only to have the pages stolen when another process starts running. By using the serialize command to run large processes one at a time, the system can make more efficient use of the CPU, as well as system memory.

Let's look at the example on the slide. We have a system with 500 MB of available memory, and we are trying to execute four processes. Each process is CPU bound, has large memory requirements (400 MB), and has a timeshare priority level. The first process (I) executes, and as it executes, its pages are faulted into memory. At the end of its timeslice (typically 100 ms), it is switched out and process J is started. As process J executes, it pages in a large number of pages, forcing the pages belonging to process I to be paged out. 100 ms later, process J is switched out and process K starts up, pulling its pages into memory and pushing the other processes' pages out. The system spends so much time pulling pages in and pushing pages out that it literally has no time left to perform any useful work.

The culprit here is the timeslice. We could simply disable timeslicing altogether via the tunable parameter (timeslice), but that may be overkill; after all, it's just these four processes that are causing the thrashing. A better solution would be to serialize these processes. When you do that, each process executes until it either voluntarily gives up the CPU or is preempted by a stronger-priority process, which happens much less frequently than the timeslice expires! Thus more real work gets done and much less paging is needed.
In 10.20, the kernel was given the authority to serialize processes automatically, if it detects that memory thrashing is taking place and it can identify which processes are responsible for the thrashing.
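The arithmetic behind the slide's example is worth spelling out. The page-in rate used below is an assumed, optimistic figure chosen for illustration; real rates depend on the disk subsystem.

```shell
# Each 400 MB process must re-fault its working set after being switched out:
# 400 MB / 4 KB pages = 102400 pages. Even at an assumed 10000 page-ins/sec,
# refilling the working set takes about 10 s, against a 100 ms timeslice,
# so nearly all time goes to paging rather than useful work.
awk 'BEGIN {
    pages = 400 * 1024 / 4        # 4K pages in a 400 MB working set
    rate  = 10000                 # assumed page-in rate, pages/sec
    printf "pages per working set: %d\n", pages
    printf "time to refault      : %.1f s (timeslice: 0.1 s)\n", pages / rate
}'
```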


6–18. LAB: Memory Leaks

There are several performance issues related to memory management: memory leaks, swapping/paging, protection ID thrashing, and so on. Let's investigate a few of them.

1. Change directories to /home/h4262/memory/leak:

# cd /home/h4262/memory/leak

Memory leaks occur when a process requests memory (typically through the malloc() or shmget() calls) but doesn't free the memory once it finishes using it. The five processes in this directory all have memory leaks to different degrees.

2. Before starting the background processes, look up the current value for maxdsiz using the kmtune command on 11i v1 and the kctune command on 11i v2. On the rp2430:

# kmtune -lq maxdsiz

On the rx2600:

# kctune -avq maxdsiz

The default maxdsiz on 11i v2 is 1 GB. This will make proc1 very slow in reaching its limits. You can change maxdsiz to a more reasonable number for this lab exercise:

# kctune maxdsiz=0x10000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this
         system.
==> Do you wish to update it to contain the current configuration before
    making the requested change? n
NOTE: The backup will not be updated.
* The requested changes have been applied to the currently running
  system.
Tunable       Value         Expression  Changes
maxdsiz  (before) 1073741824  Default     Immed
         (now)    0x10000000  0x10000000

Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have?

# vmstat 2 2


3. Use the RUN script to start the background processes:

# ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions — fairly quickly, before the memory leaks get too large.

• What is the current amount of free memory?
• What is the size of the buffer cache?
• Is there any paging to the swap space?
• How much swap space is currently reserved?
• Which process has the largest Resident Set Size (RSS)?
• What is the data segment size of the process with the largest RSS?

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while; please be patient. Observe the behavior of the system when this occurs.

• What happens when the process reaches its maximum data size?
• Why does disk utilization become so high at this point?

6. As the other processes grow towards their maximum data segment size, continue to monitor the following:

• Free memory
• Swap space reserved
• The size of the processes' data segments
• The RSS of the processes
• The number of page-outs/page-ins to the swap space


7. Run the two baseline programs, short and diskread.

# timex /home/h4262/baseline/short
# timex /home/h4262/baseline/diskread

How does the performance of these programs compare to their earlier runs?

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes.

• Exit glance.
• Execute the KILLIT script:

# ./KILLIT

• If you changed maxdsiz, change it back:

# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this
         system.
==> Do you wish to update it to contain the current configuration before
    making the requested change? n
NOTE: The backup will not be updated.
* The requested changes have been applied to the currently running
  system.
Tunable       Value         Expression  Changes
maxdsiz  (before) 0x10000000  0x10000000  Immed
         (now)    0x40000000  0x40000000


Module 7 Swap Space Performance

Objectives

Upon completion of this module, you will be able to do the following:

• Describe the difference between swap usage and swap reservation.

• Interpret the output of the swapinfo command.

• Define and configure pseudo swap.

• Define and configure swap space priorities.

• Define and configure swchunk and maxswapchunks.


7–1. SLIDE: Swap Space Management — Simple View

Student Notes

The purpose of swap space is to relieve the pressure on memory when memory becomes too full. When free memory falls below a certain threshold, processes (or parts of processes) are written out to the swap partition on disk in order to free up space in memory for other processes.

For simplicity, the slide assumes each process is 1 MB in size and the amount of available memory for process execution is 20 MB. The slide also assumes (again for simplicity) that each process reserves 1 MB on the swap partition when it executes. Therefore, since 20 processes are currently present in memory (as shown on the slide), 20 MB of swap space has been reserved, 1 MB for each process.

The HP-UX operating system reserves swap space for each process that executes on the system. The reservation of swap space is done so that the operating system knows how much swap space could potentially be needed by all the processes currently running. For example, if all the processes in memory were to be swapped out, the operating system would know it had enough swap space to perform that function.

Swap Space Management — Simple View

(Slide diagram: the CPU and memory, with memory holding the kernel, OS tables, and user processes; a disk holds a 55 MB swap area. Reserved: 20 MB; Used: 0 MB. A new program wants to execute, but there is not enough space for it to fit into memory.)


Analogy

A good analogy for swap space reservation is a hotel that takes room reservations. When a hotel takes a reservation, it subtracts one from the count of available rooms. If a hotel had 55 rooms and took 20 reservations, it would have only 35 rooms still available, even though none of the 55 rooms were currently occupied. The same holds true for swap space: in the above example, a total of 55 MB of swap space exists, and 20 MB of that space is "reserved" by processes currently running in memory, even though none of the processes are currently using the swap space they have reserved.

To take the analogy further, the hotel does not earmark a particular room to satisfy a reservation. Room assignments are made when the occupant shows up at the front desk. Likewise, a swap reservation is not associated with a particular block on the swap device. Only when the kernel actually wants to move a page of memory out to the swap device does it select a block. The kernel knows it has the swap space available; it just doesn't know where that space is until it needs to use it.

Current Situation

In the above slide, all the memory is in use by the 20 processes. Now assume a new program from disk wants to execute. What happens? How does it fit in memory if all the memory is in use?


7–2. SLIDE: Swap Space — After a New Process Executes

Student Notes

Below is the basic sequence of steps that occurs when a new process wants to execute and there is not enough memory available:

1. The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. The process selected is one that is not expected to execute in the near future.

2. Once the process is written to the swap partition, the amount of swap space used is incremented accordingly, and the amount of swap space reserved is decremented by the same amount.

3. The new program that wants to execute reserves swap space for itself. The amount of swap space reserved is incremented accordingly.

4. The new program is copied into memory, and the operating system initializes the process. The new process uses the physical memory that was just freed.

Swap Space — After a New Process Executes

(Slide diagram: the same CPU/memory/disk picture, with steps 1 through 4 annotated as a process is paged out and the new program is loaded. Reserved: 20 MB, Used: 1 MB.)


7–3. SLIDE: The swapinfo Command

Student Notes

The swapinfo command displays important swap-related information, including how much swap space is used and how much swap space is reserved. On today's systems, we recommend that you always use the -m option to display all values in MB rather than the default KB.

The swapinfo -mt command shows information related to device (raw) swap partitions and file system swap space, and their totals, including:

Mb AVAIL     The total amount of swap space available. For file system swap, this value may vary as more swap space is needed.

Mb USED      The current amount of swap space being used.

Mb FREE      The current amount of swap space free. Mb FREE plus Mb USED equals Mb AVAIL.

PCT USED     The percentage of swap space in use on that device.

The swapinfo Command

# swapinfo -mt
             Mb      Mb      Mb  PCT  START/      Mb
TYPE      AVAIL    USED    FREE USED   LIMIT RESERVE PRI  NAME
dev          32       1      31   3%       0       -   1  /dev/vg00/lvol2
localfs      23       0      23    -    none       0   1  /home/paging
reserve       -      20     -20
total        55      21      34  38%       -       0   -


START/LIMIT  Applies only to file system swap. START specifies the starting block within the file system of the paging file. LIMIT specifies the maximum size to which the paging file can grow.

Mb RESERVE   Applies only to file system swap, and only when no limit is given for the maximum size of the paging file. In that case, this value specifies how much file system space to reserve for user files on the file system.

PRI          The priority of the swap area. The highest priority swap areas are used first. Swap priorities range from 0 to 10. (Note: stronger priority swap areas have smaller priority numbers.)

The swapinfo command also shows how much swap space all the processes on the system are currently reserving. This is indicated by the reserve entry. The columns described above for device and file system swap do not apply to the reserve entry in the output of the swapinfo command.

In the example, there are 32 MB of device swap on a raw disk and 23 MB of swap in the /home file system, for a total of 55 MB. 1 MB is in use on the device swap and 20 MB are reserved, leaving 34 MB available.
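For a quick check on a busy system, the totals line of this report can be pulled apart with standard text tools. The sketch below parses the sample output from the slide via a here-document; on a live HP-UX system you would pipe swapinfo -mt into awk instead. The field positions are an assumption based on the sample layout shown above.

```shell
# Extract the PCT USED figure from the "total" line of swapinfo -mt
# output. The slide's sample output is fed in here; on a live system,
# replace the here-document with:  swapinfo -mt | awk ...
pct=$(awk '$1 == "total" { sub(/%/, "", $5); print $5 }' <<'EOF'
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev          32       1      31    3%       0       -    1  /dev/vg00/lvol2
localfs      23       0      23     -    none       0    1  /home/paging
reserve       -      20     -20
total        55      21      34   38%       -       0    -
EOF
)
echo "Total swap used: ${pct}%"
```

A value like this could feed a simple threshold alert in a monitoring script.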


7–4. SLIDE: Swap Space Management — Realistic View

Student Notes

An earlier slide implied that specific space was allocated on a swap device for each process running in memory. The analogy was of a hotel subtracting one from the count of available rooms when a customer phoned in a reservation. As mentioned earlier, no specific space is allocated on a swap device for a reservation. Instead, the kernel maintains a variable called SWAP_AVAIL.

The SWAP_AVAIL variable is initialized at boot to the total amount of swap space available. As each new process begins executing, this variable is decremented by the amount of swap space the process would need if its entire contents were swapped out. When a process terminates, the amount of swap space it reserved is returned to the SWAP_AVAIL variable.

The slide above shows what the SWAP_AVAIL variable would contain when 20 MB worth of processes is executing on the system. Each process has caused the SWAP_AVAIL variable to be decremented, but no specific space has been allocated on the swap partition. No specific swap space is allocated until processes need to be paged out, as shown on the next slide.
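The reservation accounting described above can be sketched as a toy shell simulation. This is illustrative only, not kernel code; the reserve and release function names are invented for the example.

```shell
# Toy model of swap reservation: SWAP_AVAIL starts at the total swap
# size, is decremented when a process starts, and is restored when a
# process exits. No blocks on the swap device are touched.
SWAP_AVAIL=55   # MB of swap, as on the slide

reserve() {     # $1 = MB a starting process must reserve
    if [ "$SWAP_AVAIL" -lt "$1" ]; then
        echo "ERROR: no swap space available" >&2
        return 1
    fi
    SWAP_AVAIL=$((SWAP_AVAIL - $1))
}

release() {     # $1 = MB returned when a process terminates
    SWAP_AVAIL=$((SWAP_AVAIL + $1))
}

i=0
while [ $i -lt 20 ]; do   # 20 one-MB processes start, as on the slide
    reserve 1
    i=$((i + 1))
done
echo "SWAP_AVAIL after 20 reservations: ${SWAP_AVAIL} MB"

release 1                 # one process exits...
reserve 1                 # ...and a new one starts
echo "SWAP_AVAIL now: ${SWAP_AVAIL} MB"
```

Note how a process exiting and another starting leaves SWAP_AVAIL unchanged, which is exactly the behavior described on the next slide.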

Swap Space Management — Realistic View

(Slide diagram: CPU, memory with kernel and OS tables plus processes, and a disk with a program and a 55 MB swap area. Initial allocation: Reserved 0 MB, Used 0 MB, Swap Avail 55 MB. Current allocation: Reserved 20 MB, Used 0 MB, Swap Avail 35 MB. A new program wants to execute; there is not enough memory for the program to fit.)


7–5. SLIDE: Swap Space — After a New Process Executes

Student Notes

This is an updated description of the sequence of events that occurs when a program is executed and not enough memory is available:

• The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. Since no specific swap space has been reserved, swap space is allocated from the strongest priority swap device, first available block.

• Once the process is written to the swap partition, the amount of swap space used is incremented accordingly, and the old program “unreserves” its swap space by incrementing the SWAP_AVAIL variable.

• Then the new program decrements SWAP_AVAIL to reserve its swap space. In effect, the amount of swap space reserved is decremented by the amount of space being moved out to swap space and then incremented by the new reservation amount. In the slide, the process being swapped out causes the USED swap to become 1 MB, causing the SWAP_AVAIL to become 34 MB. Then the old process releases its 1 MB reservation, causing the SWAP_AVAIL to increase back to 35 MB. Finally, the new process starts up and causes the SWAP_AVAIL to decrease from 35 to 34 MB.

Swap Space — After a New Process Executes

(Slide diagram: the same picture, with steps 1 through 3 annotated. Current allocation: Reserved 20 MB, Used 1 MB, Swap Avail 34 MB.)


• The new program is copied into memory, and the operating system initializes the process after it has confirmed that it can successfully reserve the needed swap for the new process (SWAP_AVAIL does not go negative when the swap reservation is made).


7–6. SLIDE: Swap Space — When Memory Equals Data Swapped

Student Notes

The above slide shows the state of the system and the current swap space allocations when 20 MB (all of available memory) has been paged out to the swap partition. The swap partition contains 20 MB worth of processes, which is the size of available memory. The initial 20 MB of processes is shaded in gray to distinguish them from the second 20 MB of processes, which are filled with black. With this color code, we can see that only 4 MB of the original processes are still loaded in memory; everything else (including 4 MB of the 21st through 40th processes) has been paged to the swap partition.

The swap space allocation reflects 20 MB worth of processes that have reserved swap space and 20 MB that is currently in use. This is analogous to a hotel that received 40 room reservations, 20 of which are currently being used. The SWAP_AVAIL variable is down to 15 MB, because the total amount of swap space is 55 MB and 40 MB of that space is reserved or in use.
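The Swap Avail figure on the slide follows directly from the reservation arithmetic. A quick check, with the values taken from the slide:

```shell
# SWAP_AVAIL = total swap - (space reserved + space in use)
total=55; reserved=20; used=20
swap_avail=$((total - reserved - used))
echo "SWAP_AVAIL: ${swap_avail} MB"
```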

Swap Space — When Memory Equals Data Swapped

(Slide diagram: 20 MB of available memory and a 55 MB swap area, with 20 MB of processes paged out. Current allocation: Reserved 20 MB, Used 20 MB, Swap Avail 15 MB.)


7–7. SLIDE: Swap Space — When Swap Space Fills Up

Student Notes

The above slide shows the situation when SWAP_AVAIL equals 0 MB. In this situation, the error message "ERROR: no swap space available" is displayed, even though there is swap space to page an existing process out to the swap partition and thus free up memory for a new program to load. The system reports that no swap space is available because 35 MB of memory has been paged out, and the remaining 20 MB of swap space is reserved by the processes currently executing in memory.

Could this error have been prevented?

From a resource perspective, the new program should be able to execute, because memory is available for the new process. An OS feature called pseudo swap, enabled through a tunable parameter, would have allowed the program to execute under these conditions.

Swap Space — When Swap Space Fills Up

(Slide diagram: 20 MB of available memory and a 55 MB swap area. Current allocation: Reserved 20 MB, Used 35 MB, Swap Avail 0 MB. ERROR: no more swap space.)

Q: Could this error have been prevented?
A: YES!! Use pseudo swap.


7–8. SLIDE: Pseudo Swap

Student Notes

Pseudo swap is HP's solution for large memory customers who do not wish to purchase a large number of disks to use for swap space. The justification for purchasing large memory systems is to prevent paging and swapping; therefore, the argument becomes, "Why purchase a lot of device swap space if the system is not expected to page or swap?"

Pseudo swap is swap space that the operating system recognizes but that does not really exist. Pseudo swap is make-believe swap space: it does not exist in memory, it does not exist on disk, it does not exist anywhere. However, the operating system does recognize it, which means more swap space can be reserved than physically exists.

The purpose of pseudo swap is to allow more processes to run in memory than the swap device(s) could support. It allows the operating system (specifically the SWAP_AVAIL variable) to recognize more swap space, thereby allowing additional processes to start when all physical swap has been reserved. By having the operating system recognize more swap space than physically exists, large memory customers can operate without having to purchase large amounts of swap space that they will most likely never use.

The size of pseudo swap depends on the amount of memory in the system. Specifically, the size is approximately 75% of physical memory.

Pseudo Swap

Definition: Pseudo swap is fictitious, make-believe swap space. It does NOT exist physically, but logically the operating system recognizes it.

Purpose: Pseudo swap allows more swap space to be made available than physically exists.

Benefit: Pseudo swap adds "75% of physical memory" to the amount of swap space that the operating system thinks is available. This lessens swap space requirements (especially helpful on large memory systems).

**NOTE: Pseudo swap is NOT allocated in memory!


This means the SWAP_AVAIL variable will have an additional amount (75% of physical memory) added to its content. This additional amount allows more processes to start when the physical swap has been completely reserved.

NOTE: Pseudo swap is enabled through a tunable OS parameter called swapmem_on.

If the value for swapmem_on is 1, pseudo swap is enabled (turned on). If the value for swapmem_on is 0, pseudo swap is disabled (turned off).

Analogy

A good analogy for pseudo swap is an airline overbooking a flight. Airlines know that customers sometimes do not show up for their flights. If an airline reserved only as many seats as the plane holds, it would likely depart with a plane that was not full, losing revenue. So it reserves more seats than actually exist on the plane, betting that a certain percentage of customers will not show. That way it can fly a plane that is much closer to full and earn more revenue. Of course, airlines are occasionally wrong.


7–9. SLIDE: Total Swap Space Calculation — with Pseudo Swap

Student Notes

The above slide shows how "Total Available Swap Space" (also known as SWAP_AVAIL) is calculated with pseudo swap turned on. The SWAP_AVAIL variable is calculated as all of the configured physical swap space (device and file system swap) PLUS 75% of physical memory (pseudo swap). (The calculation of the size of pseudo swap is actually more complex than given here; the resulting value can vary anywhere from 67% to 88% of physical memory. We use 75% as a fairly typical figure.)

In our example, the total amount of physical swap was 55 MB, and the amount of physical memory was 32 MB. Since the size of pseudo swap is estimated at 75% of physical memory, the pseudo swap size in our example is 24 MB.

Total Swap Space Calculation — with Pseudo Swap

  Memory Size   = 32 MB
                x   0.75
  Pseudo Swap   = 24 MB
+ Physical Swap = 55 MB
  Total Swap    = 79 MB


This means the Total Available Swap Space (SWAP_AVAIL) is:

   55 MB (Physical Swap)
 + 24 MB (Pseudo Swap)
 ------
   79 MB (Total Avail Swap)
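The same calculation can be done in shell arithmetic, using the 75% rule of thumb from the notes (the real kernel figure varies, roughly 67% to 88% of physical memory):

```shell
# Estimate total available swap with pseudo swap enabled.
mem_mb=32              # physical memory in the example
phys_swap_mb=55        # configured device + file system swap

pseudo_mb=$((mem_mb * 75 / 100))        # ~75% of physical memory
total_mb=$((phys_swap_mb + pseudo_mb))
echo "Pseudo swap: ${pseudo_mb} MB"
echo "Total available swap: ${total_mb} MB"
```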


7–10. SLIDE: Example Situation Using Pseudo Swap

Student Notes

The above slide revisits our previous situation with pseudo swap turned ON. Previously, we had 55 MB of swap space, of which 35 MB was in use and the remaining 20 MB was reserved. With pseudo swap turned OFF, we saw that no new processes could start, because no physical swap space was available for reservation purposes.

With pseudo swap turned ON, the total available swap space is 79 MB (not 55 MB). Therefore, when the system runs out of physical swap, it still has 24 MB (due to pseudo swap) that it thinks it can allocate and can therefore reserve. Consequently, the operating system is able to support more processes without having to allocate more physical swap space. This is important for large memory customers who do not want to purchase a lot of swap space on disk in order to support the large memory.

Example Situation Using Pseudo Swap

(Slide diagram: 20 MB of available memory and a 55 MB swap area. A new program wants to execute; not enough memory for the program to fit. With pseudo swap turned ON, the program can now execute! Allocation without pseudo swap: Reserved 20 MB, Used 35 MB, Swap Avail 0 MB. Allocation with pseudo swap: Reserved 20 MB, Used 35 MB, Swap Avail 24 MB.)


7–11. SLIDE: Swap Priorities

Student Notes

When the HP-UX operating system needs to page something from memory to a swap device, it selects the smallest-numbered, strongest-priority swap device. A system administrator can define a priority number for each swap device on the system. Priority numbers range from 0 to 10, with 0 being the strongest priority and 10 the weakest. If multiple swap devices are available when the system needs to page out to swap, the strongest-priority swap device is used.

The slide shows two examples. The first illustrates how the system behaves when two equal-priority swap devices are available. In this situation, the system alternates between the two devices: the first chunk of swap is allocated on swap device #1 and the second chunk on swap device #2.

The second example illustrates how the system behaves when two unequal-priority swap devices are available. In this situation, the system continues to allocate chunks of swap from the lowest-numbered (strongest-priority) swap device. Only when that device is 100% full does the system begin allocating chunks from the second swap device.

Swap Priorities

Equal Priorities (two devices, both priority 1):
  1st chunk of swap - disk 1, chunk 1
  2nd chunk of swap - disk 2, chunk 1
  3rd chunk of swap - disk 1, chunk 2
  4th chunk of swap - disk 2, chunk 2
  5th chunk will be allocated on disk 1

Unequal Priorities (a priority 1 device and a priority 2 device):
  1st chunk of swap - disk 1, chunk 1
  2nd chunk of swap - disk 1, chunk 2
  3rd chunk of swap - disk 1, chunk 3
  4th chunk of swap - disk 1, chunk 4
  5th chunk will be allocated on disk 1 (until it is full)
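The two placement policies on the slide can be sketched as a toy simulation. This is illustrative only; the real kernel allocator is more involved, and the function names are invented for the example.

```shell
# Chunk placement: with equal priorities the kernel alternates between
# the two devices; with unequal priorities it fills the strongest
# (lowest-numbered) priority device first.
place_equal() {    # $1 = chunk number (1-based), two priority-1 disks
    if [ $(( $1 % 2 )) -eq 1 ]; then echo "disk 1"; else echo "disk 2"; fi
}
place_unequal() {  # strongest device is used until it is 100% full
    echo "disk 1"
}
for c in 1 2 3 4 5; do
    echo "chunk $c: equal -> $(place_equal $c), unequal -> $(place_unequal $c)"
done
```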


7–12. SLIDE: Swap Chunks

Student Notes

A swap chunk is the unit of space that the operating system allocates on swap devices. The default swap chunk size is 2 MB.

In the above example, two equal-priority swap devices are available to the system. The system allocates the first swap chunk on swap device #1; this chunk is 2 MB by default. Once this swap chunk has been filled with 512 pages (page size = 4 KB), the system allocates a second swap chunk on swap device #2. The system continues alternating between the two devices in swap-chunk increments.

Swap chunks are also the unit in which swap space is allocated on file system swap devices. With file system swap, the operating system allocates space on the file system only if the space is needed; if it does not need the swap space, it does not allocate space. When it does need swap space, it allocates the file system space in swap-chunk sizes. Files are created, each of a size equal to a swap chunk, and named hostname.N, where N is a number from 0 on up.

Swap Chunks

Space on the swap device is allocated to the kernel in increments called swap chunks. The default swap chunk size is 2 MB.

(Slide diagram: two priority-1 swap devices, with chunks 1 and 3 on the first device and chunks 2 and 4 on the second.)


7–13. SLIDE: Swap Space Parameters

Student Notes

There are two configurable parameters and one fixed, non-configurable parameter that affect swap space configuration and allocation:

DEV_BSIZE      The size in bytes of a block of disk space. The default size is 1 KB. It is not configurable.

swchunk        The number of blocks (of size DEV_BSIZE) to associate with a "chunk" of swap space, referred to as a swap chunk. The default value is 2048 blocks, or 2 MB. The maximum value is 65,536, or 64 MB.

maxswapchunks  The maximum number of swap chunks that will be recognized systemwide. The default value is 256. The maximum value is 16,384.

Using these defaults, the maximum amount of swap space that the operating system recognizes is 512 MB. This means that if a system is physically configured with 1 GB of swap space, only 512 MB of the 1 GB will be used by the system. For the system to use the other 512 MB, the tunable OS parameter maxswapchunks needs to be increased to 512.
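The formula above can be checked with shell arithmetic, using the default values:

```shell
# Maximum swap space the kernel will recognize:
#   maxswapchunks x swchunk x DEV_BSIZE
DEV_BSIZE=1024         # bytes per disk block (fixed)
swchunk=2048           # blocks per swap chunk (default) -> 2 MB chunks
maxswapchunks=256      # maximum chunks systemwide (default)

total_mb=$((maxswapchunks * swchunk * DEV_BSIZE / 1024 / 1024))
echo "Kernel-recognized swap limit: ${total_mb} MB"
```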

Swap Space Parameters

Total swap space recognized by the kernel =
    maxswapchunks x swchunk x DEV_BSIZE

Defaults: 256 x 2048 x 1024 = 512 MB

DEV_BSIZE      Device block size. This is the size (in bytes) of a block on the disk. The default size is 1024 bytes.

swchunk        The number of blocks to allocate to the kernel when it needs swap space. The default is to allocate swap space to the kernel in 2-MB increments. The default value is 2048. The maximum value is 65,536.

maxswapchunks  The maximum number of swap chunks that can be allocated to the kernel. The default value is 256. The maximum value is 16,384.


If you were to install HP-UX on a system that had 2 GB of physical memory, the installation process would automatically increase maxswapchunks to accommodate the larger memory; in this example, it would set maxswapchunks to 1024. However, if you were to add more memory at a later date (without reinstalling the kernel), you would have to tune maxswapchunks manually to be able to allocate enough swap space and use all of your available memory, or use pseudo swap.

In 11.23 (11i v2), maxswapchunks has been eliminated and is no longer an issue.
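To back out the maxswapchunks value needed for a given amount of swap, divide by the chunk size. The sketch below assumes the default 2 MB swchunk:

```shell
# maxswapchunks needed = desired swap (MB) / chunk size (MB)
swap_mb=2048        # want the kernel to recognize 2 GB of swap
chunk_mb=2          # default swchunk: 2048 blocks x 1 KB = 2 MB
needed=$((swap_mb / chunk_mb))
echo "maxswapchunks needed: ${needed}"
```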


7–14. SLIDE: Summary

Student Notes

To summarize this module: all processes must reserve swap space, by decrementing a variable called SWAP_AVAIL, when they initialize. If this variable cannot be decremented, the process will not be able to start. To allow this variable to recognize more swap space than physically exists, set the tunable parameter swapmem_on to 1 to turn on pseudo swap. This allows more processes to execute than the amount of swap space can support, which is not considered a problem on large memory systems, because those machines are not expected to swap.

If a system does need to swap, it swaps to the lowest-numbered (strongest) priority swap device first. The priority of a swap device is specified when the device is activated. If two swap devices have the same priority, the system alternates between the two devices.

Swap chunks are the unit of disk space by which swap space is allocated. By default, the size of a swap chunk is 2 MB. By default, the system recognizes a maximum of 512 MB of swap space. If more swap space exists, the tunable parameter maxswapchunks must be increased for the additional swap space to be recognized. If maxswapchunks is already set to its maximum value, increase the value of swchunk.

Summary

• Swap space reservation

• Pseudo swap

• Swap priorities

• Swap chunks

• Swap space parameters


7–15. LAB: Monitoring Swap Space

Preliminary Steps

A portion of this lab requires you to interact with the ISL and boot menus, which can only be done via a console login. If you are using remote lab equipment, access your system's console interface via the GSP/MP.

You may get some "file system full" messages while you are shutting down the system. You can ignore these messages.

Directions

The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

             MB Available        MB Used

dev          ____________        ____________

reserve      ____________        ____________

memory       ____________        ____________

2. To see total swap space available and total swap space reserved, enter:

# swapinfo -mt

What is the total swap space available (including pseudo swap)?

What is the total space reserved?


3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option.

Upon verification, exit the shell. Is the swap space returned upon exiting the shell process?

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Utilization percentage increases in glance. Type:

# /home/h4262/memory/paging/mem256 &

Use the process that most closely matches your physical memory size. This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete by observing the incremental increases in Current Swap Utilization in glance. The system will get slower and slower as you start more mem256 processes.

What was the maximum number of mem256 processes that could be started?

What prevented an additional mem256 process from being started?

Kill all mem256 processes to restore performance.

5. Recompile the kernel, disabling pseudo swap. Use the following procedure:

11i v1 or earlier:

# cd /stand/build
# /usr/lbin/sysadm/system_prep -s system
# echo "swapmem_on 0" >> system
# mk_kernel -s ./system
# cd /
# shutdown -ry 0


11i v2 and later:

# cd /
# kctune swapmem_on=0
NOTE: The configuration being loaded contains the following change(s) that
      cannot be applied immediately and which will be held for the next boot:
      -- The tunable swapmem_on cannot be changed in a dynamic fashion.
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
==> Do you wish to update it to contain the current configuration before
    making the requested change? no
NOTE: The backup will not be updated.
* The requested changes have been saved, and will take effect at next boot.
Tunable       Value        Expression
swapmem_on    (now) 1      Default
              (next boot) 0     0
# shutdown -ry 0

6. Reboot from the new kernel.

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

7. Once the system reboots, login and execute swapinfo.

Is there a memory entry? Why or why not?

Will the same number of mem256 processes be able to execute as earlier?

How many mem256 processes can be started now?

Kill all mem256 processes to restore performance.

8. If you have a two-disk system:

Add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.

If you did not add the second disk earlier:


# vgdisplay -v | grep Name      (Note the physical disks used by vg00)
# ioscan -fnC disk              (Note which disks are unused)
# pvcreate -f <raw_dev_file_of_unused_disk>
# vgextend /dev/vg00 <block_dev_file_of_second_disk>

To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 <dev_file_of_second_disk>

Note: in our case the primary swap is 512 MB. Check swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap. Check your work.

# swapon -p 1 /dev/vg00/swap1
swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page
after the end of the file system, or -f to overwrite the file system
with paging.

Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:

# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be
increased to add paging on device /dev/vg00/swap1.

Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space, and we need to modify maxswapchunks and reboot. If you have this problem, use sam to double maxswapchunks, or recompile the kernel with the following procedure. (In 11i v2, maxswapchunks has been obsoleted and does not have to be modified.)

# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s ./system
# cd /
# shutdown -ry 0

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test


Now add the new swap device:

# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:

# swapinfo -mt

Done!

11. Start enough mem256 processes to make the system start paging.

12. Measure the disk I/O to see what is happening with the swap space. Go to question 15 when you have finished.

13. If you have a single-disk system, use this alternate procedure.

Create three additional swap devices with sizes of 20 MB:

# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00

List the current amount of swap space in use. If 10 MB is currently in use on a single swap device, and we activate an equal-priority swap device, what is the distribution if an additional 10 MB is paged out?

A) The distribution would be 10 MB and 10 MB.
B) The distribution would be 15 MB and 5 MB.

Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.
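The question above can be checked with a quick sketch (plain Python, not HP-UX code; the 1 MB granularity is an assumption for illustration). Per the note in the lab, only *new* paging activity is spread evenly across equal-priority devices; existing usage is never rebalanced.

```python
# Sketch (illustrative model, not kernel code): new paging is interleaved
# round-robin across equal-priority swap devices; prior usage stays put.
def page_out(devices, mb):
    """devices: current usage per device in MB; spread `mb` of new paging."""
    for i in range(mb):
        devices[i % len(devices)] += 1   # round-robin over equal priorities
    return devices
```

Starting with 10 MB on one device and activating an empty second device, `page_out([10, 0], 10)` yields `[15, 5]`: the new 10 MB is split evenly, so the totals match answer B, not answer A.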


14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3

Start enough mem256 processes to make the system start paging. Is the new paging activity being distributed evenly across the paging devices?

15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo swap and remove the additional swap devices.

For 11i v1 and earlier, follow this procedure:

# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:

# cd /
# kctune swapmem_on=1
# shutdown -ry 0


Module 8 Disk Performance Issues

Objectives

Upon completion of this module, you will be able to do the following:

• List three ways disk space can be used.

• List disk device files.

• Identify disk bottlenecks.

• Identify kernel system parameters.


8–1. SLIDE: Disk Overview

Student Notes

Disks are used to store data for the operating system and the applications. A disk can be used in several different ways, but they boil down to just two: file system and raw. If a disk holds a file system, several structures are built on the disk (using the data blocks of the disk) to support the software in the kernel that needs to access and manage the file system files and their contents. If a disk is to be used raw (such as a device swap space or an application database), no kernel structures are built out on the disk. The related code simply reads, manages, and organizes the data blocks as it sees fit.

There are several types of file systems available with the HP-UX 10.x and 11.x releases. The two primary types of local file systems are HFS (High performance File System), which was the original file system for HP-UX and has been continually enhanced since, and JFS (Journaled File System), which was introduced with the HP-UX 10.01 release and continues to grow in popularity and functionality. In the near future, you should see another type of file system become available for HP-UX: the Advanced File System (AdvFS), ported over from Tru64 UNIX. In later modules, we will discuss the performance issues that pertain to each of the available file systems. In this module, we address the issues pertaining to all disks.

[Slide diagram: Disk Overview — the physical view shows platters with tracks grouped into cylinders 0 through N-1; the logical view shows the disk as a linear array of data blocks.]

Physical View

From a physical disk perspective, the disk drives upon which a file system is placed contain sectors, tracks, platters, and read/write heads. A key behavior of almost all disk drives is that the read/write heads move in parallel across the platters in such a way that each read/write head is over the same track within each platter at the same time. To maximize the I/O throughput of the disk, it is desirable to minimize the amount of head movement. To help achieve this goal, all the sectors in a cylinder are addressed in sequential order.

Cylinder Analogy

Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder would be all the same lanes from each floor's jogging track. In other words, all lane 1 tracks would make up cylinder 1; all lane 2 tracks would make up cylinder 2; and so on.

By organizing space on disks in cylinders, the software can logically distribute its sectors across all platters of the disk evenly and uniformly. For example, in the slide above, the first 6 sectors would be allocated as follows:

block #1: Platter #1, Track #1, Sector #1
block #2: Platter #1, Track #1, Sector #2
block #3: Platter #1, Track #1, Sector #3
block #4: Platter #1, Track #1, Sector #4
block #5: Platter #1, Track #1, Sector #5
block #6: Platter #1, Track #1, Sector #6

By allocating disk space in this manner, a multiple block read (say 6 blocks) could be read in one operation.

Logical View

From a logical view, each cylinder is simply a repository for a certain amount of data, which can be read or written without having to move the heads. This data area is further broken down into blocks. The block is the most fundamental unit of data that can be read from or written to the disk. We mentioned in an earlier chapter a value in the kernel called DEV_BSIZE; it is equal to 1024 bytes. This is the block size from the kernel's perspective.

The disk can be viewed as simply a series of blocks running from block 0 to block N-1, where N is the total number of blocks on the disk. The closer two blocks are to each other, the more likely they are to be in the same cylinder. If they are in the same cylinder, a minimum amount of time is needed to read or write both blocks.
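The block-to-cylinder relationship can be made concrete with a small sketch. The geometry below is hypothetical (real drives vary, and modern drives virtualize this mapping), but it shows why sequential block numbers stay within one cylinder for as long as possible.

```python
# Sketch with hypothetical geometry (HEADS and SECTORS_PER_TRACK are assumed
# values, not any real drive's): map a linear block number to cylinder, head,
# and sector. Sequential blocks fill an entire cylinder before the heads move.
HEADS = 4
SECTORS_PER_TRACK = 32

def block_to_chs(block):
    per_cylinder = HEADS * SECTORS_PER_TRACK      # blocks held by one cylinder
    cylinder, rest = divmod(block, per_cylinder)
    head, sector = divmod(rest, SECTORS_PER_TRACK)
    return cylinder, head, sector

# With this geometry, blocks 0..127 all map to cylinder 0, so reading them
# needs no seek; block 128 is the first to require moving to cylinder 1.
```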


8–2. SLIDE: Disk I/O — Read Data Flow

Student Notes

Up to this point, we have looked at I/O from the standpoint of the disk. The following slide illustrates disk I/O activities from the standpoint of memory and the process initiating the I/O. The assumption here is that we are dealing with a disk that has a file system on it, so the buffer cache becomes a factor in the operation. If this were a raw disk, the buffer cache would be bypassed by all I/O operations.

Asynchronous vs. Synchronous Reads

There are two possible approaches to doing reads: synchronous and asynchronous. By default, any read is synchronous, i.e., the process will wait (and sleep, if necessary) until the data can be transferred to the data area of the process. If the read is asynchronous, the process informs another driver (an asynchronous I/O driver) that it will need certain data in the future. The driver fetches the data from the disk and places it in the buffer cache while the process continues with other operations. When the data is in the cache, the driver signals the process and the read is then executed. The data is guaranteed to be in the buffer, and the process never has to sleep. Asynchronous reads are significantly more difficult to program, so they are used only in the more sophisticated applications.

Disk I/O — Read Data Flow

1. Process issues read system call (logical I/O generated).
2. Block to be read is not in buffer cache; physical I/O is issued.
3. Block on disk is accessed through seek, latency, and transfer.
4. Data is read into buffer cache, completing physical I/O request.
5. Data is returned to process, completing the logical I/O and system call.

[Slide diagram: a process reads a file through the buffer cache in memory; on a miss, the request enters the disk I/O queue and is satisfied from the file system disk via seek, latency, and transfer.]


Buffered Read Data Flow

The flow diagram on the slide highlights the main actions from the time a process issues a read() system call to when the data is returned to the process.

1. A process issues the read() system call. This is viewed by the kernel as a logical I/O, meaning the kernel will satisfy the request any way it can, either through the buffer cache or by performing a physical I/O.

2. The buffer cache is searched, looking for the data blocks being requested. If the data block is found in the buffer cache, the read() system call returns with the corresponding data. If the data block is not found, the requesting process goes to sleep and a physical I/O request is generated to read the data block into the buffer cache. We will assume the data block was not found.

   NOTE: Logical I/Os may or may not generate corresponding physical I/Os. The goal of the buffer cache is to handle as many logical I/Os with as few physical I/Os as possible.

3. The physical read is performed because the data was not in the buffer cache. Because physical I/O involves movement of the disk head (seek time), waiting for the data on the platter to rotate under the disk head (latency time), and moving the data from the platter into memory (transfer time), the cost of a physical I/O is high from a performance standpoint. Physical I/Os are the most time-consuming operations that the kernel performs.

   If the disk I/O queue is long (3 or more requests), the time spent waiting to be serviced can be longer than the time to actually service the I/O request.

4. Once the physical I/O request returns, the data is stored in the buffer cache so that future I/O requests for the same file system block can be satisfied without having to perform another physical I/O. This step completes the physical I/O initiated by the kernel.

5. The final step is to return the data to the original calling process that issued the read(). The sleeping process is awakened and transfers the desired data from the buffer (in the buffer cache) to the data area of the process. Then the process returns from the read() system call. This step completes the logical I/O initiated by the process.

Raw Read Data Flow

If the read operation is raw, the buffer cache is bypassed. Data is transferred directly from the disk to the data area of the calling process. All raw reads are synchronous and therefore result in the process sleeping until the data has been read in.
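The buffered-read steps above amount to a read-through cache. The simplified Python model below (an illustration, not kernel code) shows how repeated logical reads of the same block cost only one physical read:

```python
# Sketch (a simplified model for illustration, not kernel code): a read-through
# buffer cache in which a logical read triggers a physical read only on a miss.
class BufferCache:
    def __init__(self):
        self.cache = {}                      # block number -> cached data
        self.logical = self.physical = 0

    def read(self, block, disk):
        self.logical += 1                    # step 1: logical I/O issued
        if block not in self.cache:          # step 2: cache searched, miss
            self.physical += 1               # step 3: physical I/O performed
            self.cache[block] = disk[block]  # step 4: block stored in cache
        return self.cache[block]             # step 5: data returned to caller

disk = {n: "data%d" % n for n in range(8)}   # stand-in for the physical disk
bc = BufferCache()
for _ in range(10):
    bc.read(3, disk)                         # 10 logical reads of one block
# only the first read went to "disk": bc.logical == 10, bc.physical == 1
```

This is exactly the ratio the buffer cache is designed to create: many logical I/Os satisfied by few physical I/Os.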


8–3. SLIDE: Disk I/O — Write Data Flow (Synchronous)

Student Notes

As with reads, there are two methods for performing write() system calls: asynchronous and synchronous. Although the default write operation is asynchronous (the writing process does not sleep waiting for the write to complete), it is quite simple for a program to choose synchronous writes. It can be done by simply setting a flag on the open file before issuing the write. This can be done when the file is opened or at some later time.

Synchronous Writes

The slide shows the data flow of a synchronous write, from the time the write() system call is issued to when the write call returns to the process.

1. The process issues a synchronous write() system call.

2. Assuming the process is writing to a new file data block, a new file system block is allocated on disk and an image of that block is allocated in the buffer cache.

3. Once the data is copied from the data area of the process to the buffer cache, an I/O request is placed in the disk I/O queue for that particular disk. The calling process goes to sleep until the write is reported to be complete.

Disk I/O — Write Data Flow (Synchronous)

1. Process issues write system call.
2. Block is assigned on disk, and image for block is allocated in buffer cache.
3. Once data is written to buffer cache, a physical I/O to disk is generated.
4. Data is written to disk controller cache.
5. Data is then transferred from the disk controller to the corresponding platter.
6. Upon completion of I/O, the disk controller sends an acknowledgment to the kernel.
7. Write system call returns to process.

[Slide diagram: the process writes into the buffer cache in memory, the request passes through the disk I/O queue to the disk controller cache, and finally to the platter.]


4. When the physical write is performed, the data is first copied from the buffer cache to the firmware cache on the disk drive controller.

   NOTE: Most SCSI disk drive controllers can be configured to return an I/O complete acknowledgment at this point, rather than waiting for the data to be transferred to the physical platters. This condition is called "immediate reporting".

5. The data is transferred from the disk controller cache to the platter. This operation is often the most time-consuming part of the write, as it involves seek, latency, and data transfer operations.

6. Once the data has been successfully transferred to the platters, the disk drive controller returns an I/O complete acknowledgment to the kernel (assuming this was not done in step 4 with immediate reporting).

7. The kernel, upon receiving the I/O complete acknowledgment, wakes the sleeping process, which then returns from the write call.

Asynchronous Writes

An asynchronous write does not wait for the data to get to the disk; the write system call returns as soon as the data has been written to the buffer cache. In the diagram on the slide, the write call would return following step 2.

The advantage of asynchronous writes is performance: the process does not have to wait for the physical I/O. The disadvantage is lack of data integrity. Because the process continues executing before the data is written to disk, it can perform additional actions that are dependent upon the data being written successfully. If for some reason the data does not get written (a disk goes offline or a disk head crashes), the additional actions can leave the system in an inconsistent state.

For example, assume a database record is written asynchronously. Because it is written asynchronously, the database process continues its execution. A subsequent action is to update a corresponding entry in another table of the database located on another disk. Assume the first asynchronous write is posted to a busy disk with a long queue, and the subsequent write is posted to a disk with an empty queue. The second write finishes before the first write begins! If the system were to crash after the second write, but before the first write, the database would be out of sync and corrupted, because the second write assumed that the first write succeeded. There is no "signaling" to the writing process to let it know that a write has completed. For that, the process would have to do synchronous writes.
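The text above notes that a program opts into synchronous writes simply by setting a flag on the open file. On a POSIX system this is the O_SYNC open flag; the sketch below is a generic POSIX illustration in Python, not HP-UX-specific code.

```python
import os
import tempfile

# Sketch: a process chooses synchronous writes by setting a flag when it
# opens the file (O_SYNC here). With the flag set, write() does not return
# until the data has reached the device, trading performance for integrity.
path = os.path.join(tempfile.mkdtemp(), "record.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
try:
    os.write(fd, b"committed-before-return")  # blocks until physically written
finally:
    os.close(fd)
```

A database avoiding the out-of-sync scenario described above would use this mode (or an fsync() after the write) for the first record, so the second write cannot be issued until the first is known to be on disk.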


8–4. SLIDE: Disk Metrics to Monitor — Systemwide

Student Notes

When monitoring disk I/O activity, the main metrics to monitor are:

• Percent utilization of the disk drives: As utilization of a disk drive increases, so does the amount of time it takes to perform an I/O. According to queuing theory, it takes twice as long to perform an I/O when the disk is 50% busy as it does when the disk is idle. Therefore, we consider that a disk may be experiencing a bottleneck if the disk is 50% busy or more.

• Requests in the disk I/O queue: The number of requests in the disk I/O queue is one of the best indicators of a disk performance problem. If the average number of requests is above two, then requests are forced to wait in the queue longer than the amount of time needed to service their own requests. If the average number of requests is three or greater, you should also see that the average wait time for a request is greater than the average service time.

• Amount of physical I/O: If the amount of disk activity is high, it is important to investigate which disk, which logical volume, and which file system the activity is occurring on.

Disk Metrics to Monitor — Systemwide

• Utilization of disk drives
• Disk I/O queue length
• Amount of physical I/O to
  – Device (i.e., disk)
  – Logical volume
  – File system
• Buffer cache hit ratio


• Buffer cache hit ratio: One reason disk activity could be high is that read or write requests are not finding the corresponding disk blocks in the buffer cache. As a result, physical I/O requests are being generated to the disk.

  The read cache hit ratio on the buffer cache indicates how frequently read data is found in the buffer cache. The read hit ratio should be 90% or higher for optimal performance. Less than 90% indicates the buffer cache may be too small, causing (potentially) excess disk activity. It may also indicate that the application is not using the buffer cache in an efficient manner, e.g., doing a lot of random I/O or very large I/O.

  The write cache hit ratio on the buffer cache indicates how frequently a write to a buffer does not trigger a physical read or write to the disk. (If only a portion of a block is being written, and the image of that block is not already in a buffer, it may be necessary to read the original contents of the block into the buffer cache before modifying it with the new write data.) The write cache hit ratio should be 70% or higher for optimal performance. Less than 70% indicates the buffer cache may be too small, causing (potentially) excess disk activity. Again, the fault may lie with the application's use of the buffer cache.
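The "twice as long at 50% busy" rule cited in the utilization bullet above comes from elementary queuing theory: for a single-server queue (M/M/1 assumptions), expected response time is R = S / (1 - U), where S is the service time and U the utilization. A sketch:

```python
# Sketch: the 50%-busy rule of thumb follows from simple queuing theory.
# For a single-server queue (M/M/1 assumptions), expected response time is
# R = S / (1 - U), with service time S and utilization U (0 <= U < 1).
def response_time(service_ms, utilization):
    return service_ms / (1.0 - utilization)
```

For a 10 ms service time, `response_time(10, 0.50)` is 20 ms, twice the idle-disk figure, and the curve steepens quickly past 50%: 40 ms at 75% busy. This is why 50% utilization is used as the bottleneck threshold.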


8–5. SLIDE: Disk Metrics to Monitor — Per Process

Student Notes

On a per-process basis, it is important to identify which processes are generating large amounts of disk I/O. Metrics that help to identify I/O activity on a per-process basis are:

• Amount of physical and logical I/O: This indicates "how much" I/O the process is performing. For processes performing large amounts of I/O, the additional three metrics shown below should be investigated.

• Type and amount of I/O-related system calls being generated: For each process performing high I/O, the number of read(), write(), and other I/O-related calls should be inspected.

• Amount of VM reads and VM writes: If the I/O activity being generated is due to paging (VM reads and VM writes), then the problem is probably not a disk I/O problem, but more likely a memory problem.

• Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files to which they are reading or writing should be inspected. For files receiving high I/O activity, consider relocating these files to other disks that are less busy. To determine how "random" the I/O requests are, hit <CR> frequently while looking at the list of open files for that process (in glance), then inspect how quickly the offset to each file changes and whether it is monotonically increasing or varies up and down.

Disk Metrics to Monitor — Per Process

• Amount of physical and logical I/O being performed on a per-process basis

• Type and amount of system calls (I/O-related) being generated by processes performing large amounts of I/O

• Paging to swap device (VM reads/writes) on a per-process basis

• Files opened by processes performing large amounts of I/O


8–6. SLIDE: Activities that Create a Large Amount of Disk I/O

Student Notes

Common causes of disk-related performance problems are shown on the slide.

• Buffer cache misses cause physical I/O to occur. When the appropriate buffer is not found in the buffer cache, a physical I/O is triggered. By the way, a buffer cache can be too large as well. A very large buffer cache takes more time to search to see if the appropriate buffer exists! More on how to properly size a buffer cache will be given later in this module.

• Synchronous I/O forces the write system calls to wait until the I/O physically completes. Very good for data integrity, very poor for performance.

• Sequential access, with a small block size, causes excessive amounts of physical I/O.

• Accessing lots of files on one disk, versus many disks, creates an imbalance of disk drive utilization. This leads to performance problems with the busy disks and under utilization with the less busy disks.

• Accessing lots of disks on the same disk controller creates contention problems on the SCSI bus. You can determine this by noticing that multiple disks on the same controller have request queues that are consistently three or greater in length, and that the average time a request waits to be serviced is greater than the average time it takes to actually service the request. The individual disks may not show a disk utilization of 50% or greater! If this situation occurs, it would be best to split the busiest disks onto separate controllers.

Activities that Create a Large Amount of Disk I/O

• Buffer cache misses

• Synchronous I/O

• Accessing sequentially with a small block size

• Accessing many files on a single disk

• Accessing many disk drives from a single disk controller card


8–7. SLIDE: Disk I/O Monitoring sar -d Output

Student Notes

The sar -d report shows disk activity on a per disk drive (spindle) basis. The key fields within this report are:

%busy    Indicates the average percent utilization of the disk over the interval (5 seconds in the slide).
avque    Indicates the average number of requests in the disk I/O queue.
avwait   Indicates the average amount of time a request spends waiting in the disk I/O queue.
avserv   Indicates the average amount of time to service a disk I/O request.

The sar -d report on the slide shows that when the disk had the most requests in the queue (19.60 and 18.77), the average wait time was at its highest. The slide also shows that there are five disk drives spread across two disk controllers. One disk controller (c0) appears to have two busy drives (t4 and t6) and a relatively low-usage drive (t5). Disk controller (c1) has two disks that are mainly idle. One performance solution here would be to balance the disk activity across the two controllers by moving one disk (say c0t4) over to the less busy disk controller (c1).

Disk I/O Monitoring sar -d Output

# sar -d 5 6

05:23:50   device    %busy   avque   r+w/s   blks/s   avwait   avserv
05:23:55   c1t5d0     0.60    0.50       2       35     1.55     5.07
           c0t4d0    62.40   10.51      46     2783   127.97   152.92
           c0t5d0    33.20    2.76      16     1226    42.89   143.96
           c0t6d0    54.80    8.10      31     2166   242.52   193.15
05:24:00   c1t5d0     1.20    0.50       3       39     1.97     6.72
           c0t4d0    63.80   10.84      48     2943   129.23   159.47
           c0t5d0    39.20    2.94      19     1427    38.85   154.55
           c0t6d0    61.80   19.60      36     2371   331.15   208.49
05:24:05   c1t5d0     2.20    0.50       3       45     3.85    13.04
           c0t4d0    56.40   18.40      39     2392   234.33   163.10
           c0t5d0    35.60    2.69      17     1258    39.96   138.81
           c0t6d0    62.80   18.41      36     2643   192.28   178.66
05:24:10   c1t5d0     0.20    0.50       2       35     1.01     4.86
           c0t4d0    68.60   13.00      51     3118   154.68   159.02
           c0t5d0    33.80    3.25      16     1226    47.82   147.32
           c0t6d0    60.00    5.72      33     2301   238.43   203.88
05:24:15   c0t4d0    24.40    4.25      15      823    60.83   180.68
           c0t5d0    23.00    3.46      14      851    43.33   118.87
           c0t6d0    50.60   18.77      28     1846   306.13   233.36
05:24:20   c1t6d0     0.60    0.50       0        2     4.63    11.53
           c1t5d0     1.40    1.17       2       23     9.85    21.50
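The thresholds used in this analysis (50% or more busy, or an average queue of 3 or more) can be applied mechanically to sar -d style data. The sketch below uses the slide's 05:23:55 sample; the tuple format is a simplification for illustration, not sar's actual output format.

```python
# Sketch: apply the module's bottleneck thresholds (>= 50% busy, or an
# average queue length of 3 or more) to rows shaped like sar -d output.
# Numbers are the 05:23:55 sample from the slide.
rows = [
    # (device,   %busy, avque)
    ("c1t5d0",    0.60,  0.50),
    ("c0t4d0",   62.40, 10.51),
    ("c0t5d0",   33.20,  2.76),
    ("c0t6d0",   54.80,  8.10),
]

def bottlenecks(rows, busy=50.0, queue=3.0):
    return [dev for dev, pct, q in rows if pct >= busy or q >= queue]

# bottlenecks(rows) -> ["c0t4d0", "c0t6d0"]: the two c0 drives worth moving
# toward the mostly idle controller (c1)
```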


8–8. SLIDE: Disk I/O Monitoring sar -b Output

Student Notes

The sar -b report shows disk activity related to the buffer cache. The key fields within this report are:

bread/s   Indicates the average number of physical I/O reads per second over the interval. The term bread refers to block reads.
lread/s   Indicates the average number of logical I/O reads per second over the interval.
%rcache   Indicates the average percent read cache hit rate. This shows what percentage of read requests were satisfied through the buffer cache. Ideally, this value should be consistently 90% or greater.
bwrit/s   Indicates the average number of physical I/O writes per second over the interval. The term bwrit refers to block writes.
lwrit/s   Indicates the average number of logical I/O writes per second over the interval.
%wcache   Indicates the average percent write cache hit rate. This shows what percentage of write requests were satisfied through the buffer cache. Ideally, this value should be consistently 70% or greater.

The sar -b report on the slide shows the two extreme situations. The first extreme is a 100% cache hit rate, which occurs when there are lots of logical I/O requests and all requests are satisfied through the buffer cache, rather than having to go to disk. This is a very desirable condition. The other extreme is a 0% cache hit ratio. This occurs when every logical I/O request requires a physical I/O from disk. In this case, the number of physical reads or writes is equal to the number of logical reads or writes. This is most undesirable.

Disk I/O Monitoring sar -b Output

#=> sar -b 10 20

HP-UX e2403roc B.10.20 U 9000/856    02/09/98

05:51:04  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
05:51:14        0        0        0        1        1       25        0        0
05:52:04        0        0        0        0        1       85        0        0
05:52:14        0        0        0        1        8       87        0        0
05:52:24        0        0        0        0        4      100        0        0
05:52:34        0        0        0        0        1      100        0        0
05:52:54        1       68       99        0        0       33        0        0
05:53:04        7    11936      100        1        2       13        0        0
05:53:14        6    19506      100        1        1        0        0        0
05:53:24       28    24147      100        1        2       65        0        0
05:53:34       64    16659      100        0       14       99        0        0
05:53:44      118      118        0        2        3       46        0        0
05:53:54        0        0        0        3        3        0        0        0
05:54:04        0        0        0       18       19        4        0        0
05:54:14      179      179        0       18       18        3        0        0
05:54:24      179      179        0       13       14        4        0        0

Average        29     3639       99        3        5       39        0        0
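The %rcache and %wcache columns reduce to simple arithmetic on the logical and physical counts: hit ratio = (logical - physical) / logical. A sketch (this is the standard definition, not necessarily sar's exact implementation; the sample counts are taken from the slide):

```python
# Sketch: buffer cache hit ratios as reported by sar -b and glance reduce to
# this arithmetic. Sample counts below come from the sar -b slide.
def hit_ratio(logical, physical):
    if logical == 0:
        return 0.0                    # no logical I/O: report 0, as sar does
    return 100.0 * (logical - physical) / logical

read_pct = hit_ratio(logical=24147, physical=28)   # ~99.9%: healthy
write_pct = hit_ratio(logical=5, physical=3)       # 40%: below the 70% guideline
```

Note how the 0% rows on the slide (e.g. 05:54:14, with 179 breads against 179 lreads) fall straight out of this formula: every logical read became a physical read.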


8–9. SLIDE: Disk I/O Monitoring glance — Disk Report

Student Notes

The glance disk report (d key) shows local and remote I/O activity. The I/O distribution can be viewed from the following:

• Logical Perspective (logical reads and logical writes)

• Physical Perspective (physical reads and physical writes)

• I/O Type Perspective (User, Virtual Mem, System, Raw)

Items of interest in this report include the number of logical I/O requests (reads and writes), the number of physical I/O requests (reads and writes), and the ratio between the two. In the slide, disk utilization is 83% (very high), with the majority of the I/Os being writes (92%) as opposed to reads. It is also interesting to note that the logical-to-physical write ratio is 14,798 / 3,520, or approximately 4:1, which is an acceptable write performance ratio.

Disk I/O Monitoring glance — Disk Report

B3692A GlancePlus B.10.12     06:16:25    e2403roc    9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util  |100%   100%   100%
Disk Util | 83%    22%    84%
Mem Util  | 94%    95%    96%
Swap Util | 21%    21%    22%
--------------------------------------------------------------------------------
DISK REPORT                                                          Users=    4
Req Type         Requests     %    Rate    Bytes   Cum Req     %  Cum Rate  Cum Byte
--------------------------------------------------------------------------------
Local  Logl Rds        68   2.7    13.6      5kb      1260   7.8      9.6     3.2mb
       Logl Wts      2455  97.3   491.0   19.2mb     14798  92.2    112.9   114.8mb
       Phys Rds        10   1.7     2.0     80kb       189   5.1      1.4     1.8mb
       Phys Wts       565  98.3   113.0   18.9mb      3520  94.9     26.8   112.4mb
       User           571  99.3   114.2   18.9mb      3448  93.0     26.3   112.2mb
       Virt Mem         0   0.0     0.0      0kb        66   1.8      0.5     968kb
       System           4   0.7     0.8     32kb       195   5.3      1.4     1.2mb
       Raw              0   0.0     0.0      0kb         0   0.0      0.0       0kb
Remote Logl Rds         0   0.0     0.0      0kb         0   0.0      0.0       0kb
       Logl Wts         0   0.0     0.0      0kb         0   0.0      0.0       0kb
       Phys Rds         0   0.0     0.0      0kb         1 100.0      0.0       0kb
       Phys Wts         0   0.0     0.0      0kb         0   0.0      0.0       0kb



8–10. SLIDE: Disk I/O Monitoring glance — Disk Device I/O

Student Notes

The glance disk device report (u key) shows current and average utilization of each disk drive on the system. The report also shows the current I/O queue length for each disk. This display shows basically the same information as sar -d. In the slide, three disks show utilization greater than 50% and queue lengths greater than 3. This is normally a valid reason for further investigation. The 10.6 and 18.2 queue lengths are high, but because the average utilization of both drives is only 9%, this may just be a spike in disk activity. In this case, monitor the situation further to see whether the high queue lengths persist or were just spikes in disk usage.

Disk I/O Monitoring glance — Disk Device I/O

B3692A GlancePlus B.10.12     06:31:12    e2403roc    9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util  |100%   100%   100%
Disk Util | 83%    22%    84%
Mem Util  | 94%    95%    96%
Swap Util | 21%    21%    22%
--------------------------------------------------------------------------------
IO BY DISK                                                           Users=    4
Idx  Device      Util   Qlen      KB/Sec         Logl IO     Phys IO
--------------------------------------------------------------------------------
  1  56/52.6.0   0/ 0    0.0      0.0/   1.8     na/  na    0.0/ 0.2
  2  56/52.5.0   1/ 1    0.0     16.0/   5.1     na/  na    2.0/ 0.7
  3  56/36.4.0  78/ 9   18.2   1584.8/ 178.4     na/  na   48.0/ 5.6
  4  56/36.5.0  52/ 6    3.8    932.8/ 120.5     na/  na   24.0/ 3.0
  5  56/36.6.0  68/ 9   10.6   1172.8/ 154.9     na/  na   35.8/ 4.6
  6  56/52.2.0   0/ 0    0.0      0.0/   0.0    0.0/ 0.0    0.0/ 0.0

Top disk user:  PID 3280, disc  106.4 IOs/sec        S - Select a Disk



8–11. SLIDE: Disk I/O Monitoring glance — Logical Volume I/O

Student Notes

The glance logical volume report (v key) shows disk activity on a per-logical-volume basis. Only physical I/O activity (not logical I/O activity) is shown in this report.

In the previous slide, we saw high activity across three disk drives (drives 4, 5, and 6). The logical volume report on the slide shows that all this activity is being performed against one logical volume (/dev/vg01/lvol1), which implies that the logical volume is spread across three disks (a good idea, since the I/O to the logical volume is so high).

Disk I/O Monitoring glance — Logical Volume I/O

B3692A GlancePlus B.10.12  06:34:41  e2403roc  9000/856    Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util   |                                                 100%   100%  100%
Disk Util  |                                                  83%    22%   84%
Mem Util   |                                                  94%    95%   96%
Swap Util  |                                                  21%    21%   22%
--------------------------------------------------------------------------------
IO BY LOGICAL VOLUME                                                 Users=    4
Idx  Vol Group/Log Volume    Open LVs     LV Reads       LV Writes
--------------------------------------------------------------------------------
  1  /dev/vg00                     10     0.0/  0.0      0.0/  0.0
  2  /dev/vg00/group                      0.0/  0.0      0.0/  0.0
  3  /dev/vg00/lvol3                      0.0/  0.0      0.2/  0.0
  4  /dev/vg00/lvol2                      0.0/  0.0      0.0/  0.0
  5  /dev/vg00/lvol1                      0.0/  0.0      0.0/  0.0
  9  /dev/vg00/lvol7                      0.0/  0.0      0.0/  0.0
 10  /dev/vg00/lvol4                      0.0/  0.0      0.0/  0.0
 12  /dev/vg01                      2     0.0/  0.0      0.0/  0.0
 13  /dev/vg01/lvol1                      0.0/  0.0    105.6/ 19.2

Open Volume Groups:    2                                   S - Select a Volume



8–12. SLIDE: Disk I/O Monitoring glance — System Calls per Process

Student Notes

The glance system calls report (L key), available only from the select process report (s key), shows the names of the system calls being generated by the selected process. The system calls report can be viewed for individual processes (as shown on the slide) or globally for all processes on the system (Y key).

Significant system calls, which typically consume a lot of time, are the file I/O-related calls, such as read(), write(), open(), and close(). In the slide, the write() system call is being invoked heavily by the selected process (754 times/second) and has accounted for 4.1 seconds of the CPU's time over a 27-second period (approximately 15%).

Disk I/O Monitoring glance — System Calls per Process

B3692A GlancePlus B.10.12  06:48:15  e2403roc  9000/856    Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util   |                                                 100%   100%  100%
Disk Util  |                                                  83%    22%   84%
Mem Util   |                                                  94%    95%   96%
Swap Util  |                                                  21%    21%   22%
--------------------------------------------------------------------------------
System Calls for PID: 4055, disc    PPID: 2410    euid: 0    User: root

                                      Elapsed                          Elapsed
System Call Name    ID  Count   Rate     Time   Cum Ct  CumRate       CumTime
--------------------------------------------------------------------------------
write                4    377  754.0  0.10650    12851    477.7       4.10153
open                 5      3    6.0  0.05910      100      3.7       0.61923
close                6      3    6.0  0.00006      100      3.7       0.00225
lseek               19      0    0.0  0.00000       75      2.7       0.00204
ioctl               54      3    6.0  0.00007      100      3.7       0.00259
vfork               66      0    0.0  0.00000       25      0.9       0.34908
sigprocmask        185      0    0.0  0.00000       50      1.8       0.00088
sigaction          188      0    0.0  0.00000      150      5.5       0.01340
waitpid            200      0    0.0  0.00000       25      0.9       1.47745

Cumulative Interval: 27 secs



8–13. SLIDE: Tuning a Disk I/O-Bound System — Hardware Solutions

Student Notes

The hardware solutions on the slide will help to lessen the performance impact of high disk I/O on a system.

• Add more disk drives and load balance across disks. This spreads the amount of I/O over more drives, decreasing the average number of I/O requests for each disk. Many smaller disks are better than a few large disks.

• Add more disk controllers and balance load across disk controllers. This spreads the amount of I/O over more controllers, decreasing the likelihood that any one disk controller will become overloaded with I/O requests.

• Add faster disk drives. This decreases the amount of time it takes to service an I/O request, which decreases the amount of time requests spend waiting in the disk I/O queue.

• Implement disk striping. This increases the number of disk heads having access to the striped data (the more disks striped across, the more heads accessing the data simultaneously). It also allows for “overlapping seeks,” meaning that one disk head can be seeking the next block while a second disk head is reading the current data block.

• Implement disk mirroring. This can increase read performance, as either the primary or the mirrored copy of the data can be read. In fact, the data will be read from whichever disk has the fewest I/Os pending against it. However, mirroring negatively impacts write performance: to maintain the integrity of the mirrors, duplicate writes must be done to each copy of the mirrored volume/disk. Mirroring is primarily a data integrity feature, but under the right circumstances (read-intensive data) it can improve performance as well.
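The read/write trade-off of mirroring can be made concrete with a little arithmetic. The sketch below is a simplified model with hypothetical I/O rates (two-way mirroring, reads split evenly across the pair); it is not an HP-UX tool:

```shell
# Rough per-disk load for a two-way mirrored pair: reads can be
# satisfied by either copy (so they split across the two disks),
# but every write must be applied to both copies.
per_disk_iops() {
    reads=$1; writes=$2            # hypothetical IOs/sec for the volume
    echo $(( reads / 2 + writes ))
}

per_disk_iops 100 40   # prints 90: each disk sees 50 reads plus all 40 writes
```

For a read-heavy workload the per-disk load drops well below the unmirrored rate, which is why mirroring can help read performance; for a write-heavy workload each disk still carries the full write rate.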


8–14. SLIDE: Tuning a Disk I/O-Bound System — Perform Asynchronous Meta-data I/O

Student Notes

Asynchronous I/O significantly improves write performance over synchronous I/O, because the write requests (and thus the requesting processes) do not have to wait for the data to be written to the disk platters.

Immediate Reporting for Selected Disks

Immediate reporting can be turned on at boot time by setting the tunable parameter default_disk_ir to ON. An alternative to turning on default_disk_ir is to selectively enable certain disk controllers to report to the kernel as soon as the data reaches the disk controller cache. For normal writes, the disk waits until data is transferred from the controller cache to the disk platters before returning to the kernel. By setting immediate reporting to ON for individual disk controllers, processes do not have to wait for the seek or latency times when writing to those disks. The scsictl command can be used to turn immediate reporting ON (1) for a particular SCSI disk. The default for immediate reporting is OFF (0).

Tuning a Disk I/O-Bound System — Perform Asynchronous I/O

• Configure individual disk drives to behave “somewhat” asynchronously with the immediate reporting feature of SCSI disk controllers.

• Configure immediate reporting with the scsictl command.

(Slide diagram: the write path from a process through memory, the buffer cache, and the disk I/O queue to the disk controller cache.)


Examples

To view the device settings for the controller at SCSI adapter address "0" and SCSI target address 6:

# /usr/sbin/scsictl -m ir /dev/rdsk/c0t6d0
immediate_report = 0

To change the value of immediate reporting to ON:

# /usr/sbin/scsictl -m ir=1 /dev/rdsk/c0t6d0

To view the changes in the device settings:

# /usr/sbin/scsictl -a /dev/rdsk/c0t6d0
immediate_report = 1; queue_depth = 8


8–15. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Controllers

Student Notes

Another potential solution to a disk I/O performance problem is to spread the write requests across the disk controllers as evenly as possible. This helps ensure that no one controller becomes overloaded with I/O requests.

Mirroring Logical Volumes

A popular feature of LVM is the ability to mirror logical volumes to separate disk drives. This involves writing one copy of the data to the primary disk and one copy to the mirrored disk. When the primary disk and mirror disk are on the same disk controller, a performance bottleneck often results because the disk controller has to service the writes for both the primary and mirrored data.

Physical Volume Groups

Physical volume groups (PVGs) allow disk drives to be grouped based on the disk controller to which they're attached. Used in conjunction with LVM mirroring, this ensures the mirrored data not only goes to a different disk, but also to a different PVG (that is, a different disk controller).

Tuning a Disk I/O-Bound System — Load Balance across Disk Controllers

(Slide diagram: volume group vg01 split into two physical volume groups, PVG1 and PVG2, attached to the system through separate controllers C0 and C1.)


How to Set Up PVGs

The PVG groups are defined in the /etc/lvmpvg file. This file can be manually edited, or updated with the -g option to the vgcreate and vgextend commands. A sample /etc/lvmpvg file, based on the four disks on the slide, is:

VG /dev/vg01
PVG PV_group0
/dev/dsk/c0t6d0
/dev/dsk/c0t5d0
PVG PV_group1
/dev/dsk/c2t5d0
/dev/dsk/c2t4d0

Configuring LVM to Mirror to Different PVGs

The command to configure LVM mirroring across different PVGs is lvchange. The strict option to this command, -s, takes one of three arguments:

y    All mirrored copies must reside on different disks.
n    Mirrored copies can reside on the same disk as the primary copy.
g    All mirrored copies must reside within different PVGs.

For example, to configure /dev/vg01/lvol1 to mirror across different PVGs:

lvchange -s g /dev/vg01/lvol1


8–16. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Drives

Student Notes

Balancing the disk activity so that utilization across drives is approximately the same helps ensure that no one disk becomes overloaded with I/O requests (that is, 50% or greater utilization, with three or more requests in the disk queue). The slide illustrates a situation in which one disk is heavily utilized (100%) while another is only 5% utilized. One potential solution is to stripe the heavily utilized logical volume on the first disk across both disks.

LVM Striping

The ability to stripe a logical volume across multiple disks (at a file system block level) was introduced into LVM at the HP-UX 10.01 release. A logical volume must be configured for striping at the time of creation. Once a logical volume is created, it cannot be striped without recreating the logical volume.

Tuning a Disk I/O-Bound System — Load Balance across Disk Drives

(Slide diagram: volume group vg01, before and after striping. Without striping, the four disks run at 100%, 90%, 5%, and 20% utilization; with the busy logical volume striped across the first and third disks, alternating stripes 1, 3, 5, ... and 2, 4, 6, ..., those two disks each run at about 52%, while the others remain at 90% and 20%.)


The command to create a striped logical volume is lvcreate. The striping-related syntax for this command is:

lvcreate -i [number of disks] -I [stripe size] -L [size in MB] vg_name

Example:

lvcreate -i 2 -I 8 /dev/vg01
lvextend -L 50 /dev/vg01/lvol2 /dev/dsk/c0t5d0 /dev/dsk/c0t4d0


8–17. SLIDE: Tuning a Disk I/O-Bound System — Tune Buffer Cache

Student Notes

With the introduction of HP-UX 10.0, the buffer cache became dynamic, growing and shrinking between a minimum size and a maximum size.

NOTE: Space for the buffer cache is allocated in two different areas of memory: the minimum size is created in the O/S area of memory, and anything above the minimum size is allocated from the user process area.

How the Buffer Cache Grows

As the kernel reads in files from the file system, it will try to store the data in the buffer cache. If memory is available and the buffer cache has not reached its maximum size, the kernel will grow the buffer cache to make room for the new data. As long as there is memory available, the kernel will keep growing the buffer cache until it reaches its maximum size (50% of memory, by default). If memory is not available, or the buffer cache is at its maximum size when new data is read, the kernel will select buffer cache entries that are least likely to be needed in the future, and reallocate those entries to store the new data.

Tuning a Disk I/O-Bound System — Tune Buffer Cache

(Slide diagram: memory layout showing the kernel and OS tables, a fixed buffer cache of 5%, an additional dynamic buffer cache of 0-45%, and the user process and shared memory area. Defaults: dbc_min_pct=5%, dbc_max_pct=50%.)


The main point is that if there is available memory, the buffer cache will grow into this memory until there is no memory left (or until the buffer cache reaches its maximum size).

How the Buffer Cache Shrinks

As memory falls below LOTSFREE, the vhand paging daemon wakes up and begins paging out 4-KB pages of memory. The eligible pages include process segments (text, data, and stack), shared memory segments, and the buffer cache. In other words, the buffer cache is shrunk by having vhand page out its pages. The buffer cache is treated by vhand as just another structure in memory with pages that it can dereference and free. Like process text pages, buffer cache pages are not written out to the swap space. But, since their contents may have been modified, they may need to be flushed out to the file system before being placed back on the free page list.

NOTE: The kernel global value dbc_steal_factor determines how aggressively the vhand daemon steals buffer cache pages in comparison to process pages. A value of 16 says to treat buffer cache pages no differently than process pages; the default value of 48 says to steal buffer cache pages three times as aggressively. However, if the buffer cache is referencing those pages, vhand will find few buffers to free up.

Buffer Cache Performance Implications

Because the buffer cache grows quickly into free memory but shrinks slowly (by requiring vhand to page it out), one consideration is to limit the maximum size to which the buffer cache can grow. The default maximum size is 50% of total memory. This was probably a fairly reasonable number when the parameter was introduced, but with the very large memory systems existing nowadays, it is probably much too high. By setting the dbc_max_pct tunable kernel parameter to a smaller number (say, 20 or 25), the buffer cache can still grow to a significant size, but will not be so large that it takes a long time to shrink when more processes become ready to execute. Prior to HP-UX 11i, there was a definite performance penalty for having a buffer cache that was too large: it took a long time to search the cache to determine whether the needed buffer was already there. Improvements in the search algorithm in 11i have reduced that penalty significantly.
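To see what dbc_min_pct and dbc_max_pct mean in absolute terms on a given system, it helps to convert the percentages to megabytes. The helper below is purely illustrative (not an HP-UX command); the 5%/50% defaults are the values quoted above:

```shell
# Convert dbc_min_pct/dbc_max_pct into megabytes for a given amount
# of physical memory. Defaults match the values in the notes.
bufcache_range() {
    mem_mb=$1; dbc_min_pct=${2:-5}; dbc_max_pct=${3:-50}
    echo "$(( mem_mb * dbc_min_pct / 100 )) $(( mem_mb * dbc_max_pct / 100 ))"
}

bufcache_range 4096        # prints "204 2048": up to 2 GB of cache by default
bufcache_range 4096 5 25   # prints "204 1024": with dbc_max_pct lowered to 25
```

On a 4-GB system the default ceiling is already 2 GB of buffer cache, which illustrates why the notes suggest lowering dbc_max_pct on large-memory machines.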

Fixed vs. Dynamic Buffer Cache

Should you use a fixed-size buffer cache or a dynamic buffer cache? If your buffer cache requirements are constant over time, you should of course use a fixed-size buffer cache. Simply set the dbc_min_pct and dbc_max_pct parameters to the same value. If your buffer cache requirements change over time, do they change rapidly or slowly? There is some overhead associated with growing and shrinking the buffer cache, and shrinking the buffer cache is not a very fast operation. If your buffer cache requirements change slowly over time, it would be best to use a dynamic buffer cache. The overhead of growing and shrinking would be spread out and become relatively insignificant.


If, however, your buffer cache requirements change rapidly over time, you would probably be better served by a fixed-size buffer cache, properly sized to give you adequate buffers most of the time. Only on relatively rare occasions would the buffer cache be a bottleneck, and only for short periods. In the long run, your performance would be better than if you tried to track the rapidly changing needs with a dynamic buffer cache.

Sizing Buffer Cache

Here is a set of recommendations for properly sizing your buffer cache.

1. Are you getting at least a 90% read cache hit rate and a 70% write cache hit rate? If so, your buffer cache may already be larger than necessary. If you are experiencing no memory pressure, and no apparent disk bottlenecks, leave the buffer cache as it is.

2. If you are experiencing memory pressure or apparent disk bottlenecks, try shrinking the size of your buffer cache. Adjust dbc_max_pct down, in increments, no more than 10% at a time, until your performance figures fall to 90%/70%.

3. If you are not getting 90%/70% performance from your buffer cache, it may be too small or your application may be using it in an inefficient manner. Try increasing its size. If the figures improve, keep increasing the size until either you reach 90%/70% or your performance ceases to improve. Leave the size there.

4. If increasing the size of the buffer cache does not produce an immediate improvement in performance, your application may need to be tuned to use the buffer cache more efficiently. However, your buffer cache may still be larger than it needs to be. After you have tuned your application, recheck your buffer cache performance, as above.
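The 90%/70% targets in the recommendations above are cache hit rates, which can be derived from logical versus physical I/O counts (the same arithmetic behind the %rcache and %wcache columns of sar -b). A small illustrative helper, with invented sample counts:

```shell
# Hit rate = (logical IOs - physical IOs) / logical IOs.
# Every logical request that did not cause a physical transfer
# was satisfied from the buffer cache.
cache_hit_pct() {
    awk -v l="$1" -v p="$2" 'BEGIN { printf "%.0f\n", (l - p) * 100 / l }'
}

cache_hit_pct 1000 80    # reads:  92 -> above the 90% read target
cache_hit_pct 500 200    # writes: 60 -> below the 70% write target
```

In the second case the write hit rate falls short of the 70% target, so per the steps above you would experiment with growing the cache, or look at how the application writes.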


8–18. LAB: Disk Performance Issues

Directions

The following lab illustrates a number of performance issues related to disks.

1. A file system is required for this lab. One was created in an earlier exercise. Mount it now.

# mount /dev/vg00/vxfs /vxfs

We also need to ensure that the controller does not have SCSI immediate reporting enabled. Enter the following command and check your current state (fill in the device file name as appropriate):

# scsictl -m ir /dev/rdsk/cXtXdX      (to report current "ir" status)

If the current immediate_report = 1, then enter the following:

# scsictl -m ir=0 /dev/rdsk/cXtXdX    (ir=1 to set, ir=0 to clear)

2. Copy the lab files to the file system.

# cp /home/h4262/disk/lab1/disk_long /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs


4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key.

From the first window, time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real: _____        glance Disk Report
user: _____        Logl Rds: _____
sys:  _____        Phys Rds: _____

5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

# timex cat file* > /dev/null

real: _____        glance Disk Report
user: _____        Logl Rds: _____
sys:  _____        Phys Rds: _____

NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.

6. The sar -d report.

Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

7. The glance I/O by Disk report.

Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long, timing the execution. Record results below:

# ./disk_long

glance I/O by Disk Report      Util: _____      Qlen: _____


8. The glance I/O by File System report.

Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long, timing the execution. Record results below:

# ./disk_long

glance I/O by File System Report      Logl I/O: _____      Phys I/O: _____

9. Performance tuning — immediate reporting. Ensure the immediate reporting option is set for the disk that the file system is located on. If immediate reporting is not set, set it.

# scsictl -m ir /dev/rdsk/cXtXdX      (to report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX    (ir=1 to set, ir=0 to clear)

Purge the contents of the buffer cache.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

10. The sar -d report.

Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

How do the results of step 10 compare to the results in step 6?

________________________________________________________________


Module 9 HFS File System Performance

Objectives

Upon completion of this module, you will be able to do the following:

• List three ways HFS file systems are used.

• List basic HFS file system data structures.

• Identify HFS file system bottlenecks.

• Identify HFS kernel system parameters.


9–1. SLIDE: HFS File System Overview

Student Notes

The HFS model is a foundation for all other file system variants. We will begin our discussion of file system performance using the HFS file system model.

The HP-UX File System

The HFS file system strategically lays out its data structures on disk to most efficiently utilize the geometry of the disk. The design of the HFS file system can best be explained by looking at the file system from three perspectives.

Physical View

From a physical disk perspective, the disk drive upon which a file system is placed contains sectors, tracks, platters, and disk heads. A key behavior of almost all disk drives is that the disk heads move in parallel across the platters, so that each disk head is over the same track of its platter at the same time. To maximize the file system I/O throughput of the disk, it is desirable to have as many file blocks close to each other as possible, to minimize the time it takes to read or write the various blocks of a file. To help achieve this goal, the blocks on the disk are allocated to the HFS file system in units called cylinder groups. A cylinder group is all the tracks, from every platter, of several adjacent cylinders, grouped together.

(Slide diagram: three views of an HFS file system. The physical view shows the cylinder group tracks on the disk; the logical view shows cylinder groups 1 through N laid end to end, starting with the primary superblock; the internal cylinder group view shows data blocks, the redundant superblock, the cylinder group header, the inode table, and more data blocks.)

Cylinder Group Analogy

Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder group would be the same group of lanes from each floor's jogging track. In other words, all lane 1, 2, and 3 tracks would make up cylinder group 1; all lane 4, 5, and 6 tracks would make up cylinder group 2, etc. By organizing space on disks in cylinder group units, the HFS file system can logically keep all the blocks of a given file close to each other. For example, in the slide above, the first 6 blocks of a file might be allocated as follows:

File block #1: Platter #1, Track #1, Sector #1
File block #2: Platter #1, Track #1, Sector #2
File block #3: Platter #1, Track #3, Sector #5
File block #4: Platter #1, Track #3, Sector #6
File block #5: Platter #2, Track #7, Sector #10
File block #6: Platter #3, Track #9, Sector #7

By allocating file system space in this manner, a multiple block read (say 6 blocks) could be read with less than six separate reads. In the example above, file blocks 1 and 2 could be read with one read operation, followed by a head switch (no carriage movement) to track 3, another read for file blocks 3 and 4, a short seek to the next cylinder and a head switch to read file block 5, and repeat for file block 6. Four reads could then read the six blocks. The more contiguous the blocks that make up the file, the more efficient the reads and writes can be.
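The four-read result above can be checked mechanically: a new physical read starts whenever the next block is not the next consecutive sector on the same platter and track. This is an illustrative sketch only (the merging rule is simplified; real drives and drivers coalesce requests differently):

```shell
# Count the physical reads needed for a list of block locations.
# A read continues only while platter and track are unchanged and
# the sector number increases by exactly one.
count_reads() {
    awk '{
        if (NR == 1 || $1 != p || $2 != t || $3 != s + 1) reads++
        p = $1; t = $2; s = $3
    } END { print reads }'
}

# platter  track  sector, per the allocation list above
count_reads <<'EOF'
1 1 1
1 1 2
1 3 5
1 3 6
2 7 10
3 9 7
EOF
# prints 4
```

Blocks 1-2 and 3-4 each coalesce into one read, while blocks 5 and 6 each need their own, giving the four reads described in the text.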

Logical View

From a logical perspective, an HFS file system contains a series of cylinder groups. Even though the physical cylinder groups are laid out from top to bottom, spanning all the platters, logically we view the cylinder groups as horizontal units going from left to right. The HFS file system is made up of multiple cylinder groups, where the number of cylinder groups depends on the size of the file system. In the slide, we assume the HFS file system takes the whole disk; therefore, there are N cylinder groups in the sample file system. Typically, they are numbered from 0 to N-1.

A critical data structure contained within every HFS file system is the primary superblock. The primary superblock is located at the start of every HFS file system, at the start of the first cylinder group, and contains the critical header information for the HFS file system. Data structures contained within the superblock include the free block list, the mount flag, the starting address of each cylinder group, and much more.


Internal Cylinder Group View

Within each cylinder group, the following data structures exist: Data blocks The data blocks are where files are stored within the cylinder

group. The data blocks are distributed in such a way that a portion of the data blocks come before the cylinder group header structures and the rest come after the cylinder group header structures. This ensures that the cylinder group header structures are randomly placed throughout the cylinder groups.

Redundant Superblock A redundant copy of the primary superblock is contained

within each cylinder group. These redundant copies are kept to protect against the loss of the primary superblock. The locations of the redundant superblocks can be viewed by displaying the contents of the /etc/sbtab file. Should the primary superblock become lost or corrupted, the file system could still be recovered by executing the fsck command and specifying the location of one of the alternate superblocks.

Cylinder Group Header The cylinder group header contains the header information for

the cylinder group. This information includes the free blocks within the cylinder group, the starting addresses of the inode tables for that group, and a list of free inodes for the local inode table.

Inode Table The inode table contains all the inodes (file header structures)

for files located within the cylinder group. Every file within a file system is managed by an inode, which describes the attributes and location of the file. The inode table is divided into equal-sized sections and a section is stored in each cylinder group. Inodes within a cylinder group point to files usually contained within the same cylinder group.


9–2. SLIDE: Inode Structure

Student Notes

An inode contains all the header information for a particular file. Every file has a corresponding inode, usually located within the same cylinder group as the file. Fields contained within the inode include:

• File type
• File access permissions
• Number of hard links to the file
• Owner and group of the file
• Size of the file in bytes
• Time stamps (file access, file modification, inode changes)
• Data block pointers (direct and indirect)

NOTE: Although the size of the inode differs from one type of file system to another, the basic types of data contained are virtually the same; the main differences are in the data pointer structures.

Inode Structure

(Slide diagram: a cylinder group containing the redundant superblock, cylinder group header, inode table, and data blocks; an inode in the table points to a file in the data blocks. The inode itself holds the type, permissions, links, owner, group, size, Atime/Mtime/Ctime stamps, and the data block pointers.)


9–3. SLIDE: Inode Data Block Pointers

Student Notes

One of the structures within each HFS inode is the array of data block pointers that reference the data blocks within the file. The data block pointer array has 15 entries, meaning there are a maximum of 15 file system block addresses within the array. The first 12 addresses within the array are “direct access” addresses. The thirteenth entry is a “single indirection” block address, the fourteenth is a “double indirection” block address, and the fifteenth (and last) entry is a “triple indirection” block address.

Direct Access

A direct access address points directly to a file's data block. When accessing a file using a “direct access” address, a minimum of two logical I/Os are needed: one I/O to access the file's inode (containing the direct access address), and one I/O to access the file's corresponding data block.

Inode Data Block Pointers

(Slide diagram: three access paths. Direct access: the inode points straight at the data blocks, so 2 logical I/Os are needed to access each 8 KB of data. Single indirection: the inode points to an inode extension block, which points to the data blocks, for 3 logical I/Os per 8 KB. Double indirection: the inode points through two levels of inode extension blocks, for 4 logical I/Os per 8 KB.)

Single Indirection

Single indirection implies the address within the inode references a block on disk that acts as an inode extension block. The inode extension block, in turn, contains addresses that point to the file's corresponding data blocks. It should be noted that three logical I/Os are needed to access a file's data blocks using single indirection: one I/O for the file's inode, one I/O for the inode extension block, and one I/O for the data block itself.

Double Indirection

Double indirection means access to a file's data blocks requires going through two inode extension blocks. The first inode extension block references the address of a second inode extension block, which contains addresses referencing the file's data blocks. Double indirection is needed only for files above 16 MB (with a default block size of 8 KB). When accessing files requiring double indirection, a total of four logical I/Os are required: one for the file's inode, one for each of the two inode extension blocks, and one for the file's data block.

Triple Indirection

Triple indirection (not shown on the slide) adds one more level of indirection when accessing a file's data blocks. Triple indirection is only needed to access files larger than 32 GB (with a default block size of 8 KB).

NOTE: Every level of indirection adds an additional logical I/O when accessing the file's data. In the case of triple indirection, five logical I/Os are needed, compared to two I/Os for direct access data blocks.

As you can see, the performance of an HFS file system tends to favor small files (12 blocks or less) and tends to penalize large files that have to use single, double, or even triple indirection. You can delay this performance degradation somewhat by building the file system with larger block sizes. (More on that later in the module.)
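The file-size thresholds quoted above can be verified with a little arithmetic. The sketch below assumes 8 KB blocks and 4-byte block addresses (so one indirect block holds 2048 pointers); these are illustrative assumptions, not values read from a live file system:

```shell
# Maximum file size reachable at each level of indirection,
# assuming 8 KB blocks and 4-byte disk addresses (illustrative values).
BLK=8192                                # file system block size in bytes
PTRS=$((BLK / 4))                       # pointers per indirect block: 2048
direct=$((12 * BLK))                    # 12 direct pointers -> 96 KB
single=$((direct + PTRS * BLK))         # + one indirect block -> ~16 MB
double=$((single + PTRS * PTRS * BLK))  # + double indirection -> ~32 GB
echo "direct only:       $direct bytes"
echo "+ single indirect: $single bytes"
echo "+ double indirect: $double bytes"
```

The results line up with the text: double indirection is first needed just past 16 MB, and triple indirection just past 32 GB.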

9–4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd?

Student Notes

The above slide illustrates how a file within the HFS file system is accessed. It may surprise some people when they find out how many logical I/Os are needed to access the /etc/passwd file.

Starting from the Top

When the full pathname of a file is specified for access (as in /etc/passwd), the kernel starts with the only inode it knows: the inode of the root directory of the root file system. Inode number 2 is always the inode of the root directory of any file system. “/” symbolizes (in the kernel) the root directory of the root file system.

Using the slide as an example, after reading inode 2 of the root file system (first logical I/O), the kernel discovers that the contents of the root directory (the listing of the files contained in that directory) are located at file system block 74. Upon reading block 74 (second logical I/O), the names of the files in the root directory and their corresponding inode numbers are known. Directories are primarily listings of file names and the numbers of the inodes that manage them.

[Slide: How Many Logical I/Os Does It Take to Access /etc/passwd? Inode 2 ("/") points to directory block 74, which lists etc = inode 504; inode 504 (/etc) points to directory block 717, which lists passwd = inode 1824; inode 1824 (/etc/passwd) points to data block 2240, which holds the file's contents.]

From this information, the kernel discovers the inode for the etc directory (in “/”) is 504. Inode 504 is then read (third logical I/O), and from that the kernel learns the etc directory is located at file system block 717. Block 717 is read (fourth logical I/O), and the file names and inode numbers contained within that directory are now known. One of the entries within block 717 is the passwd file and its corresponding inode number, 1824. Inode 1824 is read (fifth logical I/O), and from this the kernel finally learns that block 2240 contains the contents of the /etc/passwd file. Block 2240 is read (sixth logical I/O), and the kernel finally has the data it set out to access.

So, the answer to the question at the top of the slide, “How many logical I/Os does it take to access /etc/passwd?” is . . . 6.
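The counting rule generalizes: each pathname level (the root directory plus every component) costs one inode read and one directory or data block read. A small sketch of that rule, purely illustrative and ignoring caching:

```shell
# Logical I/Os to resolve a full pathname on HFS:
# (inode read + block read) per level, root directory included.
path=/etc/passwd
# awk -F/ sees "", "etc", "passwd" -> NF=3 levels (root, etc, passwd)
levels=$(printf '%s\n' "$path" | awk -F/ '{ print NF }')
ios=$((2 * levels))
echo "$path needs $ios logical I/Os"
```

A path with four components, such as a hypothetical /opt/app/cfg/passwd, would cost ten logical I/Os by the same rule, which is why deep directory trees are discouraged later in this module.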

9–5. SLIDE: File System Blocks and Fragments

Student Notes

The concept of blocks and fragments was introduced when the HFS file system was designed. There is always a tradeoff when managing a resource based on a fixed allocation unit size (the file system "block" in this case). If the block size is large, fewer pointers are needed to manage the space (less system overhead); but if it is too large, the space can be utilized inefficiently (very small files still require a whole block). In the case of the HFS file system, this concern was addressed by making the block capable of uniform subdivision. The fragment was created for this purpose.

Definitions

Sector: A sector is the smallest unit of space addressable on the physical disk. The sector size is used when the disk is formatted to appropriately place timing markers on the platter. The default sector size for HP-UX and most UNIX systems is 512 bytes.

Fragment: A fragment is the increment in which space is allocated to files within the HFS file system. The default fragment size is 1 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 1 KB, 2 KB, 4 KB, and 8 KB.

[Slide: File System Blocks and Fragments. Two views of five 8 KB file system blocks, each divided into eight 1 KB fragments. Top half: the initial allocation of FileA (1 KB), FileB (2 KB), FileC (2 KB), FileD (4 KB), FileE (5 KB), and FileF (6 KB). Bottom half: the layout after FileA, FileC, and FileD each grow by one fragment.]

File System Block: A file system block is the minimum amount of data transferred to or from the disk when performing a disk I/O on an HFS file system. The default file system block size is 8 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 4 KB, 8 KB, 16 KB, 32 KB, and 64 KB.

Example — Top Half

The top half of the slide shows the allocation of disk space when the following six files are created (assuming only 5 file system blocks are free within the HFS file system).

File A (size 1 KB): The kernel searches for the first free fragment. On the slide, the first fragment in the first file system block is allocated.

File B (size 2 KB): The kernel searches for the first 2-KB contiguous fragment that is available. This is in the same file system block in which FileA was allocated. The fact that FileA has already been allocated in this file system block does not matter; multiple files can be allocated within the same file system block. The first basic rule is: best fit on "close".

File C (size 2 KB): The kernel searches for the first 2-KB contiguous fragment available. This is in the same file system block as FileA and FileB. Hence, FileC is allocated 2 KB from this same file system block. If any of these three files is accessed, all three files are read into the file system buffer cache as a single unit.

File D (size 4 KB): The kernel searches for the first four contiguous 1-KB fragments available (within the same file system block). This is in the second file system block. The kernel does not allocate 3 fragments from the first file system block and 1 fragment from the second, because that would require two logical I/Os to read in the entire 4 KB. This is inefficient, as only one I/O is required if the file is contained within a single file system block. The second basic rule is: if the size of a file is 8 KB or less, the kernel will fit the entire file within a single file system block.

File E (size 5 KB) and File F (size 6 KB): The kernel searches for the first available file system block that can hold the entire file. On the slide, FileE is allocated in file system block 3, and FileF is allocated in file system block 4.

Example — Bottom Half

The bottom half of the slide illustrates how the growth of three files affects allocation within the HFS file system.

FileA (1 KB -> 2 KB): When FileA grows, it cannot grow into the next fragment because FileB is occupying that spot. Therefore, the kernel relocates FileA to the first free 2 KB that is within the same file system block. (Why transfer another block into memory at this point?)

FileC (2 KB -> 3 KB): When FileC grows, it cannot grow into the next fragment, because FileA is now in that spot. Therefore, the kernel relocates FileC to the first free 3 KB that is in a different file system block (it can no longer fit into the first block). It selects block three because that block has a space exactly suited for the three-fragment FileC. The third basic rule is: if a file owns multiple fragments within the same block, they must be contiguous.

FileD (4 KB -> 5 KB): When FileD grows, it simply grows into the next fragment because it is still free.

9–6. SLIDE: Creating a New File on a Full File System

Student Notes

As an HFS file system becomes full, the performance impact of creating a new file becomes significant. This is due to the behavior of the kernel when creating a new file: when a new file is created on an HFS file system, the kernel tries to allocate a block-sized buffer in the buffer cache for the file to grow into. When the file is closed, the kernel allocates the file's fragments within an already allocated file system block, if possible.

FileG Is Created

In the example on the slide, FileG is opened/created as a new file. Not knowing the size to which FileG will grow, the kernel allocates a block-sized buffer in buffer cache for FileG to grow into. When FileG is closed, the kernel searches for a set of four contiguous 1KB fragments in a block. Since there are no shared blocks that have four contiguous fragments, the file is written to a new, empty block.

What Happens When FileH Is Created?

The impact of creating new files on a full file system can be seen when FileH is created. When FileH is opened for creation, the kernel allocates a block-sized buffer in buffer cache.

[Slide: Creating a New File on a Full File System. After FileA, FileC, and FileD have grown, a new 4 KB FileG is created; with no partially used block holding four contiguous free fragments, FileG occupies a new, empty block near the end of the disk. The slide then asks: what happens when a new 1 KB FileH is created?]

As it turns out, FileH is closed after writing only 1 KB worth of data. Upon closure, FileH is moved to file system block 1, first fragment.

NOTE: Performance on HFS file systems typically degrades when free space falls below 10%, due to the length of time it takes to find free file system blocks for new files. For this reason, it is recommended that MINFREE always be 10% or greater, even for large file systems (greater than 4 GB).

The fourth basic rule is: no fragment belonging to another file will be moved to make room for this file.

9–7. SLIDE: HFS Metrics to Monitor — Systemwide

Student Notes

When monitoring disk I/O activity, the main metrics to monitor are:

• Percent utilization of the file systems: As utilization of the file system increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the file system is 50% busy as it does when the file system is idle.

• Requests in the file system I/O queue: The number of requests in the file system I/O queue is one of the best indicators of a file system performance problem. If the average number of requests is three or greater, then requests are having to wait in the queue longer than the amount of time needed to service those requests.

• Amount of physical I/O: If the amount of file system activity is high, it is important to investigate on which file system the activity is occurring.

• File system free space: As an HFS file system becomes full (greater than 90%), it takes longer and longer to find an available free fragment for a new file or to grow an existing file. This creates additional disk activity, leading to slow file system performance.

Slide: HFS Metrics to Monitor — Systemwide
• Utilization of the file systems
• File system I/O queue lengths
• Amount of physical I/O to the file systems
• File system free space
• Open files for each process

• Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files being read or written should be inspected. For files receiving high I/O activity (press <CR> frequently, then watch how quickly each file's offset changes), consider relocating those files to other disks that are less busy.
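The 50%-busy rule of thumb quoted for the utilization metric comes from the course's queuing model, response time = service time / (1 - utilization). A quick check of that formula, using an assumed 10 ms average service time (an illustrative number, not a measured one):

```shell
# Queuing-theory check: response = service / (1 - utilization).
# A 10 ms service time is assumed for illustration; awk does the float math.
r50=$(awk 'BEGIN { printf "%.0f", 10 / (1 - 0.5) }')
r75=$(awk 'BEGIN { printf "%.0f", 10 / (1 - 0.75) }')
echo "10 ms service at 50% busy -> ${r50} ms response"
echo "10 ms service at 75% busy -> ${r75} ms response"
```

At 50% utilization the response time doubles, matching the claim above; at 75% it quadruples, which is why sustained high utilization shows up so quickly in the I/O queue length.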

9–8. SLIDE: Activities that Create a Large Amount of File System I/O

Student Notes

Common causes of disk-related performance problems are shown on the slide.

• A full file system causes excessive I/O due to the time spent locating free fragments.

• Long, inefficient PATH variables cause excessive directory I/O (especially when the command is found in the last directory within the PATH variable).

• Deep subdirectories cause lots of logical I/Os (two logical I/Os for each subdirectory in the full path name).

• Sequential file access, with a small file system block size, causes excessive amounts of physical I/O.

• Accessing lots of files on one file system, rather than spreading them across many, creates an imbalance of utilization. This leads to performance problems on the busy file system and underutilization of the others.

Slide: Activities that Create a Large Amount of File System I/O
• File writes on an almost full file system
• Long, inefficient PATH variables
• Deep subdirectory structures
• Accessing large files sequentially with a small READ block size
• Accessing many files on a single disk

9–9. SLIDE: HFS I/O Monitoring bdf Output

Student Notes

The bdf report shows how much file system space is being used (and how much is free) for all file systems currently mounted on the system. The key fields are:

avail: Indicates the amount of disk space available on the file system (in KB).
%used: Indicates the percentage of disk space used.

The slide shows three file systems at 90% usage or more, and one file system at 100% utilization. Recall that when an HFS file system becomes full, performance on that file system suffers due to fragments being moved.

The good news is that the amount of free space held back by the file system parameter MINFREE is already subtracted from these values. In fact, if you compare the kbytes, used, and avail columns, you will see that something is missing: used + avail does not add up to kbytes. The difference is MINFREE. For example, look at /stand. Clearly, 22403 + 20643 does not equal 47829. In fact, (22403 + 20643) divided by 47829 equals 90%, indicating that MINFREE must be set to 10% for this file system.
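The /stand arithmetic can be scripted. The numbers below are copied from the bdf output on the slide; awk is used for the floating-point division:

```shell
# Derive MINFREE for /stand from its bdf line (values from the slide).
kbytes=47829; used=22403; avail=20643
visible=$(awk -v k="$kbytes" -v u="$used" -v a="$avail" \
              'BEGIN { printf "%.0f", 100 * (u + a) / k }')
echo "used + avail = $((used + avail)) KB = ${visible}% of kbytes"
echo "implied MINFREE = $((100 - visible))%"
```

The same check can be run against any suspiciously full HFS file system to confirm how much space MINFREE is holding back.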

Slide: HFS I/O Monitoring bdf Output

# bdf
Filesystem          kbytes    used   avail %used Mounted on
/dev/root            81920   38018   40901   48% /
/dev/vg00/lvol1      47829   22403   20643   52% /stand
/dev/vg00/lvol6     286720  257116   28003   90% /usr
/dev/vg00/lvol4     360448  346127   13444   96% /opt
/dev/dsk/c0t4d0    1177626 1113204       0  100% /disk
/dev/vg00/lvol7     122880  102098   19257   84% /var
/dev/vg00/lvol5      53248   22589   28549   44% /tmp

9–10. SLIDE: HFS I/O Monitoring glance — File System I/O

Student Notes

The glance file system I/O report (i key) shows activity on a per-file-system basis. Only total I/O activity (not reads versus writes) is shown with this report. This report is similar to the logical volume report (discussed in the previous module), except that this report compares logical I/O against physical I/O and does not distinguish between read and write activity; the logical volume report compares reads against writes, but does not distinguish between logical and physical activity. From the report on the slide, we note that all the file system activity is being performed against one file system.

NOTE: The file system I/O report shows I/O activity for all types of mounted file systems, including CDFS file systems and NFS-mounted file systems.

Slide: HFS I/O Monitoring glance — File System I/O

B3692A GlancePlus B.10.12     06:39:52  e2403roc  9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util   |100%                                                100%  100%  100%
Disk Util  | 83%                                                 83%   22%   84%
Mem Util   | 94%                                                 94%   95%   96%
Swap Util  | 21%                                                 21%   21%   22%
--------------------------------------------------------------------------------
IO BY FILE SYSTEM                                               Users=    4
Idx File System   Device              Type      Logl IO        Phys IO
--------------------------------------------------------------------------------
  1 /             /dev/root           vxfs     0.3/  0.6      0.0/  0.0
  2 /stand        /dev/vg00/lvol1     hfs      0.0/  0.0      0.0/  0.0
  3 /var          /dev/vg00/lvol9     vxfs     1.0/  1.8      0.1/  0.3
  4 /usr          /dev/vg00/lvol8     vxfs     9.2/  2.8      1.5/  0.6
  5 /tmp          /dev/vg00/lvol7     vxfs     0.0/  0.0      0.1/  0.0
  6 /opt          /dev/vg00/lvol6     vxfs     0.0/  0.0      0.0/  0.0
  7 /home.lvol5   /dev/vg00/lvol5     vxfs     0.0/  0.0      0.0/  0.0
  8 /export       /dev/vg00/lvol4     vxfs     0.0/  0.0      0.0/  0.0
  9 /disk         /dev/vg01/lvol1     vxfs   463.8/ 86.4    105.8/ 20.1
 10 /cdrom        /dev/dsk/c1t2d0     cdfs     0.0/  0.0      0.0/  0.0
 11 /net          e2403roc:(pid604)   nfs      0.0/  0.0      0.0/  0.0

Top disk user: PID 3603, disc  104.0 IOs/sec              S - Select a Disk

9–11. SLIDE: HFS I/O Monitoring glance — File Opens per Process

Student Notes

The glance open files report (F key), available only from the select process report (s key), shows the names of files opened by the currently selected process. Sometimes the full path name of the file is shown; otherwise, the inode number and device name are shown, and you would have to translate that information into the filename.

NOTE: To determine the full pathname of a file, given its inode number and logical volume name, use the ncheck command:

ncheck -F vxfs -i [inode #] [device name]

Another way to determine the full pathname of a file, given its inode number and logical volume name, is to use the find command:

find [mountpoint of device] -inum [inode #] -xdev

Slide: HFS I/O Monitoring glance — File Opens per Process

B3692A GlancePlus B.10.12     06:44:39  e2403roc  9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util   |100%                                                100%  100%  100%
Disk Util  | 83%                                                 83%   22%   84%
Mem Util   | 94%                                                 94%   95%   96%
Swap Util  | 21%                                                 21%   21%   22%
--------------------------------------------------------------------------------
Open Files for PID: 3911, disc    PPID: 2410   euid: 0   User: root

                                                            Open   Open
FD  File Name                                      Type     Mode   Count  Offset
--------------------------------------------------------------------------------
 0  /dev/pts/1                                     chr     rd/wr      6  13582826
 1  /dev/pts/1                                     chr     rd/wr      6  13582826
 2  /dev/pts/1                                     chr     rd/wr      6  13582826
 3  <reg,vxfs,inode:3024,/...ol9,vnode:0x00f9e000> reg     read       1        85
 4  /stand/file5                                   reg     write      1     32768
10  /dev/null                                      chr     read       2         0

To determine whether I/O activity is occurring against a file, enter the open file report for a particular process, and press <CR> multiple times in succession. Watch the offset field for each file. If the offset field is constantly changing, it indicates the file is currently being accessed.

Performance Scenario

A system is experiencing slow performance due to high file system utilization. Upon further investigation, not all file systems are heavily utilized; in fact, some show no activity at all. By sorting the processes within glance by disk I/O activity, then selecting those processes to obtain further details, you can determine which files are getting the majority of the activity. To take advantage of an underutilized file system, move the heavily accessed files to that file system and create a symbolic link to each file from its original location. This removes a heavily accessed file from a busy file system and puts it on an underutilized one.
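The relocation trick at the end of the scenario looks like this in practice. The sketch below uses two temporary directories to stand in for the busy and idle file systems, and the file name is invented for illustration:

```shell
# Hypothetical relocation of a hot file to a quieter file system,
# leaving a symbolic link behind so applications keep working.
busy=$(mktemp -d)       # stands in for the overloaded file system
quiet=$(mktemp -d)      # stands in for the underutilized file system
echo "hot data" > "$busy/hot.dat"
mv "$busy/hot.dat" "$quiet/hot.dat"
ln -s "$quiet/hot.dat" "$busy/hot.dat"
cat "$busy/hot.dat"     # the old path still resolves, via the symlink
```

Applications continue to open the file at its original path; the symlink simply redirects the I/O to the quieter disk.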

9–12. SLIDE: Tuning a HFS I/O-Bound System — Tune Configuration for Workload

Student Notes

Every workload and every application is different. Each has different resource requirements, and each places different demands on the system. There is no one configuration that is optimal for all applications. For example, CAD/CAM applications stress memory (and graphics); accounting applications that do forecasting stress the CPU; NFS-based applications stress the disks (and the network); and RDBMS applications stress all resources.

File System Blocks and Fragments

Tips and notes for choosing file system block and fragment sizes follow:

Fragments

• Fragment sizes can be 1, 2, 4, or 8 KB.
• Fragments can be 1/8, 1/4, 1/2, or equal to the file system block size.
• For large files that are opened and closed frequently during their growth, large fragments are recommended.
• For file systems with lots of small files, small fragments are recommended.

Slide: Tuning an HFS I/O-Bound System — Tune Configuration for Workload

Tune the following parameters, based on workload:
• File system block and fragment sizes
• Blocks per cylinder group (maxbpg)
• File system mount options
• The mkfs options when creating the file system
• The tunefs options, to modify parameters on existing file systems

Tune other configurations, based on workload:
• Optimize $PATH variables
• Use flat directory structures when possible
• Ensure sufficient free space exists on file systems

File System Blocks

• File system block sizes can be 4, 8, 16, 32, or 64 KB.
• For file systems with large files, large file system blocks are recommended.
• For file systems with large files, increase maxbpg (maximum blocks per cylinder group).
• For applications that perform a lot of sequential I/O (with read-aheads and write-behinds), large file system blocks are recommended.

HFS Mount Options

The mount options affect performance by specifying when files on the file system are updated. These options can be specified in the options column of the /etc/fstab file. The HFS-specific mount options include:

behind: Enable, when possible, asynchronous writes to disk. This is the default for workstations. It does not use the sync daemon.

delayed: Enable delayed or buffered writes to disk. This is the default for servers. It does use the sync daemon.

fs_async: Enable relaxed (asynchronous) posting of file system metadata (changes to the superblocks, inodes, etc.). This option may improve file system performance, but increases exposure to file system corruption in the event of a power failure.

no_fs_async: Force rigorous (synchronous) posting of file system metadata to disk. This is the default.

mkfs Options

mkfs is usually not executed directly; it is called by newfs -F hfs instead. File system tuning is best accomplished when the file system is created. The workload for a file system should be well understood before serious attempts are made to tune one. Many options are also dependent on the type of physical device on which the file system is being created. The HFS-specific options include:

size: The size of the file system in DEV_BSIZE blocks (the default is the entire device).

largefiles: The maximum size of a file can be up to 128 GB.

nolargefiles: The maximum size of a file is limited to 2 GB.

ncpg: The number of cylinders per cylinder group (range 1-32; the default is 16).

minfree: The minimum percentage of free disk space reserved for non-root processes (the default is 10%). Beginning with HP-UX 10.20, the bdf command does not conceal this free space and as a result reports free disk space accurately. This means that a file system can no longer show 111% utilization.

nbpi: The number of bytes per inode. This value determines how many inodes are allocated for a file system of a given size. (The default is 6144.)

tunefs Options

Some parameters can be changed after the file system has been created, with tunefs(1M). These are minfree and maxbpg. minfree is explained above.

maxbpg: The maximum number of data blocks a single file can use within one cylinder group before it is forced to continue its growth in a different cylinder group. This value does not apply to any file whose size is 12 blocks or less.

tunefs can also be used to display the parameters of an HFS file system:

# tunefs -v /dev/…/…
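Pulling these options together, a hypothetical invocation might look like the following. The option letters shown (-b, -f, -m for newfs; -e, -v for tunefs) are the conventional HFS ones, and the device name is a placeholder; verify both against newfs_hfs(1M) and tunefs(1M) on your system before relying on them:

```shell
# Hypothetical: build an HFS file system for large, sequentially read files
# (64 KB blocks, 8 KB fragments, 10% minfree), then raise maxbpg afterward.
# The commands are shown as root-prompt examples, not executed here.
#   newfs -F hfs -b 65536 -f 8192 -m 10 /dev/vg00/rlvol3
#   tunefs -e 64 /dev/vg00/rlvol3     # more blocks per cylinder group per file
#   tunefs -v /dev/vg00/rlvol3        # display the resulting parameters
```

As the text notes, the block and fragment sizes can only be set at creation time, while minfree and maxbpg can be adjusted later with tunefs.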

Other Configurations

Optimize $PATH

The PATH variable in a user's environment specifies a list of directories to search when a command is entered. Having an excessive number of directories or duplicate directories to search can increase disk access, particularly when the user makes a mistake typing a command. This problem can be greatly exacerbated if the user's PATH variable contains directories that are mounted automatically with the NFS automount utility, causing the network mount of a file system because of a typographical error.

Use Flat Directory Structures

Long directory path names create more work for the system because each directory file and its associated inode entry require a disk I/O in order to bring them into memory. Recall that six logical I/Os were required to read the /etc/passwd file. Conversely, you don’t want thousands of files in the same directory, as it would take many I/O operations to read and search the directory.

Ensure Sufficient Freespace

As the file system becomes full (greater than 90%), the kernel begins to take longer and longer to find available free fragments. The algorithm gets very lengthy when the file system free space falls below 10%. Of course, if you do not have any files that grow and you are not adding any new files, this would waste 10% of your file system free space for no reason.

9–13. SLIDE: Tuning a HFS I/O-Bound System — Use Fast Links

Student Notes

There are two ways symbolic links can be stored on HFS file systems.

Standard Symbolic Links

Standard symbolic links are implemented in the same way as they are on other UNIX systems. The inode for the symbolic link points to a data block on disk, and the contents of the data block contain the name of the file being referenced by the symbolic link. In the example on the slide, /usr/data is the symbolic link, with an inode number of 12. Inode 12 contains an address pointer to data block 74, and data block 74 contains the name of the file being referenced (in the example, /data). Two logical I/Os are required to resolve the symbolic link: one I/O to retrieve the inode and one I/O to retrieve the data block containing the referenced name.

HP Fast Links

HP fast links allow symbolic links to be resolved with one logical I/O instead of two. HP fast links store the name of the referenced file in the inode of the symbolic link itself, rather than in a data block that the inode references. In the example, when the inode (12) of the symbolic link is retrieved, the contents of the inode contain the name of the referenced file.

[Slide: Tuning an HFS I/O-Bound System — Use Fast Links. For the link /usr/data -> /data, a standard symbolic link uses inode 12 plus data block 74, which holds the target name "/data"; an HP fast link stores the target name "/data" directly inside inode 12.]

HP fast links can be configured by setting the tunable OS parameter create_fastlinks to 1 and recompiling the kernel. Upon booting from the new kernel, all symbolic links created from then on will use HP fast links. No existing standard symbolic links will be automatically converted to fast symbolic links; they would have to be removed and then recreated to convert them. Fast symbolic links only work for link destinations that can be expressed in 59 characters or less, as this is the limit of the space within the inode where the fast link information is stored. If a symbolic link target contains more than 59 characters, it will be stored as a standard symbolic link, regardless of the value of create_fastlinks.
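The 59-character limit can be checked mechanically before creating a link. A small sketch (both example paths are invented for illustration):

```shell
# Classify a symlink target by where HFS would store it,
# using the 59-character in-inode limit described above.
link_type() {
    if [ ${#1} -le 59 ]; then
        echo "fast link"
    else
        echo "standard symbolic link"
    fi
}
link_type "/data"    # short target: fits in the inode itself
link_type "/opt/application/releases/2004.05/current/lib/shared/objects/libexample.sl"
```

Keeping link targets short (for example, by linking to a top-level mount point rather than a deep path) preserves the one-I/O benefit of fast links.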

Transition Links

Saving one logical I/O when accessing a symbolic link may not seem significant, until you consider that HP-UX makes heavy use of transition links (which are an implementation of symbolic links). Transition links allow an HP-UX file system to contain the older 9.x directory paths. The 9.x directory names are symbolic links that point to the correct, current location (for example, /bin -> /usr/bin). Many HP-UX installations have applications (including HP-UX applications) that rely on and make heavy use of transition links. A quick performance gain for all HP-UX systems is to convert these transition links from standard symbolic links to HP fast links. The procedure for making this conversion is:

1. Recompile the kernel to use HP fast links (that is, set create_fastlinks to 1).

2. Shut down and reboot the system.

3. Execute tlremove to remove all the transition links from the system. Over 500 links will be removed.

4. Execute tlinstall to reinstall (that is, recreate) the transition links. When the links are reinstalled, they will be created with HP fast links.


9–14. LAB: HFS Performance Issues

Directions

The following lab illustrates a number of performance issues related to HFS file systems.

1. A 512 MB HFS file system is required for this lab. Use the mount and bdf commands to determine if such a file system is available.

   # mount -v
   # bdf

   If there is no such HFS file system available, create one using the commands below:

   # lvcreate -n hfs vg00
   # lvextend -L 512 /dev/vg00/hfs /dev/dsk/cXtYdZ   (second disk)
   # newfs -F hfs /dev/vg00/rhfs
   # mkdir /hfs
   # mount /dev/vg00/hfs /hfs

2. Copy the lab files to the newly created HFS file system.

   # cp /home/h4262/disk/lab1/disk_long /hfs
   # cp /home/h4262/disk/lab1/make_files /hfs

   Next, execute the make_files program to create five 4-MB ASCII files.

   # cd /hfs
   # ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

   # cd /
   # umount /hfs
   # mount /dev/vg00/hfs /hfs
   # cd /hfs


4. Time how long it takes to read the files with the cat command. Record the results below:

   # timex cat file* > /dev/null

   real: _______ user: _______ sys: _______

5. In a second window start:

   # sar -d 5 200

   From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

   # timex ./disk_long

   • How busy did the disk get?
   • What was the average number of requests in the I/O queue?
   • What was the average wait time in the I/O queue?
   • How much real time did the task take?

6. Performance tuning: recreate the file system with larger fragment and file system block sizes.

   Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:

   # lvcreate -n custom-lv vg00
   # lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
   # newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
   # mkdir /cust-hfs
   # mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.

   # cp /hfs/disk_long /cust-hfs
   # cp /hfs/make_files /cust-hfs
   # cd /cust-hfs
   # ./make_files
   # cd /
   # umount /cust-hfs


   # mount /dev/vg00/custom-lv /cust-hfs
   # cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:

   # timex cat file* > /dev/null

   real: _______ user: _______ sys: _______

   How do the results of step 8 compare to the default HFS block and fragment results from step 4?
   _______________________________________________________________________

9. Performance tuning: change file system mount options. The manner in which the file system is mounted can impact performance. The fsasync mount option can improve performance, but metadata integrity is not as reliable in the event of a crash, and fsck could run into difficulties.

   # cd /
   # umount /hfs
   # mount -o fsasync /dev/vg00/hfs /hfs
   # cd /hfs

10. In a second window start:

   # sar -d 5 200

   From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

   # timex ./disk_long

   • How busy did the disk get?
   • What was the average number of requests in the I/O queue?
   • What was the average wait time in the I/O queue?
   • How much real time did the task take?

   How do the results of step 10 compare to the default mount options in step 5?
   _____________________________________________________________________


Module 10 VxFS Performance Issues

• Understand JFS structure and version differences

• Explain how to enhance JFS performance

• Set block sizes to improve performance

• Set Intent-Log size and rules to improve performance

• Understand and manipulate synchronous and asynchronous I/O

• Identify JFS tuning parameters

• Understand and control fragmentation issues

• Evaluate the overhead of online backup snapshots


10–1. SLIDE: Objectives

Student Notes

Upon completion of this module, you will be able to do the following:

• Understand JFS Structure and version differences

These course notes are based on the JFS Version 3.5 file system, built on Version 4 disk layout. The next few slides will describe the basic differences between versions and relate them to HP-UX releases. HP JFS 3.5 and HP OnlineJFS 3.5 are available for HP-UX 11i and later systems.

The standard (base) version of HP JFS has been bundled with HP-UX since release 10.01. The “advanced” HP OnlineJFS is a purchasable product with additional administrative features for higher availability and tunable performance. These notes will make clear which features belong to the base product and which belong to the OnlineJFS version.

The Operating Environment delivery model of HP-UX 11i includes JFS as follows:

   HP-UX 11i OE                    BaseJFS 3.3
   HP-UX 11i Enterprise OE         OnlineJFS 3.3
   HP-UX 11i Mission Critical OE   OnlineJFS 3.3

Objectives

Upon completion of this lesson, you will be able to:
• Understand JFS structure and version differences
• Explain how to enhance JFS performance
• Set block sizes to improve performance
• Set Intent-Log size and rules to improve performance
• Understand and manipulate synchronous and asynchronous I/O
• Identify JFS tuning parameters
• Understand and control fragmentation issues
• Evaluate the overhead of online backup snapshots


You can download JFS 3.5 for HP-UX 11i for free from the HP Software Depot (http://www.software.hp.com), or you can request a free JFS 3.5 CD from the Software Depot. You can purchase HP OnlineJFS 3.3 (product number B3929CA for servers and product number B5118CA for workstations) for HP-UX 11.0 or HP-UX 11i from your HP sales representative. JFS 3.5 is included with HP-UX 11i systems.

• Explain how to enhance JFS performance

  The HFS file system uses block-based allocation schemes, which provide adequate random access and latency for small files but limit throughput for larger files. As a result, the HFS file system is less than optimal for commercial environments. VxFS addresses this file system performance issue through an alternative allocation scheme and increased user control over allocation, I/O, and caching policies.

• Set block sizes to improve performance

  It is often advantageous to match the block size of a file system to the I/O size of the application. We will show you how!

• Set Intent Log size to improve performance

  The JFS intent log provides for rapid fsck recovery after a system crash. In general, the intent log is not protecting your data; the focus is on structural integrity, not data integrity! Fast fsck comes at a price, and that price is performance. Setting the correct intent log size is important, as it cannot be changed once a file system is created.

• Understand and manipulate synchronous and asynchronous I/O

  Programmers and database providers do different types of I/O to obtain the best possible balance between data integrity and performance. We will investigate all the "gray" areas and tune the JFS file system to meet our administrative and performance goals, which might be quite different from those of the programmer!

• Identify JFS tuning parameters

  The JFS is tunable through mount options, the command line, configuration files, and kernel parameters. We will learn where and how to tune.

• Understand and control fragmentation issues

  The extent-based file allocation design of JFS is ideal for the performance of large files. One weakness of this approach is the potential fragmentation of files and free space over the life of the file system. In general this will only occur in dynamic, work-file-oriented JFS file systems (e.g., a mail server) and is unlikely in fixed "large file" file systems where major I/O rates occur to static files (e.g., a database). We will investigate ways of measuring and fixing fragmentation.


• Evaluate the overhead of online backup snapshots

  OnlineJFS supports online backups via snapshot mounts. We will discuss the performance issues involved when working with snapshots.


10–2. SLIDE: JFS History and Version Review

Student Notes

The HP-UX Journaled File System (JFS) was introduced by HP in August 1995, with the HP-UX 10.01 release. The journaled file system attempts to improve on the High Performance File System (HFS) by offering the following enhancements:

• Extent-based allocation of disk space

• Fast file system recovery through an Intent Log

• Greater control and flexibility of file system behavior through new mount options and tunable options.

JFS History and Version Review

• JFS introduced in 1995 with HP-UX 10.01
• Version 2 structure at introduction
• Version 3 structure at 10.20 allows 1 TB files
• Version 4 structure at 11.00 allows more tunable controls and supports ACLs
• Do not use V4 structure on 11.00 for /, /usr, /opt, /var
• vxupgrade(1M) tool can migrate up through versions (not down!)
• 11i delivers JFS 3.5 software on V4 structure
• Differences between Base JFS 3.5 and OnlineJFS 3.5


Disk Layout Versions

Version 1

The Version 1 disk layout was never used in HP-UX.

Version 2

The Version 2 disk layout has the following changes and features:

• Many internal JFS structures are dynamic files themselves.
• Internal "filesets" separate data files (User Fileset) from structural files (Structural Fileset).
• Allocation units now contain data and data map structures only; inode tables are elsewhere.
• Inode allocation is dynamic and cannot run out.
• Optional support for quotas.

Version 3

The Version 3 disk layout offers additional support for:

• Files up to one terabyte
• File systems up to one terabyte
• Indirect inode extent maps that can address variable-length file extents (Version 2 restricts all indirect extents to the size of the first indirect extent); hence large files and sparse files are possible with less overhead.

Version 4

Version 4 is the latest disk layout:

• The Version 4 disk layout supports Access Control Lists.
• The Version 4 disk layout does not include significant physical changes from the Version 3 disk layout. Instead, the policies implemented for Version 4 are different, allowing for performance improvements, file system shrinking, and other enhancements.
• HP-UX 11i with the Version 4 layout supports both files and file systems up to 2 TB in size.

Table: Matching HP-UX version to JFS version

   HP-UX Release        VxFS Version   Supported Disk Layouts   Default Disk Layout
   10.10                2.3            2                        2
   10.20                3.0            2,3                      3
   11.00 with JFS 3.1   3.1            2,3                      3
   11.00 with JFS 3.3   3.3            2,3,4                    3
   11i v1               3.3            2,3,4                    4
   11i v2               3.5            2,3,4                    4

vxupgrade(1M)

The vxupgrade command can upgrade an existing Version 3 VxFS file system to the Version 4 layout while the file system remains online. vxupgrade can also upgrade a Version 2 file system to the Version 3 layout. See vxupgrade(1M) for details on upgrading VxFS file systems. You cannot downgrade a file system that has been upgraded.
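The migration rules just described (V2 to V3, V3 to V4, never downward) can be captured in a tiny sketch. This is a toy model of the constraints stated above, not the vxupgrade implementation; the function name is invented:

```python
# Toy model of the vxupgrade constraint described in the notes:
# disk layouts can be upgraded one version at a time (2 -> 3, 3 -> 4)
# while the file system stays online, and never downgraded.

def can_upgrade(current: int, target: int) -> bool:
    """True if vxupgrade supports moving from `current` to `target` layout."""
    return current in (2, 3) and target == current + 1

print(can_upgrade(3, 4))  # True: V3 can go to V4 online
print(can_upgrade(2, 3))  # True: V2 can go to V3
print(can_upgrade(4, 3))  # False: no downgrade path exists
```

A V2 file system therefore needs two passes of vxupgrade (2 to 3, then 3 to 4) to reach the latest layout, per the supported paths listed above.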


NOTE: You cannot upgrade the root (/) or /usr file systems to Version 4 on an 11.00 system running JFS 3.3. Additionally, we do not advise upgrading the /var or /opt file systems to Version 4 on an 11.00 system. These core file systems are crucial for system recovery. The HP-UX 11.00 kernel and emergency recovery media were built with an older version of JFS that does not recognize the Version 4 disk layout. If these file systems were upgraded to Version 4, your system might have errors booting with the 11.00 kernel as delivered, or booting with the emergency recovery media.

Comparing Base and Advanced JFS

Table: Comparing Base and OnlineJFS

   Feature                                                  JFS 3.5   OnlineJFS 3.5
   extent-based allocation                                  *         *
   extent attributes                                        *         *
   fast file system recovery                                *         *
   access control list (ACL) support                        *         *
   enhanced application interface                           *         *
   enhanced mount options                                   *         *
   improved synchronous write performance                   *         *
   support for large files (up to two terabytes)            *         *
   support for large file systems (up to two terabytes)     *         *
   enhanced I/O performance                                 *         *
   support for BSD-style quotas                             *         *
   unlimited number of inodes                               *         *
   file system tuning [vxtunefs(1M)]                        *         *
   online administration                                              *
   ability to reserve space for a file and set fixed
   extent sizes and allocation flags                                  *
   online snapshot file system for backup                             *
   direct I/O, supporting improved database performance               *
   data synchronous I/O                                               *
   DMAPI (Data Management API)                                        *

How to tell if JFS 3.5 is installed

To determine if a vmunix file has JFS 3.5 compiled into it, you can run:

   # what /stand/vmunix | grep libvxfs.a

or

   # nm /stand/vmunix | grep vx_work

If you get output from either of these two commands, then the vmunix file has JFS 3.5 compiled into it, for example:


# what /stand/vmunix | grep libvxfs.a
$Revision: libvxfs.a: CUPI80_BL2000_1108_2 Wed Nov 8 10:59:22 PST 2000 $
# nm /stand/vmunix | grep vx_work
[13585] |  9746968|   8|OBJT |LOCAL|0| .rodata|S$704$vx_worklist_gettag
[13587] |  9746976|   8|OBJT |LOCAL|0| .rodata|S$705$vx_worklist_enqueue
[13589] |  9746984|   8|OBJT |LOCAL|0| .rodata|S$706$vx_worklist_thread
[13591] |  9746992|   8|OBJT |LOCAL|0| .rodata|S$707$vx_worklist_process
[13593] |  9747000|   8|OBJT |LOCAL|0| .rodata|S$708$vx_workthread_set
[34118] |   991200| 232|FUNC |GLOB |0| .text|vx_worklist_enqueue
[27664] |  1940288|  96|FUNC |GLOB |0| .text|vx_worklist_gettag
[23820] | 13229528|  40|OBJT |GLOB |0| .bss|vx_worklist_high
[22805] | 13182888|  16|OBJT |GLOB |0| .bss|vx_worklist_lk
[36804] | 13762256|  40|OBJT |GLOB |0| .bss|vx_worklist_low
[39078] |  1744792| 436|FUNC |GLOB |0| .text|vx_worklist_process
[33997] |  1745344| 196|FUNC |GLOB |0| .text|vx_worklist_thread
[23090] | 12350056|   8|OBJT |GLOB |0| .sbss|vx_worklist_thread_sv
[36954] |  1745232|  84|FUNC |GLOB |0| .text|vx_worklist_wakeup
[31238] |  2034928|  48|FUNC |GLOB |0| .text|vx_workthread_create
[13579] |  7215680| 232|FUNC |LOCAL|0| .text|vx_workthread_set


10–3. SLIDE: JFS Extents

Student Notes

JFS allocates space to files in the form of extents: adjacent blocks of disk space treated as a unit. Extents can vary in size from a single block (minimum 1 KB) to many megabytes. Organizing file storage in this manner allows JFS to better support large I/O requests, with more efficient reading and writing of contiguous disk areas.

JFS extents are represented by a starting block number and a block count. In the example on the slide, the first extent starts at block 40 and has a length of 128 blocks (or 128 KB, assuming blocks are 1 KB in size). When the file grew past 128 KB, JFS tried to increase the size of the last extent. Since another file was already occupying this location, a new extent was allocated, starting at block 200. This extent grew to a size of 64 KB before encountering another file. At this point, a third extent was allocated at block 8. Initially, 8 KB were allocated to the third extent, but upon closing the file, any space not used by the last extent is returned to the operating system. Since only 5 KB were used, the extra 3 KB were returned.
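The slide's example (extents of start/length 40/128, 200/64, and 8/5) can be modeled with a short sketch. The extent values come from the slide; the helper function names are invented for illustration:

```python
# Model a JFS-style extent map as (start_block, length) pairs,
# using the slide's example file: 128 KB + 64 KB + 5 KB, with 1 KB blocks.
extents = [(40, 128), (200, 64), (8, 5)]

def file_size_blocks(extent_map):
    """Total file size in blocks: the sum of all extent lengths."""
    return sum(length for _, length in extent_map)

def logical_to_physical(extent_map, logical_block):
    """Map a logical block offset within the file to a physical disk block."""
    for start, length in extent_map:
        if logical_block < length:
            return start + logical_block
        logical_block -= length
    raise ValueError("offset beyond end of file")

print(file_size_blocks(extents))          # 197 blocks = 197 KB
print(logical_to_physical(extents, 0))    # first block -> 40
print(logical_to_physical(extents, 130))  # falls in second extent -> 202
print(logical_to_physical(extents, 196))  # last block -> 8 + 4 = 12
```

The point of the structure is visible in the mapping function: a three-entry table describes 197 blocks, where an HFS-style block map would need 197 individual pointers.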

JFS Extents

[Slide diagram: a JFS inode's data pointers hold (start, length) pairs: Extent 1 = (40, 128), Extent 2 = (200, 64), Extent 3 = (8, 5). Each extent is a contiguous run of disk blocks, separated on disk by blocks belonging to different files.]


Direct and Indirect Extents in Version 2 Disk Layout

Unlike the HFS inode, the VxFS inode is 184 (rather than 128) bytes long and contains direct and indirect pointers. In the HFS inode, the pointers address data blocks (8 KB by default), with 12 direct pointers and 3 additional indirect pointers for single, double, and triple indirection. In reality, triple indirection is rarely needed. Mapping large files in HFS is complex due to the levels of indirection needed to address many 8 KB blocks. The JFS (VxFS) inode has 10 direct pointers and three additional pointers for single, double, and triple indirect addressing. The pointers no longer address single blocks of data but rather large extents of data. It is unlikely that any indirect pointers will be needed at all, as the 10 direct pointers can define large spaces due to the variable length of the extents themselves.

Version 3 and Version 4 Extent Mapping: "Typed Extents"

The above discussion is true only for the Version 2 disk layout. In addition, V3/V4 also have "typed extents," which allow any level of indirection, so very large files can be built from many small extents if required (this is not desirable, however!). Version 2 also imposes the limit that all indirect extents be the same size (direct extents can vary in length); V3/V4 can mix indirect extents of any size. V3/V4 always attempt the simplest approach, using the 10 direct pointers when possible. Inodes are converted to typed indirect only when the file exceeds the capability of 10 direct extents.


10–4. SLIDE: Extent Allocation Policies

Student Notes

Disk Space Allocation: The Block Size

Disk space is allocated by the system in 1024-byte device blocks (DEV_BSIZE). An integral number of device blocks are grouped together to form a file system block. VxFS supports file system block sizes of 1024, 2048, 4096, and 8192 bytes. The default block size is:

• 1024 bytes for file systems less than 8 gigabytes
• 2048 bytes for file systems less than 16 gigabytes
• 4096 bytes for file systems less than 32 gigabytes
• 8192 bytes for file systems 32 gigabytes or larger

The block size may be specified as an argument to the mkfs or newfs utility and may vary between VxFS file systems mounted on the same system. VxFS allocates disk space to files in extents. An extent is a set of contiguous blocks (up to 2048 blocks in size).
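The default-selection rule above can be expressed as a small lookup. This is a sketch of the documented rule, not VxFS source; the function name is invented:

```python
# Default VxFS block size chosen by file system size, per the rule above.
GB = 1024 ** 3

def default_block_size(fs_bytes: int) -> int:
    """Return the default VxFS file system block size in bytes."""
    if fs_bytes < 8 * GB:
        return 1024
    if fs_bytes < 16 * GB:
        return 2048
    if fs_bytes < 32 * GB:
        return 4096
    return 8192

print(default_block_size(512 * 1024 ** 2))  # 512 MB -> 1024
print(default_block_size(20 * GB))          # 20 GB  -> 4096
print(default_block_size(64 * GB))          # 64 GB  -> 8192
```

Overriding this default with newfs -b is the tuning lever discussed in these notes: match the block size to the application's typical I/O size rather than to the file system's capacity alone.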

Extent Allocation Policies

• Disk space allocation: the Block Size can be 1K, 2K, 4K, 8K.

• Extents are predefined in free space - “Power of 2 Rule”

• Preferred allocation rules

• Largest single extent is 16MB (with 8K block size).

• Full use of Single Indirection in default HFS would also be 16MB

• VxFS supports large files without indirection.


Extents in Free Space - “Power of 2 Rule”

Free space is described by bitmaps in each allocation unit. The allocation units are split into 16 “sections”. Each section has a series of bitmaps that represent all the possible extents with sizes from 1 block to 2048 blocks by powers of 2. The first bitmap represents all the blocks in the section as one block extents, the second as two block extents, the third as four block extents, etc. The first bitmap, of 2048 bits, represents the section as 2048 one-block extents. The second bitmap, of 1024 bits, represents the section as 1024 two-block extents. This continues for all powers of 2 up to the single bit that represents one 2048 block extent. The file system uses this bitmapping scheme to find an available extent closest in size to the space required. This keeps files as contiguous as possible for faster performance. The largest possible extent on a file in a VxFS file system (with the largest block size of 8 KB) is 2048 * 8 KB = 16 MB.
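The bitmap sizes implied by this scheme can be checked with a few lines. This is an illustration of the arithmetic above, not VxFS internals:

```python
# Each allocation-unit section covers 2048 blocks. One bitmap exists per
# power-of-2 extent size; a bitmap for extents of 2**k blocks needs
# 2048 / 2**k bits to cover the whole section.
SECTION_BLOCKS = 2048

bitmaps = {2 ** k: SECTION_BLOCKS // 2 ** k for k in range(12)}  # 1..2048-block extents
print(bitmaps[1])     # 2048 bits: the section viewed as one-block extents
print(bitmaps[2])     # 1024 bits: two-block extents
print(bitmaps[2048])  # 1 bit: a single 2048-block extent

# Largest single extent with the largest (8 KB) block size:
print(SECTION_BLOCKS * 8 * 1024)  # 16 MB (16777216 bytes)
```

The best-fit search the notes describe simply walks these bitmaps: for a request of n blocks, the file system consults the bitmap for the smallest power of 2 that is at least n, keeping allocations as contiguous as possible.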

Preferred Allocation

The following rules are satisfied wherever possible, starting with the preferred rules at the top and working down to less preferred rules:

• Allocate files using a contiguous extent of blocks.
• Attempt to allocate each file in one extent of blocks.
• If not possible, attempt to allocate all extents for a file close to each other.
• If possible, attempt to allocate all extents for a file in the same allocation unit.

An allocation unit is an amount of contiguous (and therefore close together) file system space equal to 32 MB in size. It is roughly analogous to the HFS cylinder group, but is not dependent on the geometry of the disk drive in any way.


10–5. SLIDE: JFS Intent Log

Student Notes

A key advantage of JFS is that all file system transactions are written to an Intent Log. The logging of file system transactions helps to ensure the integrity of the file system, and allows the file system to be recovered quickly in the event of a system crash.

How the Intent Log Works

When a change is made to a file within the file system, such as a new file being created, a file being deleted, or a file being updated, a number of updates must be made to the superblock, inode table, bitmaps, and other structures for that file system. These changes are called metadata updates. Typically, multiple metadata updates take place every time a change is made to a file. With JFS, after every successful file change (also called a transaction), all the metadata updates related to that transaction get written out to a JFS Intent Log. The purpose of the Intent Log is to hold all completed transactions that have not yet been flushed out to disk. If the system were to crash, the file system could quickly be recovered by checking the file system and applying all transactions in the intent log. Since only completed transactions are logged, there is no risk of a file change being only partially updated (i.e., only some metadata updates related to a transaction being logged, and other metadata updates related to the same transaction not being logged). The logging of only COMPLETED transactions prevents the file system from being out of sync due to a crash occurring in the middle of a transaction. Either the entire transaction is logged or none of it is. This allows the JFS intent log to be used in a recovery situation in place of a standard fsck. The JFS recovery is done in seconds, as opposed to a standard fsck that (on a big file system) could take minutes, or even hours.

JFS Intent Log

[Slide diagram: a timeline of file transactions, with symbols marking metadata updates in memory (i.e., superblock or inode table updates), JFS Intent Log writes, syncs, and a system crash.]

Example

Using the example on the slide, assume that each file transaction requires from one to four metadata updates. After each successful file transaction, all the related metadata updates are written to the JFS intent log. After 30 seconds, all the metadata updates are written out to disk by the sync daemon, and a corresponding DONE record is written to the JFS intent log for each JFS transaction that was flushed during the sync. The system can now reuse that space in the JFS intent log for new JFS transactions. When a crash occurs (in our example, in the middle of a file transaction), the uncompleted transaction never has any metadata written to the JFS intent log; therefore only one transaction is in the JFS intent log since the last sync. Only this transaction needs to be redone and then the file system is recovered and in a stable state. Compare this with having to do a standard fsck.

Performance Impacts

The intent log size is chosen when a file system is created and cannot subsequently be changed. The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server, for intensive synchronous write workloads, or for dynamic "work file" loads with many metadata changes, performance may be improved by using a larger log size.

File data is not normally written to the intent log. However, if the application has requested synchronous writes and the writes are 32 KB or smaller, the file data will be written to the intent log along with the metadata. This behavior can be modified by mount options (discussed later in this module).

With larger intent log sizes, recovery time is proportionately longer and the file system may consume more system resources (such as memory) during normal operation. There are several system performance benchmark suites for which VxFS performs better with larger log sizes. As with block sizes, the best way to pick the log size is to try representative system loads against various sizes and pick the fastest.

Performance degradation occurs when the entire JFS intent log becomes filled with pending JFS transactions. In these situations, all new JFS transactions must wait for DONE records to arrive for the existing JFS transactions. Once the DONE records arrive, the space used by the corresponding transactions can be freed and reused for new transactions. Having to wait for DONE records can significantly decrease JFS performance. In these cases, it is suggested that the JFS file system be reinitialized with a larger JFS intent log.


CAUTION: Network file systems (NFS) can generate a large number of metadata updates if accessed concurrently by multiple systems. For JFS file systems being exported for network access via NFS, it is strongly recommended that these file systems have an intent log size of 16 MB (the maximum size for the intent log).


10–6. SLIDE: Intent Log Data Flow

Student Notes

The following slide shows a graphical representation of how JFS transactions are processed. A system call is issued (for example, a write call):

1. All in-memory data structures related to the transaction are updated. These in-memory structures include the superblock, the inode table, and the bitmaps.

2. Once the in-memory structures are updated, a JFS transaction is packaged containing the modifications to the in-memory structures. This packaged transaction contains all the data needed to reproduce the transaction (should that be necessary).

3. Once the JFS transaction is created, it is written to the intent log. (When it is written depends on mount options.) At this point, control is returned to the system call.

4. Since the transaction is now stored on disk (in the intent log), there is no hurry to flush the in-memory data structures to their corresponding disk-based data structures.

Intent Log Data Flow

[Slide diagram: a process updates the superblock, inodes, and bitmaps in memory (step 1); a JFS transaction is packaged (step 2) and written to the on-disk intent log (step 3); the in-memory structures pass through the buffer cache to the on-disk allocation units (step 4); a DONE record is then written to the intent log (step 5).]


   Therefore, the in-memory structures are transferred to the buffer cache, and the sync daemon flushes them out within the next 30 seconds.

5. After the metadata structures are flushed out, a DONE record is written to the intent log, indicating the transaction has been updated on disk and the corresponding transaction no longer needs to be kept in the intent log.


10–7. SLIDE: Understand Your I/O Workload

Student Notes

Understand your I/O Workload

Tuning the file system’s parameters to optimize performance can only be done effectively when you know what type of I/Os the application is doing. It would be wrong to tune for large block size and maximum contiguous space allocation if the application does many small random I/Os to many small files.

Data Intensive?

Commercial database applications generally deal with very large files in the table space and large I/Os to those files. Any high degree of small random I/O should be taken care of by the database's own buffers (System Global Area) and the HP-UX buffer cache (if it is being used). We may choose to increase the block size in this situation and tune for maximum read-ahead/write-behind. The following slides will cover this type of tuning.

Understand your I/O Workload

• Is it data-intensive?
  – few files, large chunks being shuffled around
• Is it attribute-intensive?
  – many files, small chunks being shuffled
• Is the access pattern random or sequential I/O?
  – Check for read(), write(), and lseek() system calls
• What is the bandwidth and size of the I/Os?
  – Are these consistent?
  – Spindles Win Prizes! LVM or VxVM stripes
  – Use XP disk arrays


Attribute Intensive?

Some applications generate many small I/Os to many small files. In this situation a large block size and maximum read ahead/write behind would be inappropriate, generating more I/O than is necessary. A Mail Server or Web Server could be regarded as such an application.

Sequential or Random I/O?

We need to characterize the I/O from an application as sequential or random. In general, sequential I/O will benefit from a larger block size and contiguous files, while random I/O will require a smaller block size to increase the number of blocks that can be maintained in the buffer cache. With sequential I/O we are more interested in maximizing the MB/s throughput of the disk (as seen with sar or glance, etc.). With random I/O we will be looking at the "I/Os per second" metrics associated with the disk (r+w/s in sar). Remember that the fastest random I/Os we do are the ones that never go to the disk(!) because they are in the buffer cache (we hope).

The Direct I/O feature of OnlineJFS 3.5 attempts to recognize when I/Os to a file are very large and sequential. Direct I/O will then attempt to bypass the buffer cache to benefit the generator of the large I/Os in question. Most applications are not designed to handle their own buffering and will lose a great deal of performance if they attempt to use Direct I/O.

Disk Bandwidth

In the end we can only get so much performance out of a single spindle. Modern fast disks (10,000+ RPM, 5 ms access time) can only provide an absolute maximum of approximately 10 MB/s for purely sequential I/O and around 150 I/Os per second. Once your file system is extracting these sorts of numbers (or even 50% of them!), you can consider that the hardware has become the limiting factor. Stop tuning and buy more disks! Remember that "spindles win prizes". LVM or VxVM striping will help in this situation, as single-spindle performance is aggregated across the number of spindles. Using expensive RAID technology such as the HP XP256, XP512, or XP1024 disk arrays will also improve apparent "spindle" performance. The author has seen a single XP512 logical device provide a sustained 60 MB/s read performance for sequential I/O, and over 1500 I/Os per second for a single-threaded random application test to a single logical device.


10–8. SLIDE: Performance Parameters

Student Notes

We will discuss the following choices over the next slides. Note that some parameters can only be set when the file system is created.

• At file system creation time (only):

− Choosing a Block Size

− Choosing an Intent Log Size

• After file system creation:

− Choosing Mount Options

− Kernel Tunables

− Kernel Inode Table Size

− Monitoring Free Space and Fragmentation

− Changing extent attributes on individual files

− I/O Tuning

− Tunable VxFS I/O Parameters

Performance Parameters

Things that an administrator can change to optimize JFS:

• Choosing a Block Size
• Choosing an Intent Log Size
• Choosing Mount Options
• Kernel Tunables
  – Internal Inode Table Size
• Monitoring Free Space and Fragmentation
• Changing extent attributes on individual files
• I/O Tuning
  – Tunable VxFS I/O Parameters
  – Command Line
  – Configuration file (/etc/vx/tunefstab)


10–9. SLIDE: Choosing a Block Size

Student Notes

You specify the block size when a file system is created; it cannot be changed later. The standard HFS file system defaults to a block size of 8K with a 1K fragment size. This means that space is allocated to small files (up to 12 blocks) in 1K increments. Allocations for larger files are done in 8K increments. Because many files are small, the fragment facility saves a large amount of space compared to allocating space 8K at a time.

The unit of allocation in VxFS is a block. There are no fragments, because storage is allocated in extents that consist of one or more blocks. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on devices of less than 8 gigabytes.

Choose a block size based on the type of application being run. For example, if there are many small files, a 1K block size may save space. For large file systems with relatively few files, a larger block size is more appropriate. The trade-offs of specifying larger block sizes are: 1) a decrease in the amount of space used to hold the free extent bitmaps for each allocation unit, 2) an increase in the maximum extent size, and 3) a decrease in the number of extents used per file, versus an increase in the amount of space wasted at the end of files that are not a multiple of the block size.

Choosing a Block Size

• Choose the right block size for the application.
• Consider the maximum block size (8K) for a large-file database
  – Small files will waste space
  – System overhead will be less
  – Files approaching 1 GB are "large"
• Consider the minimum block size (1K) for a small-file mail server or web server
  – More system overhead if files are large
• Use a large block size for a sequential I/O application
• Use a small block size for a random I/O application


Larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size. The easiest way to judge which block sizes provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.

Specifying the Block Size

The following newfs command creates a VxFS file system with the maximum block size and support for large files.

# newfs -F vxfs -b 8192 -o largefiles /dev/vgjfs/rlvol1

The block size for files on the file system represents the smallest amount of disk space that can be allocated to a file. It must be a power of 2 selected from the range 1024 to 8192. The default is 1024 for file systems less than 8 gigabytes, 2048 for file systems less than 16 gigabytes, 4096 for file systems less than 32 gigabytes, and 8192 for larger file systems.
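The size-based defaults described above can be sketched as a small helper (illustrative only: `vxfs_default_bsize` is not an HP-UX command; it takes the file system size in megabytes):

```shell
# Sketch of the default VxFS block-size rule quoted above.
# vxfs_default_bsize is a hypothetical helper; argument is size in MB.
vxfs_default_bsize() {
    size_mb=$1
    if [ "$size_mb" -lt 8192 ]; then       # under 8 GB
        echo 1024
    elif [ "$size_mb" -lt 16384 ]; then    # under 16 GB
        echo 2048
    elif [ "$size_mb" -lt 32768 ]; then    # under 32 GB
        echo 4096
    else                                   # 32 GB and up
        echo 8192
    fi
}

vxfs_default_bsize 20000    # a 20 GB file system defaults to a 4096-byte block
```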


10–10. SLIDE: Choosing an Intent Log Size

Student Notes

The intent log size is chosen when a file system is created and cannot be changed afterwards. The default intent log size chosen by mkfs is 1024 blocks and is suitable in most situations. For some types of applications (an NFS server, or intensive synchronous write loads), performance may be improved by increasing the size of the intent log. Note that recovery time will also be proportionally longer as the log size increases, and memory requirements for log maintenance will increase as well. Ensure that the log size is not more than 50% of the physical memory size of the system, or fsck will not be able to fix it after a system crash. The ideal log size for NFS is 2048 with a file system block size of 8192.

Choosing an Intent Log Size

• Intent log size cannot be changed after file system creation
• mkfs applies a default log size of 1024 blocks
• Performance may improve with a larger log size
  – An NFS server will benefit from a 16 MB (largest) log size
  – Synchronous write-intensive applications


Specifying the Intent Log Size

To create a VxFS file system with a default block size and a 16MB intent log:

# newfs -F vxfs -o logsize=16384 /dev/vgjfs/rlvol1

"-o logsize=" specifies the number of file system blocks to allocate for the transaction-logging area. It must be in the range of 32 to 16384 blocks. The minimum number for Version 2 disk layouts is 32 blocks. The minimum number for Version 3 and Version 4 disk layouts is the number of blocks that make the log no less than 256K. If the file system is:

• greater than or equal to 8 MB, the default is 1024 blocks
• greater than or equal to 2 MB, and less than 8 MB, the default is 128 blocks
• less than 2 MB, the default is 32 blocks

While logsize is specified in blocks, the maximum size of the intent log is 16384 KB. This means the maximum values for logsize are:

• 16384 for a block size of 1024 bytes
• 8192 for a block size of 2048 bytes
• 4096 for a block size of 4096 bytes
• 2048 for a block size of 8192 bytes
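The defaults and the 16384 KB cap can be sketched numerically (illustrative helpers, not HP-UX commands):

```shell
# Default intent log size in blocks, from the file system size in MB,
# per the rules above (hypothetical helper).
vxfs_default_logsize() {
    fs_mb=$1
    if [ "$fs_mb" -ge 8 ]; then
        echo 1024
    elif [ "$fs_mb" -ge 2 ]; then
        echo 128
    else
        echo 32
    fi
}

# Maximum logsize in blocks: the log itself may not exceed 16384 KB.
vxfs_max_logsize() {
    bsize_bytes=$1
    echo $((16384 * 1024 / bsize_bytes))
}

vxfs_max_logsize 8192    # an 8 KB block size allows at most 2048 log blocks
```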


10–11. SLIDE: Intent Log Mount Options

Student Notes

JFS offers mount options to delay or disable transaction logging to the intent log. This allows the system administrator to make trade-offs between file system integrity and performance. Following are the logging options:

Full logging (log): File system structural changes are logged to disk before the system call returns to the application (synchronously). If the system crashes, fsck(1M) will complete logged operations that have not completed.

Delayed logging (delaylog): Some system calls return before the intent log is written. This improves the performance of the system, but some changes are not guaranteed until a short time later when the intent log is written. This mode approximates traditional UNIX system guarantees for correctness in case of system failure.

Temporary logging (tmplog): The intent log is almost always delayed. This improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems.

No logging (nolog): The intent log is disabled. The other three logging modes provide for fast file system recovery; nolog does not. With nolog mode, a full structural check must be performed after a crash. This may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs(1M) after a crash. The nolog mode should only be used for memory-resident or very temporary file systems.

nodatainlog: The nodatainlog mode should be used on systems with disks that do not support bad block revectoring. Normally, a VxFS file system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the application, the file system has told the application that the data is already written. If a disk error causes the data update to fail, then the file must be marked bad and the entire file is lost. If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes should be allowed. If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used. A nodatainlog mode file system should be approximately 50 percent slower than a standard mode VxFS file system for synchronous writes. Other operations are not affected.

blkclear: The blkclear mode is used in increased data security environments. The blkclear mode guarantees that uninitialized storage never appears in files. The increased integrity is provided by clearing extents on disk when they are allocated to a file. Extending writes are not affected by this mode. A blkclear mode file system should be approximately 10 percent slower than a standard mode VxFS file system, depending on the workload.

Intent Log Mount Options

• Full logging* log
• Delayed logging delaylog
• Temporary logging tmplog
• No logging nolog
• Disallow small sync I/Os in log nodatainlog (50% perf cost!)
• Force clear new file blocks blkclear (10% perf cost!)

*Note: only the first option is the default for mount.
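As a sketch of how a logging mode is selected in practice, it is simply listed in the mount options field of /etc/fstab; the volumes and mount points below are hypothetical:

```
# Hypothetical /etc/fstab entries: full logging where integrity matters most,
# delaylog as the usual performance compromise, tmplog for scratch space only.
/dev/vgjfs/lvol1  /critical  vxfs  log       0 2
/dev/vgjfs/lvol2  /home      vxfs  delaylog  0 2
/dev/vgjfs/lvol3  /scratch   vxfs  tmplog    0 2
```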


10–12. SLIDE: Other JFS Mount Options

Student Notes

Understanding asynchronous, data synchronous (O_DSYNC) and fully synchronous (O_SYNC) application I/O.

When an application program opens a file with the open() system call, the programmer makes a decision on how the I/Os will occur between the application memory and the file system. The following three options are available, in order, ranging from highest performance (lowest integrity) to lowest performance (best integrity). In this discussion “integrity” refers to the potential damage to file system structures and customer data during a system crash.

1. Asynchronous I/O (standard mode): high performance / low integrity

In asynchronous mode, all application I/Os are done to the buffer cache, including data and inode modifications. The write() system call returns quickly to the application, which continues "in faith" that the data will make it to the disk. Data integrity will be fully compromised by a system crash, and new "just created" files may even disappear.

Other JFS Mount Options

mincache options (buffer cache)
• closesync*
• direct
• dsync
• unbuffered
• tmpcache

convosync options (synchronous I/O)
• closesync
• direct
• dsync
• unbuffered
• delay

* NOTE: This is the only additional option available with BaseJFS; all other options require OnlineJFS.


2. Data Synchronous I/O (O_DSYNC): low performance / good integrity

If the file is opened with the O_DSYNC flag, the file is in "data synchronous" mode. In this situation, write() system calls that modify data do not return until the disk has acknowledged the receipt of the data. However, some inode changes (time stamps, etc.) are still performed asynchronously and may not have arrived at the disk in the case of a system crash.

3. Synchronous I/O (O_SYNC): lowest performance / best integrity

Fully synchronous behavior is obtained by opening the file with O_SYNC. All operations are now synchronous, and write() system calls block for both data and inode modifications. Minimal damage will now occur in the event of a system crash.

mincache vs. convosync

mincache manipulates the behavior of the buffer cache. All of the mincache options except mincache=closesync require the OnlineJFS product (see slide). convosync ("convert osync") changes the behavior of data synchronous (O_DSYNC) and synchronous (O_SYNC) writes. All convosync options require OnlineJFS. The mincache and convosync options generally control the integrity of the user data, while the log options (log, delaylog, tmplog, nolog) control the integrity of the metadata only.

mincache

mincache=closesync: Flush data to disk synchronously when the file is closed. The mincache=closesync mode is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. In this mode, any changes to the file are flushed to disk synchronously when the file is closed. To improve performance, most file systems do not synchronously update data and inode changes to disk. If the system crashes, files that have been updated within the past minute are in danger of losing data. With the mincache=closesync mode, if the system crashes or is switched off, only files that are currently open can lose data. A mincache=closesync mode file system should be approximately 15 percent slower than a standard mode VxFS file system, depending on the workload.

mincache=direct: Bypass the buffer cache for all data and inode changes. Forces fully synchronous behavior and totally skips the buffer cache.

mincache=unbuffered: Bypass the buffer cache for data only; inode changes are cached. Forces data synchronous-like behavior with no data in cache.


mincache=dsync: Equivalent to normal data synchronous behavior. Write does not return until data is on disk, but data does go through the buffer cache.

The mincache=direct, mincache=unbuffered, and mincache=dsync modes are used in environments where applications are experiencing reliability problems caused by the kernel buffering of I/O and delayed flushing of non-synchronous I/O. The mincache=direct and mincache=unbuffered modes guarantee that all non-synchronous I/O requests to files will be handled as if the VX_DIRECT or VX_UNBUFFERED caching advisories had been specified. The mincache=dsync mode guarantees that all non-synchronous I/O requests to files will be handled as if the VX_DSYNC caching advisory had been specified. Refer to vxfsio(7) for explanations of VX_DIRECT, VX_UNBUFFERED, and VX_DSYNC. The mincache=direct, mincache=unbuffered, and mincache=dsync modes also flush file data on close, as mincache=closesync does.

mincache=tmpcache: Speeds up file growth by breaking data initialization rules. The -o mincache=tmpcache option only affects write extending calls and is not available to files performing synchronous I/O. Write extending calls are write calls that cause new file system blocks to be assigned to the file, extending the size of the file in blocks. The normal behavior for write extending calls is to write the new user data first, and to write the metadata only after the user data. Write extending calls are expensive from a performance standpoint, because the write call has to wait for both the user data and the metadata to be written. A non-extending write call only requires the call to wait for the metadata. With the -o mincache=tmpcache option, write extending calls do not have to wait for the user data to be written. This option allows the metadata to be written before the user data (and the write call to return before the user data is written), significantly improving performance.

CAUTION: The -o mincache=tmpcache option significantly increases the likelihood of non-initialized file system blocks (i.e., junk) appearing in files after a system crash. This is due to the file pointing to data blocks before the data is actually there. If the system crashes between the file's inode being updated (done first) and the user data being written (done second), then uninitialized data will appear in the file. The tmpcache option should only be used for memory-resident or very temporary file systems.


convosync

NOTE: Use of the convosync=dsync option violates POSIX guarantees for synchronous I/O.

The "convert osync" (convosync) mode has five values: convosync=closesync, convosync=direct, convosync=dsync, convosync=unbuffered, and convosync=delay.

The convosync=closesync mode converts synchronous and data synchronous writes to non-synchronous writes and flushes the changes in the file to disk when the file is closed.

The convosync=delay mode causes synchronous and data synchronous writes to be delayed rather than to take effect immediately. No special action is performed when closing a file. This option effectively cancels any data integrity guarantees normally provided by opening a file with O_SYNC. See open(2), fcntl(2), and vxfsio(7) for more information on O_SYNC.

CAUTION: Extreme care should be taken when using the convosync=closesync or convosync=delay mode, because they actually change synchronous I/O into non-synchronous I/O. This may cause applications that use synchronous I/O for data reliability to fail if the system crashes and synchronously written data is lost.

The convosync=direct and convosync=unbuffered modes convert synchronous and data synchronous reads and writes to direct reads and writes, bypassing the buffer cache. The convosync=dsync mode converts synchronous writes to data synchronous writes. As with closesync, the direct, unbuffered, and dsync modes flush changes in the file to disk when it is closed. These modes can be used to speed up applications that use synchronous I/O. Many applications that are concerned with data integrity specify O_SYNC in order to write the file data synchronously. However, this has the undesirable side effect of updating inode times and therefore slowing down performance. The convosync=dsync, convosync=unbuffered, and convosync=direct modes alleviate this problem by allowing applications to take advantage of synchronous writes without modifying inode times as well.

NOTE: Before using convosync=dsync, convosync=unbuffered, or convosync=direct, make sure that all applications that use the file system do not require synchronous inode time updates for O_SYNC writes.


10–13. SLIDE: JFS Mount Option: mincache=direct

Student Notes

The slide above illustrates the impact of setting the -o mincache=direct option. By default, all JFS file system I/O goes through the system's buffer cache. When an application does its own caching (e.g., an Oracle database application), there are two levels of caching: one cache managed by the application, the other managed by the kernel. Using two caches is inefficient from both a performance and a memory usage standpoint (data exists in both caches). When the file system is mounted with the -o mincache=direct option, the system's buffer cache is bypassed and the data is written directly to disk. This improves performance and keeps the buffer cache available for other file systems that do not go through an application cache.

JFS Mount Option: mincache=direct

(Slide diagram, two panels: "Data Flow with default mount options" shows the Oracle process and its SGA database cache writing through the system buffer cache to the Oracle database; "Data Flow with mount option mincache=direct" shows the same path with the buffer cache bypassed.)


CAUTION: Use of the -o mincache=direct option can lead to a significant decrease in performance if used in the wrong situation. This option should only be used if:

1. An application creates and maintains its own data cache, and

2. All the files on the file system are cached in the application's data cache.

If some files on the mounted file system are being accessed without being cached by the application, this option should not be used.
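For example, an Oracle data file system meeting both conditions might be mounted as follows (a sketch: the volume and mount point are hypothetical, and mincache=direct requires OnlineJFS):

```
# Hypothetical /etc/fstab entry: bypass the buffer cache for a file system
# whose files are all cached in the application's own SGA.
/dev/vgora/lvol1  /oradata  vxfs  delaylog,mincache=direct  0 2
```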

NOTE: This option is only available with the OnlineJFS product.


10-14. SLIDE: JFS Mount Option: mincache=tmpcache

Student Notes

By default, when a process performs a write extending call, the new data is written to disk before the file's inode is updated. In the slide above, the left side shows the default behavior:

1. Write data to newly allocated file system block.

2. Write JFS transaction meta-data out to the disk. The system call returns.

The advantage of this behavior is that uninitialized data will not be found within the file should a system crash occur. This is important from a data integrity standpoint. The disadvantage of this behavior is slow performance, because the JFS transaction must wait for the user data I/O to complete before it can be written to the intent log.

Behavior with -o mincache=tmpcache Option

Performance can be improved (at the expense of data integrity) by mounting file systems with the -o mincache=tmpcache option. This option allows the JFS transactions to be written to the intent log before the user data is written to the file. In the slide, the right side shows the tmpcache behavior:

JFS Mount Option: mincache=tmpcache

(Slide diagram, two panels, "default" and "mincache=tmpcache": a process writes a file through the buffer cache in memory; on disk, the JFS transaction goes to the intent log and the user data goes to an allocation unit. In the default panel the user data is written first (1) and the JFS transaction second (2); in the mincache=tmpcache panel the order is reversed.)


1. Write JFS transaction out to disk. (The system call returns).

2. Write data to newly allocated file system block.

The advantage of this behavior is that write extending calls are fast: the system does not wait for the user data to be written to disk. The disadvantage is that the data integrity of the file is jeopardized, especially if the file is being updated at the time of a system crash. Because the file's inode is updated first, the file points to uninitialized data blocks which contain unknown data. The uninitialized file system blocks are expected to be initialized soon after the inode is updated; however, there still exists a small window of time when the file's inode references unknown data. If the system crashes during this small window, then the file will still reference the uninitialized data after the crash.

CAUTION: The -o mincache=tmpcache option should only be used for memory-resident or very temporary file systems.


10–15. SLIDE: Kernel Tunables

Student Notes

Internal Inode Table Size

VxFS caches inodes in an inode table (see the table below, "Inode Table Size"). There is a tunable in VxFS called vx_ninode that determines the number of entries in the inode table. A VxFS file system obtains the value of vx_ninode from the system configuration file used for building the kernel (/stand/system, for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode is set to zero; the kernel then computes a value based on the system memory size.

Kernel Tunables

• VxFS inodes are cached in memory, separate from HFS.
• Kernel parameter ninode has no effect on VxFS.
• When vx_ninode is zero (default), the inode cache is sized in proportion to system memory (see table).
• vx_ncsize sets the directory name lookup cache (1 KB)


If the available memory is a value between two entries, the value of vx_ninode is interpolated.

Other VxFS Kernel Parameters

vx_ncsize: Controls the size of the DNLC (directory name lookup cache) in the kernel. Recent directory path names are stored in memory to improve performance. This parameter is set in DNLC entries. The size of the DNLC is set to the sum of ninode and vx_ncsize.
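Both tunables are set in the kernel configuration file and take effect after the kernel is rebuilt and the system rebooted; a sketch, with illustrative values rather than recommendations:

```
* /stand/system fragment (values illustrative): pin the VxFS inode cache
* at 64000 entries instead of letting the kernel size it from memory,
* and add 16000 entries of DNLC headroom.
vx_ninode 64000
vx_ncsize 16000
```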

Total Memory (MB)    Maximum Number of Inodes
      8                      400
     16                     1000
     32                     2500
     64                     6000
    128                     8000
    256                    16000
    512                    32000
   1024                    64000
   2048                   128000
   8192                   256000
  32768                   512000
 131072                  1024000
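The interpolation rule stated above can be sketched with a small helper (`vx_ninode_for_mem` is a hypothetical name; it assumes simple linear interpolation between table rows, which approximates but may not exactly match the kernel's computation):

```shell
# Sketch: linearly interpolate vx_ninode between the table entries above.
# vx_ninode_for_mem is a hypothetical helper; argument is memory in MB.
vx_ninode_for_mem() {
    awk -v mem="$1" 'BEGIN {
        n = split("8 16 32 64 128 256 512 1024 2048 8192 32768 131072", m)
        split("400 1000 2500 6000 8000 16000 32000 64000 128000 256000 512000 1024000", v)
        for (i = 1; i < n; i++)
            if (mem >= m[i] && mem <= m[i+1]) {
                printf "%d\n", v[i] + (mem - m[i]) * (v[i+1] - v[i]) / (m[i+1] - m[i])
                exit
            }
    }'
}

vx_ninode_for_mem 768    # halfway between the 512 MB and 1024 MB rows: 48000
```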


10–16. SLIDE: Fragmentation

Student Notes

• Keep file system free space over 10%

In general, VxFS works best if the percentage of free space in the file system does not drop below 10 percent, because file systems with 10 percent or more free space have less fragmentation and better extent allocation. Regular use of the df(1M) command to monitor free space is desirable. Full file systems should therefore have some files removed, or should be expanded (see fsadm(1M) for a description of online file system expansion).

• Maintain free space distribution goals

Three factors can be used to determine the degree of fragmentation:

• percentage of free space in extents of less than 8 blocks in length
• percentage of free space in extents of less than 64 blocks in length
• percentage of free space in extents of 64 blocks or greater

An unfragmented file system will have the following characteristics:

Fragmentation

• Keep file system free space over 10%
• Maintain free space distribution goals
  – Monitor with df(1M) or fsadm(1M)
• Repack files and free space with fsadm -e
  – Reduces the number of extents in large files
  – Makes small files contiguous (one extent)
  – Moves small, recently used files closer to inode structures
  – Optimizes free space into larger extents
• Repack directories with fsadm -d
  – Remove empty entries from directories
  – Place recently used files at the beginning of directory lists
  – Pack small directories directly in the inode if possible


• less than 1% of free space in extents of less than 8 blocks in length
• less than 5% of free space in extents of less than 64 blocks in length
• more than 5% of the total file system size available as free extents of 64 or more blocks in length

A fragmented file system will have the following characteristics:

• greater than 5% of free space in extents of less than 8 blocks in length
• more than 50% of free space in extents of less than 64 blocks in length
• less than 5% of the total file system size available as free extents of 64 or more blocks in length
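The two sets of thresholds can be sketched as a quick classifier (`frag_class` is a hypothetical helper; it takes the three factors as whole-number percentages, and anything between the two sets of thresholds is reported as intermediate):

```shell
# Sketch: classify fragmentation from the three factors listed above.
# Arguments: %free in extents <8 blocks, %free in extents <64 blocks,
# and free extents of >=64 blocks as a % of total file system size.
frag_class() {
    lt8=$1; lt64=$2; ge64=$3
    if [ "$lt8" -lt 1 ] && [ "$lt64" -lt 5 ] && [ "$ge64" -gt 5 ]; then
        echo unfragmented
    elif [ "$lt8" -gt 5 ] && [ "$lt64" -gt 50 ] && [ "$ge64" -lt 5 ]; then
        echo fragmented
    else
        echo intermediate
    fi
}

frag_class 0 3 40    # meets all three "unfragmented" criteria
```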

Using df(1M)

The following example shows how to use df to map free space:

# df -F vxfs -o s /usr
/usr (/dev/vg00/lvol7 ) :
Free Extents by Size
    1:   823     2:   206     4:    55     8:   206
   16:   158    32:    61    64:    48   128:    43
  256:    23   512:    14  1024:     3  2048:     3
 4096:     1  8192:     1 16384:     0 32768:     0
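The histogram can be reduced to the three free-space distribution factors with a small filter (a sketch: `frag_report` is a hypothetical helper reading "size: count" pairs, and the sample input is the /usr output above):

```shell
# Sketch: compute the three fragmentation factors from "df -F vxfs -o s"
# style output (pairs of "extent-size: count", sizes in blocks).
frag_report() {
    awk '
    { for (i = 1; i < NF; i += 2) {
          size = $i; sub(/:/, "", size)
          blocks = size * $(i + 1)
          total += blocks
          if (size < 8)  small += blocks
          if (size < 64) medium += blocks
      } }
    END { printf "<8 blocks: %.1f%%  <64 blocks: %.1f%%  >=64 blocks: %.1f%%\n",
                 100 * small / total, 100 * medium / total,
                 100 * (total - medium) / total }'
}

echo "1: 823 2: 206 4: 55 8: 206 16: 158 32: 61 64: 48 128: 43
256: 23 512: 14 1024: 3 2048: 3 4096: 1 8192: 1 16384: 0 32768: 0" | frag_report
```

For the /usr sample this reports about 2.9% of free space in extents under 8 blocks and about 15% in extents under 64 blocks, which can then be compared against the thresholds in the notes above.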

• Repack files and free space

fsadm -e has the following goals for files and free data space:

• Make "small" files (default: < 64K) one contiguous extent
• Ensure that "large" files are built from large extents
• Move "small" and "recently used" (default: < 14 days) files near the inode area
• Move "large" or "old" (> 14 days since last access) files to the end of the allocation unit
• Consolidate free space in the center of the data area

• Repack directories

fsadm -d has the following goals for directories:

• Remove unused space from between used directory entries
• Pack directories and symbolic links into the inode immediate area if possible
• Place directories and symbolic links first, then other files
• Sort each area by time of last access

fsadm(1M) Overview

Because blocks are allocated and deallocated as files are added, removed, expanded, and truncated, block space can become fragmented. This can make it more difficult for JFS to take advantage of the benefits provided by a contiguous extent allocation. To remove fragmentation, HP OnlineJFS includes a utility called fsadm, which will take fragmented blocks and reallocate them as contiguous extents. The fsadm utility can be run on a live file system (including one containing active databases) safely without interrupting data access.


The fsadm utility will bring the fragmented extents of files closer together, group them by type and frequency of access, and compact and sort directories. The fsadm utility is typically run as a recurring scheduled job and is an effective tool for the management of a high-performance online file system. Even if database software used on top of the file system has its own defragmenter, this additional defragmentation is necessary to make the storage that the database engine sees as contiguous as possible. You can defragment (reorganize) your HP OnlineJFS file system using SAM or with fsadm(1M), directly from the command line. To use SAM:

1. Invoke SAM.

2. Select the Disks and File Systems functional area.

3. Select the File Systems application.

4. Select the JFS file system that you wish to reorganize from the directories' list.

5. Select the Actions menu.

6. Select the VxFS Maintenance menu item.

7. View reports on extent and directory fragmentation, then select Reorganize Extents or Reorganize Directories to defragment your JFS file system.


10–17. TEXT PAGE: Monitoring and Repairing File Fragmentation

For optimal performance, the JFS extent allocator must be able to find large extents when it wants them. To maintain file system performance, the fsadm utility should be run periodically against all JFS file systems to reduce fragmentation. The fsadm utility should be run somewhere between once a day and once a month against each file system; the frequency depends on file system usage and activity patterns, and on the importance of performance. The -v option can be used to examine the amount of work performed by fsadm, and the frequency of reorganization can be adjusted based on the rate of file system fragmentation. To perform both directory and extent reorganization, and to output reports on directory and extent fragmentation before and after reorganization, enter the following:

# fsadm -F vxfs -d -D -e -E /<jfs_mount_point>
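Because fsadm can run safely on a mounted file system, this full reorganization command is commonly scheduled as a recurring job. A hypothetical crontab entry is sketched below; the schedule, mount point, and log path are illustrative only, not part of the course material:

```
# Run a weekly directory and extent reorganization on /home at 02:00 Sunday,
# reporting fragmentation before and after, and keep the output for review.
# (Schedule, mount point, and log file are examples only.)
0 2 * * 0 /usr/sbin/fsadm -F vxfs -d -D -e -E /home > /var/adm/fsadm.home.log 2>&1
```

The -v verbose output in the log makes it easy to judge whether the interval should be shortened or lengthened.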

Reorganizing Options

-F vxfs      Specify the JFS file system type.

-D           Report on directory fragmentation. If specified in conjunction with the -d
             option, the fragmentation report is produced both before and after the
             directory reorganization.

-E           Report on extent fragmentation. If specified in conjunction with the -e
             option, the fragmentation report is produced both before and after the
             extent reorganization.

-d           Reorganize directories. Directory entries are reordered to place
             subdirectory entries first, then all other entries in decreasing order of
             time of last access. The directory is also compacted to remove free space.

-e           Extent reorganization. Attempt to minimize fragmentation. Aged files are
             moved to the end of the allocation units to produce free space. Other
             files are reorganized to have the minimum number of extents possible.

-s           Print a summary of activity at the end of each pass.

-v           Verbose. Report reorganization activity.

-a days      Consider files not accessed within the specified number of days as aged
             files. The default is 14 days. Aged files are moved to the end of the
             directory by the -d option and reorganized differently by the -e option.

-p passes    Maximum number of passes to run. The default is 5 passes. Reorganization
             continues until it is complete or until the specified number of passes
             have been run.

-t time      Maximum time to run, in seconds. Reorganization continues until it is
             complete or until the time limit has expired.


If both the -t and -p options are specified, the utility exits when either terminating condition is reached. If both the -e and -d options are specified, the utility runs all of the directory reorganization passes before any extent reorganization passes.

fsadm uses the file .fsadm in the lost+found directory as a lock file. When fsadm is invoked, it opens the file lost+found/.fsadm in the root of the file system specified by mount_point. If the file does not exist, it is created. The fcntl(2) system call is used to obtain a write lock on the file. If the write lock fails, fsadm assumes that another fsadm is already running and fails, reporting the process ID of the process holding the write lock on the .fsadm file.

Reporting on Directory Fragmentation

As files are allocated and freed, directories tend to grow and become sparse. In general, a directory is as large as the largest number of files it ever contained, even if some files have been subsequently removed. The command line to obtain a directory fragmentation report is:

# fsadm -D /mountpoint_dir

The following is example output from the fsadm -D command:

# fsadm -D /home

Directory Fragmentation Report

         Dirs      Total   Immed   Immeds   Dirs to   Blocks to
         Searched  Blocks  Dirs    to Add   Reduce    Reduce
  au 0   15        3       12      0        0         0
  au 1   0         0       0       0        0         0
  total  15        3       12      0        0         0

The Dirs Searched column contains the total number of directories. A directory is associated with the extent-allocation unit containing the extent in which the directory's inode is located. The Total Blocks column contains the total number of blocks used by directory extents. The Immed Dirs column contains the number of directories that are immediate, meaning that the directory data is in the inode itself as opposed to being in an extent. Immediate directories save space and speed path name resolution. The Immeds to Add column contains the number of directories that currently have a data extent, but that could be reduced in size and contained entirely in the inode. The Dirs to Reduce column contains the number of directories for which one or more blocks can be freed, if the entries in the directory are compressed to make the free space in the directory contiguous. Because directory entries vary in length, large directories may contain a block or more of total free space, but with the entries arranged in such a way that the space cannot be made contiguous. As a result, it is possible to have a non-zero Dirs to
Reduce calculation immediately after running a directory reorganization. The -v (verbose) option of directory reorganization reports occurrences of failure to compress free space. The Blocks to Reduce column contains the number of blocks that can be freed if the entries in the directory are compressed.

Measuring Directory Fragmentation

If the totals in the Dirs to Reduce column are substantial, a directory reorganization should improve the performance of path name resolution. The directories that fragment tend to be the directories with the most activity. A small number of fragmented directories may account for a large percentage of name lookups in the file system.

Directory Reorganization

If the -d option is specified, fsadm will reorganize the directories on the file system whose mount point is mountpoint_dir. Directories are reorganized in two ways: compressing and sorting. For compression, the valid entries in the directory are moved to the front of the directory and the free space is grouped at the end. If there are no entries in the last block of the directory, the block is released and the directory size is reduced. If the directory entries are small enough, the directory is placed in the inode immediate data area.

The entries in a directory are also sorted, to improve path name lookup performance. Entries are sorted based on the last access time of the entry. The -a option is used to specify a time interval; 14 days is the default if -a is not specified. The time interval is broken up into 128 buckets, and all times within the same bucket are considered equal. All access times older than the time interval are considered equal, and those entries are placed last. Subdirectory entries are placed at the front of the directory, symbolic links are placed after subdirectories, followed by the most recently accessed files.

The directory reorganization runs in one pass across the entire file system. The command line to reorganize the directories of a file system is:

fsadm -d [-s] [-v] [-p passes] [-t timeout] [-r rawdev] [-D] /mountpoint_dir

The following example illustrates the output of the fsadm -d -s command:

# fsadm -d -s /home

Directory Reorganization Statistics

         Dirs      Dirs     Total   Failed  Blocks   Blocks   Immeds
         Searched  Changed  Ioctls  Ioctls  Reduced  Changed  Added
  au 0   2343      1376     2927    1       209      3120     72
  au 1   582       254      510     0       47       586      28
  au 2   142       26       38      0       21       54       16
  au 3   88        24       29      1       5        36       2
  total  3155      1680     3504    2       282      3796     118
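When reading such a report, each column's totals row is simply the sum of its per-allocation-unit rows. The awk sketch below verifies this against the sample figures above (the here-document just reproduces the data rows):

```shell
# Sum the per-au rows of a directory reorganization report and compare
# against the report's own "total" row.
awk '
$1 == "total" { for (i = 2; i <= NF; i++) reported[i] = $i; next }
/^au/         { for (i = 3; i <= NF; i++) sum[i-1] += $i }   # "au N" spans two fields
END {
    ok = 1
    for (i in reported)
        if (reported[i] != sum[i]) ok = 0
    print (ok ? "totals consistent" : "totals mismatch")
}' <<'EOF'
au 0 2343 1376 2927 1 209 3120 72
au 1 582 254 510 0 47 586 28
au 2 142 26 38 0 21 54 16
au 3 88 24 29 1 5 36 2
total 3155 1680 3504 2 282 3796 118
EOF
```

For the sample data this prints "totals consistent"; a mismatch would suggest a misread report rather than a file system problem.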


The Dirs Searched column contains the number of directories searched. Only directories with data extents are reorganized. Immediate directories are skipped. The Dirs Changed column contains the number of directories for which a change was made. The Total Ioctls column contains the total number of VX_DIRSORT ioctls performed. Reorganization of directory extents is performed using this ioctl. The Failed Ioctls column contains the number of requests that failed. The reason for failure is usually that the directory being reorganized is active. A few failures should be no cause for alarm. If the -v option is used, all ioctl calls and status returns are recorded. The Blocks Reduced column contains the total number of directory blocks freed by compressing entries. The Blocks Changed column contains the total number of directory blocks updated while sorting and compressing entries. The Immeds Added column contains the total number of directories with data extents that were compressed into immediate directories.

Reporting on Extent Fragmentation

As files are created and removed over time, the free extent map for an allocation unit will change from having one large free area to having many smaller free areas. This process is known as fragmentation. Also, when files are grown, particularly when growth occurs in small increments, small files can be allocated in multiple extents. In the ideal case, each file that is not sparse will have exactly one extent (containing the entire file), and the free-extent map is one continuous range of free blocks. Conversely, in a case of extreme fragmentation, there can be free space in the file system, none of which can be allocated. For example, on Version 2 JFS file systems, the indirect-address extent size is always 8 KB long. This means that to allocate an indirect-address extent to a file, an 8-KB extent must be available. If no extent of 8 KB or larger is available, even though more than 8 KB of free space is available, an attempt to allocate a file into indirect extents will fail and return ENOSPC.

Determining Fragmentation

To determine whether fragmentation exists for a given file system, examine the free extents for that file system. If a large number of small extents are free, there is fragmentation. If more than half of the free space is taken up by small extents (smaller than 64 blocks), or less than 5 percent of total file system space is available in large extents, then there is serious fragmentation.

Running the Extent-Fragmentation Report

The extent-fragmentation report can be run to acquire detailed information about the degree of fragmentation in a given file system. The following is the command line to run an extent-fragmentation report:

fsadm -E [-l largesize] /mountpoint_dir

The extent reorganizer has the concept of an immovable extent: if a file already contains large extents, reallocating and consolidating those extents will not improve performance, so
they are considered immovable. How large an extent must be to qualify as immovable can be controlled with the -l option. By default, largesize is 64 blocks, meaning that any extent larger than 64 blocks is considered to be immovable. For the purposes of the extent fragmentation report, the value chosen for largesize will affect which extents are reported as being immovable extents. The following is an example of the output generated by the fsadm -E command:

# fsadm -E /home

Extent Fragmentation Report

         Files with  Total    Total
         Extents     Extents  Blocks   Distance
  au 0   14381       18607    30516    4440997
  au 1   2822        3304     24562    927841
  au 2   2247        2884     22023    1382962
  au 3   605         780      24039    679867
  total  19992       25575    101140   7431667

         Consolidatable       Immovable
         Extents   Blocks     Extents  Blocks
  au 0   928       2539       0        0
  au 1   461       5225       99       13100
  au 2   729       8781       58       11058
  au 3   139       1463       49       17258
  total  2257      18008      206      41416

Free Extents by Size

au 0   Free Blocks 217,  Smaller Than 8 - 48%,  Smaller Than 64 - 100%
       1: 15    2: 15    4: 15    8: 14    16: 0    32: 0    64: 0    128: 0
       256: 0   512: 0   1024: 0  2048: 0  4096: 0  8192: 0  16384: 0
au 1   Free Blocks 286,  Smaller Than 8 - 41%,  Smaller Than 64 - 100%
       1: 16    2: 21    4: 15    8: 13    16: 4    32: 0    64: 0    128: 0
       256: 0   512: 0   1024: 0  2048: 0  4096: 0  8192: 0  16384: 0
au 2   Free Blocks 510,  Smaller Than 8 - 15%,  Smaller Than 64 - 100%
       1: 10    2: 14    4: 10    8: 14    16: 8    32: 6    64: 0    128: 0
       256: 0   512: 0   1024: 0  2048: 0  4096: 0  8192: 0  16384: 0
au 3   Free Blocks 6235, Smaller Than 8 - 3%,   Smaller Than 64 - 15%
       1: 29    2: 33    4: 27    8: 30    16: 18   32: 8    64: 4    128: 3
       256: 2   512: 2   1024: 1  2048: 1  4096: 0  8192: 0  16384: 0
au 4   Free Blocks 8551, Smaller Than 8 - 2%,   Smaller Than 64 - 22%
       1: 29    2: 33    4: 30    8: 38    16: 28   32: 29   64: 26   128: 11
       256: 8   512: 3   1024: 0  2048: 0  4096: 0  8192: 0  16384: 0
total  Free Blocks 15799, Smaller Than 8 - 4%,  Smaller Than 64 - 24%
       1: 99    2: 116   4: 97    8: 109   16: 58   32: 43   64: 30   128: 14
       256: 10  512: 5   1024: 1  2048: 1  4096: 0  8192: 0  16384: 0

The numbers in the Files with Extents column indicate the total number of files that have data extents. A file is considered to be in the extent-allocation unit that contains the extent holding the file's inode. The Total Extents column contains the total number of extents belonging to files in the allocation unit; the extents themselves are not necessarily in the same allocation unit. The Total Blocks column contains the total number of blocks used by files in the allocation unit. If the total number of blocks is divided by the total number of extents, the resulting figure is the average extent size.

The Total Distance column contains the total distance between extents in the allocation unit. For example, if a file has two extents, the first containing blocks 100 through 107 and the second containing blocks 110 through 120, the distance between the extents is 110 - 107, or 3. In general, a lower number means that files are more contiguous. If an extent reorganization is run on a fragmented file system, the value for Total Distance should be reduced.

The Consolidatable Extents column contains the number of extents that are candidates to be consolidated. Consolidation means merging two or more extents into one combined extent. For files that are entirely in direct extents, the extent reorganizer will attempt to consolidate extents into extents up to size largesize. All files of size largesize or less typically will be contiguous in one extent after reorganization. Since most files are small, this will usually include about 98 percent of all files.
The Consolidatable Blocks column contains the total number of blocks in Consolidatable Extents. The Immovable Extents column contains the total number of extents that are considered to be immovable. In the report, an immovable extent appears in the allocation unit of the extent itself, as opposed to in the allocation unit of its inode. This is because the extent is considered to be immovable, and thus permanently fixed in the associated allocation unit. The Immovable Blocks column contains the total number of blocks in immovable extents. The figures under the Free Extents by Size heading indicate per-allocation unit totals for free extents of each size. The totals are for free extents of size 1, 2, 4, 8, 16, . . . up to a maximum of the number of data blocks in an allocation unit. The totals should match the output of df -o s unless there has been recent allocation or deallocation activity (as this utility acts on
mounted file systems). These figures give an indication of fragmentation and extent availability on a per-allocation-unit basis. For each allocation unit, and for the complete file system, the total free blocks and total free blocks by category are shown. The figures are presented as follows:

• The Free Blocks figure indicates the total number of free blocks.

• The Smaller Than 8 figure indicates the percentage of free blocks that are in extents less than 8 blocks in length.

• The Smaller Than 64 figure indicates the percentage of free blocks that are in extents less than 64 blocks in length.

In the preceding example, 4 percent of free space is in extents less than 8 blocks in length, and 24 percent of the free space is in extents less than 64 blocks in length. This represents a typical value for a mature file system that is regularly reorganized. The total free space is about 10 percent.
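The percentage figures in the report can be recomputed from the per-size free-extent counts. The following awk sketch multiplies each extent size by its count (using the counts from the totals line of the sample report above) and derives the same three figures:

```shell
# Recompute "Free Blocks", "Smaller Than 8" and "Smaller Than 64" from the
# per-size free-extent counts in the sample report's totals line:
#   1: 99  2: 116  4: 97  8: 109  16: 58  32: 43  64: 30
#   128: 14  256: 10  512: 5  1024: 1  2048: 1
awk 'BEGIN {
    split("1 2 4 8 16 32 64 128 256 512 1024 2048", size)
    split("99 116 97 109 58 43 30 14 10 5 1 1",     count)
    for (i = 1; i <= 12; i++) {
        blocks = size[i] * count[i]       # free blocks held in extents of this size
        total += blocks
        if (size[i] < 8)  under8  += blocks
        if (size[i] < 64) under64 += blocks
    }
    printf "Free Blocks %d, Smaller Than 8 - %d%%, Smaller Than 64 - %d%%\n",
           total, under8 * 100 / total, under64 * 100 / total
}'
```

The output, `Free Blocks 15799, Smaller Than 8 - 4%, Smaller Than 64 - 24%`, matches the totals line of the sample report.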

Extent Reorganization

If the -e option is specified, fsadm will reorganize the data extents on the file system whose mount point is mountpoint_dir. The primary goal of extent reorganization is to defragment the file system. To reduce fragmentation, extent reorganization tries to place all small files in one contiguous extent. The -l option is used to specify the size of a file that is considered large; the default is 64 blocks. Extent reorganization also tries to group large files into large extents of at least 64 blocks.

In addition to reducing fragmentation, extent reorganization improves performance. Small files can be read or written in one I/O operation, and large files can approach raw-disk performance for sequential I/O operations. Extent reorganization also tries to improve the locality of reference on the file system. Extents are moved into the same allocation unit as their inode. Within the allocation unit, small files and directories are migrated to the front of the allocation unit, while large files and inactive files are migrated towards the back. (A file is considered inactive if the access time on the inode is more than 14 days old. The time interval can be varied using the -a option.) Extent reorganization should reduce the average seek time by placing inodes and frequently used data closer together.

fsadm will try to perform extent reorganization on all inodes on the file system. Each pass through the inodes moves the file system closer to the organization considered optimal by fsadm. The first pass might place a file into one contiguous extent. The second pass might move the file into the same allocation unit as its inode. Then, since the first file has been moved, a third pass might move extents for a file in another allocation unit into the space vacated by the first file during the second pass.

When the file system is more than 90 percent full, fsadm shifts to a different reorganization scheme. Instead of attempting to make files contiguous, extent reorganization tries to defragment the free-extent map into chunks of at least 64 blocks, or the size specified by the -l option.


The following is the command line to perform extent reorganization:

fsadm -F vxfs -e [-sv][-p passes][-t time][-a days][-l largesize] /mountpoint_dir

The following example illustrates the output from the fsadm -F vxfs -e -s command:

# fsadm -F vxfs -e -s

Allocation Unit 0, Pass 1 Statistics

         Extents   Consolidations Performed   Total Errors
         Searched  Number  Extents  Blocks    File Busy  Not Free
  au 0   2467      11      30       310       0          0
  au 1   0         0       0        0         0          0
  au 2   0         0       0        0         0          0
  au 3   0         0       0        0         0          0
  au 4   0         0       0        0         0          0
  total  2467      11      30       310       0          0

         In Proper Location   Moved to Proper Location
         Extents  Blocks      Extents  Blocks
  au 0   1379     8484        794      10925
  au 1   0        0           0        0
  au 2   0        0           0        0
  au 3   0        0           0        0
  au 4   0        0           0        0
  total  1379     8484        794      10925

         Moved to Free Area   In Free Area   Could not be Moved
         Extents  Blocks      Extents Blocks Extents  Blocks
  au 0   231      4851        4       133    0        0
  au 1   0        0           0       0      0        0
  au 2   0        0           0       0      0        0
  au 3   0        0           0       0      0        0
  au 4   0        0           0       0      0        0
  total  231      4851        4       133    0        0

Allocation Unit 0, Pass 2 Statistics

         Extents   Consolidations Performed   Total Errors
         Searched  Number  Extents  Blocks    File Busy  Not Free
  au 0   2467      0       0        0         0          0
  au 1   0         0       0        0         0          0
  au 2   0         0       0        0         0          0
  au 3   0         0       0        0         0          0
  au 4   0         0       0        0         0          0
  total  2467      0       0        0         0          0

         In Proper Location   Moved to Proper Location
         Extents  Blocks      Extents  Blocks
  au 0   2173     19409       235      4984
  au 1   0        0           0        0
  au 2   0        0           0        0
  au 3   0        0           0        0
  au 4   0        0           0        0
  total  2173     19409       235      4984

         Moved to Free Area   In Free Area   Could not be Moved
         Extents  Blocks      Extents Blocks Extents  Blocks
  au 0   0        0           0       0      0        0
  au 1   0        0           0       0      0        0
  au 2   0        0           0       0      0        0
  au 3   0        0           0       0      0        0
  au 4   0        0           0       0      0        0
  total  0        0           0       0      0        0

Note that the default five passes were scheduled, but the reorganization finished in two passes. This file system had not had much activity since the last reorganization, so little reorganization was required. The time it takes to complete extent reorganization varies, depending on fragmentation and disk speeds; in general, extent reorganization may be expected to take approximately one minute for every 10 megabytes of disk space used.

In the preceding example, the Extents Searched column contains the total number of extents examined. Under the Consolidations Performed heading, the Number column contains the total number of consolidations (mergings of extents) performed, the Extents column contains the total number of extents that were consolidated (more than one extent may be consolidated in one operation), and the Blocks column contains the total number of blocks that were consolidated. Under the Total Errors heading, the File Busy column contains the total number of reorganization requests that failed because the file was active during reorganization, and the Not Free column contains the total number of requests that failed because an extent the reorganizer expected to be free was allocated at some time during the reorganization.

The In Proper Location column contains the total extents and blocks that were already in the proper location at the start of the pass. The Moved to Proper Location column contains the total extents and blocks that were moved to the proper location during the pass. The Moved to Free Area column contains the total number of extents and blocks that were moved into a convenient free area in order to free up space designated as the proper location for an extent in the allocation unit being reorganized. The In Free Area column contains the total number of extents and blocks that were in areas designated as free areas at the beginning of the pass. The Could not be Moved column contains the total number of extents and blocks that were in an undesirable location and could not be moved. This occurs when there is not
enough free space to allow sufficient extent movement to take place. This often occurs on the first few passes for an allocation unit if a large amount of reorganization needs to be performed. If the next-to-last pass of the reorganization run indicates extents that cannot be moved, the reorganization fails. A failed reorganization may leave the file system badly fragmented, since free areas are used when trying to free up reserved locations. To lessen this fragmentation, extents are not moved into the free areas on the final two passes of the extent reorganizer, and the last pass only consolidates free space.

To defragment a BaseJFS file system, you need to perform the same steps you would for an HFS file system:

1. Back up the file system (with fbackup).
2. Make a new file system (with newfs).
3. Restore the data from tape (with frecover).


10–18. SLIDE: Using setext

Student Notes

setext specifies a fixed extent size for a file and reserves space for a file. The file must already exist.

setext [-F vxfs] [-e extentsize] [-r reservation] [[-f flag]...] file

Options:

-e extentsize    Specify a fixed extent size (in file system blocks).
-r reservation   Preallocate space (in file system blocks).
-f align         All extents are aligned on extentsize boundaries relative to the
                 start of allocation units.
-f contig        The reservation must be allocated contiguously.
-f noextend      The file may not be extended once the preallocated space has been
                 used.
-f chgsize       The reservation is incorporated into the file; the on-disk inode is
                 updated with size and block count information that includes the
                 reserved space.

Using setext

The setext command can manipulate the extent allocation policies of the JFS file system on a file-by-file basis:

• Use setext to override default VxFS extent allocation policies
• Specify the extent size
• Force files to be contiguous
• Pre-reserve space for future contiguous growth
• Prevent files from growing past the reservation
• Use getext to view file parameters
• Use ls -le to view extent parameters


-f noreserve Reservation made as non-persistent allocation to file; on-disk inode not updated; associated with file until last close, then trimmed to current file size

-f trim Reservation is trimmed to current file size upon last close by all processes that have the file open

Example using setext:

# touch bigfile.0 bigfile.1 bigfile.2
# /usr/sbin/setext -F vxfs -r 4096 -f contig bigfile.1
# /usr/sbin/setext -F vxfs -f align -e 128 bigfile.2
# cp bigfile bigfile.0
# cp bigfile bigfile.1
# cp bigfile bigfile.2
# ls -l bigfile*
-rw-r--r--   1 root   other   2691000 Nov  2 10:52 bigfile.0
-rw-r--r--   1 root   other   2691000 Nov  2 10:53 bigfile.1
-rw-r--r--   1 root   other   2691000 Nov  2 10:53 bigfile.2
# /usr/sbin/getext -F vxfs bigfile.*
bigfile.0:  Bsize 1024  Reserve    0  Extent Size   0
bigfile.1:  Bsize 1024  Reserve 4096  Extent Size   0
bigfile.2:  Bsize 1024  Reserve    0  Extent Size 128

Example output from ls -le:

# ls -le bigfile*
-rw-r--r--   1 root   other   2691000 Nov  2 10:52 bigfile.0
-rw-r--r--   1 root   other   2691000 Nov  2 10:53 bigfile.1 :res 4096 ext 0
-rw-r--r--   1 root   other   2691000 Nov  2 10:53 bigfile.2 :res 0 ext 128


10–19. SLIDE: I/O Tunable Parameters

Student Notes

JFS Tunable Parameters

I/O Parameter    Description

read_pref_io     The preferred read request size. The file system uses this in
                 conjunction with the read_nstream value to determine how much data to
                 read ahead. Default value is 64K.

write_pref_io    The preferred write request size. The file system uses this in
                 conjunction with the write_nstream value to determine how to do
                 flush-behind on writes. Default value is 64K.

read_nstream     The number of parallel read requests of size read_pref_io to have
                 outstanding at one time. The file system uses the product of
                 read_nstream multiplied by read_pref_io to determine its read-ahead
                 size. Default value for read_nstream is 1.

I/O Tunable Parameters

• JFS provides a set of eleven (11) tunable I/O parameters.
• If the default I/O parameters are not acceptable, then the /etc/vx/tunefstab file can be used.
• mount_vxfs(1M) invokes the vxtunefs(1M) command to process the contents of the /etc/vx/tunefstab file.
• Failure to set I/O parameters does not prevent the mount from occurring.


write_nstream The number of parallel write requests of size write_pref_io to have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush-behind on writes. Default value for write_nstream is 1.

(Only the first four parameters are described here. Refer to the vxtunefs(1M) man page for the remainder.)
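As the descriptions above state, the read-ahead size is the product read_nstream × read_pref_io, and flush-behind is governed by write_nstream × write_pref_io. A small awk sketch that derives both from vxtunefs-style output (the here-document stands in for real `vxtunefs /mount_point` output and mirrors the default values):

```shell
# Derive the effective read-ahead and flush-behind sizes from the
# preferred-I/O and nstream parameters, as printed by vxtunefs.
awk -F' = ' '
$1 == "read_pref_io"  { rp = $2 }
$1 == "read_nstream"  { rn = $2 }
$1 == "write_pref_io" { wp = $2 }
$1 == "write_nstream" { wn = $2 }
END {
    printf "read-ahead size:   %d bytes\n", rp * rn
    printf "flush-behind size: %d bytes\n", wp * wn
}' <<'EOF'
read_pref_io = 65536
read_nstream = 1
write_pref_io = 65536
write_nstream = 1
EOF
```

With the defaults shown, both products come out to 65536 bytes (64 KB); raising read_nstream to 4 on a four-way stripe would quadruple the read-ahead size.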


10–20. SLIDE: vxtunefs Command for Tuning VxFS

Student Notes

The slide shows the output of the vxtunefs command being used to query the configuration of a VxFS file system.

vxtunefs Command Details

/sbin/vxtunefs [-ps] [-f tunefstab] [-o parameter=value] [{mount_point|block_special}]...

Options:

-f filename           Use filename instead of /etc/vx/tunefstab as the file containing
                      tuning parameters.
-o parameter=value    Specify parameters for the file systems listed on the command
                      line. The parameters are listed below.
-p                    Print the tuning parameters for all the file systems specified on
                      the command line.

vxtunefs Command for Tuning VxFS

# vxtunefs /tondir
Filesystem i/o parameters for /tondir
read_pref_io = 65536              # Preferred read request size is 64k
read_nstream = 1                  # Desired number of parallel read_pref_io's
read_unit_io = 65536
write_pref_io = 65536             # Preferred write request size is 64k
write_nstream = 1                 # Desired number of parallel write_pref_io's
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 131072
discovered_direct_iosz = 262144   # Large I/Os treated like direct for speed
max_direct_iosz = 131072
default_indir_size = 8192
qio_cache_enable = 0
max_diskq = 1048576
initial_extent_size = 8
max_seqio_extent_size = 2048
max_buf_data_size = 8192


-s Set the new tuning parameters for the VxFS file systems specified on the command line or in the tunefstab file.

vxtunefs sets or prints tunable I/O parameters of mounted file systems. vxtunefs can set parameters describing the I/O properties of the underlying device, parameters to indicate when to treat an I/O as direct I/O, or parameters to control the extent allocation policy for the specified file system. With no options specified, vxtunefs prints the existing VxFS parameters for the specified file systems.

vxtunefs works on a list of mount points specified on the command line, or on all the mounted file systems listed in the tunefstab file. The default tunefstab file is /etc/vx/tunefstab; you can change the default using the -f option. vxtunefs can be run at any time on a mounted file system, and all parameter changes take immediate effect. Parameters specified on the command line override parameters listed in the tunefstab file. If /etc/vx/tunefstab exists, the VxFS-specific mount command invokes vxtunefs to set device parameters from /etc/vx/tunefstab.


10–21. SLIDE: /etc/vx/tunefstab Configuration

Student Notes

The tunefstab file contains tuning parameters for VxFS file systems. vxtunefs sets the tuning parameters for mounted file systems by processing command line options or by reading parameters in the tunefstab file. Each entry in tunefstab is a line of fields in one of the following formats:

block-device    tunefs-options
system-default  tunefs-options

block-device is the name of the device on which the file system exists. If there is more than one line that specifies options for a device, each line is processed and the options are set in order. In place of block-device, system-default specifies tunables for each device to process. If there is an entry for both a block device and a system default, the system default value takes precedence. Lines in tunefstab that start with the pound (#) character are treated as comments and ignored.

/etc/vx/tunefstab Configuration

• File is read every time a VxFS is mounted.

• Automatic permanent vxtunefs options implemented here

• File format as follows:

block-device tunefs-options
system-default tunefs-options

• Options can be set for individual file systems or globally for all VxFS file systems


The tunefs-options correspond to the tunable parameters that vxtunefs and mount_vxfs set on the file system. Each option in this list is a name=value pair. Separate the options with commas, with no spaces or tabs between the options and the commas. See the vxtunefs(1M) manual page for a description of the supported options.

Examples

If you have a four-column striped volume, /dev/vg01/lvol3, with a stripe unit size of 128 KB per disk, set the read_pref_io parameter to 128 KB and the read_nstream parameter to 4. You can do this in either of two ways:

/dev/vg01/lvol3 read_pref_io=128k,read_nstream=4

or:

/dev/vg01/lvol3 read_pref_io=128k
/dev/vg01/lvol3 read_nstream=4

To set the discovered direct I/O size so that it is always lower than the default, add the following line to the /etc/vx/tunefstab file:

/dev/dsk/c3t1d0 discovered_direct_iosz=128K


10–22. SLIDE: Taking Snapshots and Performance

Student Notes

Performance of the “Advanced” (Snapped) File System.

The write performance of the online (snapped) file system will be degraded, but read performance will stay the same. It is important to ensure that the snapshot file system (the backup) resides on a different physical disk; otherwise, backup I/O will use up valuable bandwidth. The initial write to a block after the snapshot is started will be 2 to 3 times slower, because it requires three operations:

1. Read the old data.
2. Write the old data to the snapshot.
3. Write the new data.

Multiple snapshots make this process even slower. Only the initial write suffers; subsequent changes to the same block are not recorded in the snapshot and therefore proceed at normal speed.

Taking Snapshots and Performance

• Issues for the “online” snapped file system:
  – Read performance should not be affected.
  – Any writes after the snap will be 2-3 times slower.
  – Subsequent writes to the same area will perform normally.
  – Keep the snapshot on a separate physical disk.
  – Tests of OLTP workloads show 15-20% degradation.

• Issues for the “backup” snapshot file system:
  – Snapshot performance should be equivalent to normal JFS.
  – Read performance suffers if the snapped (online) half is busy.
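The three-step copy-on-write sequence can be illustrated with a toy POSIX shell sketch using plain files (the cow_write helper and the /tmp file names are invented for illustration; this is not VxFS code):

```shell
#!/usr/bin/sh
# Toy copy-on-write sketch: the first write to a "block" must first save
# the old contents into the snapshot; later writes to the same block skip
# that extra work, which is why only the initial write is slow.
echo "old data" > /tmp/cow_block        # the online file system's block
: > /tmp/cow_snapshot                   # empty snapshot area

cow_write() {                           # $1 = new contents for the block
    if [ ! -s /tmp/cow_snapshot ]; then # first write since the snapshot:
        cat /tmp/cow_block >> /tmp/cow_snapshot  # steps 1-2: read and save old data
    fi
    echo "$1" > /tmp/cow_block          # step 3: write the new data
}

cow_write "new data 1"   # slow path: old data copied to the snapshot first
cow_write "new data 2"   # fast path: snapshot already holds the old data
```

The second call skips the copy, matching the note that subsequent writes to the same area proceed at normal speed.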


Overall impact will depend on the read-to-write ratio and the mix of I/O operations. For example, Oracle running an OLTP workload on a snapped file system measured about 15 to 20% slower than on a file system that was not being snapped.

Performance of the “Backup” (Snapshot) File System.

Performance of the snapshot is maximized at the expense of writes to the snapped file system. Reads from a snapshot file system will typically be at the same rate as from a normal JFS file system, allowing backups to proceed at the full speed of JFS. Reads from the snapshot are impacted if the snapped file system is very busy. Remember the read data comes from the snapped file system unless it has been modified.


10–23. LAB: JFS File System Tuning

Directions

The following lab exercise compares the performance of JFS under different mount options. The mount options used with JFS can have a big impact on JFS performance.

1. Mount a JFS file system to be used for this lab under /vxfs.

   # mount /dev/vg00/vxfs /vxfs

2. Because the above mount command specified no special mount options, the default mount options are used. Use the mount -v command to view the default options, including the option for the transaction logging type.

   What type of transaction logging does JFS use by default?

3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20-MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.

   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results: Real: _____________ User: ____________ Sys: ____________

4. Remount the JFS file system using the delaylog option. This helps the performance of noncritical transactions. Run the command three times and record the middle results.

   # cd /
   # umount /vxfs
   # mount -o delaylog /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results: Real: _____________ User: ____________ Sys: ____________


Based on the results, does the disk_long program perform many noncritical transactions?

5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from the lecture), and before the transaction is written to the intent log. Run the command three times and record the middle results.

   # cd /
   # umount /vxfs
   # mount -o tmplog /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results: Real: _____________ User: ____________ Sys: ____________

   Based on the results, why does the disk_long program show little improvement when mounted with tmplog?

6. Remount the JFS file system using the mincache=tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.

   # cd /
   # umount /vxfs
   # mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long
   # timex ./disk_long
   # timex ./disk_long

   Record middle results: Real: _____________ User: ____________ Sys: ____________


7. Remount the JFS file system using the mincache=direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.

   # cd /
   # umount /vxfs
   # mount -o mincache=direct /dev/vg00/vxfs /vxfs
   # cd /vxfs
   # timex ./disk_long

   Record results: Real: _____________ User: ____________ Sys: ____________

   Based on the results, why does the disk_long program show poor performance when mounted with mincache=direct? When would this option be appropriate to use?

8. Unmount the VxFS file system.

   # umount /vxfs


Module 11 Network Performance

Objectives

Upon completion of this module, you will be able to do the following:

• List factors directly related to network performance.

• Describe how to determine network workloads (server and client).

• Evaluate UDP and TCP transport options.

• Identify a network bottleneck.

• List possible solutions for a network performance problem.


11–1. SLIDE: The OSI Model

Student Notes

Networking allows one computer (server) to communicate with and share its local files and directories with other computers (clients) in a homogeneous environment.

Network Protocols

• NFSD, MOUNTD, FTPD, TELNETD: The networking server daemons respond to requests from clients and perform the requested operations.

• BIOD, FTP, TELNET: The networking user applications request operations to be performed for them on the server.

• XDR: External Data Representation is a machine-independent data format used by applications to translate machine-dependent data formats into a universal format that can be used by other networking hosts using XDR.

The OSI Model

The slide maps the OSI layers to the corresponding components on an HP-UX server and client:

   OSI Layer      Server                        Client
   ------------   ---------------------------   -----------------
   Application    nfsd, mountd, telnetd, ftpd   ftp, biod, telnet
   Presentation   XDR                           XDR
   Session        RPC                           RPC
   Transport      UDP/TCP                       UDP/TCP
   Network        IP                            IP
   Data Link      Data Link                     Data Link
   Physical       Physical                      Physical


• RPC/Session Layer: The remote procedure call mechanism allows a server machine to define a procedure that a client program can call. This is how a client performs file system operations such as creating, deleting, modifying, and viewing a directory; creating, deleting, modifying, and copying a file; and so on.

• UDP/TCP: Transport protocols that move large amounts of data efficiently. Because there is no acknowledgment from the receiver, UDP is considered unreliable, whereas TCP is considered reliable. However, TCP generally has more overhead and therefore does not perform as well as UDP.

• IP: The Internet Protocol is a network protocol responsible for getting packets between hosts on one or more networks that are linked together.

• Data Link: The data link layer defines how packets are assembled on the physical wire. Examples of data link protocols include IEEE 802.3 (CSMA/CD), IEEE 802.4 (Token Bus), and IEEE 802.5 (Token Ring).

• Physical: The physical layer describes the actual transfer media and how data is transferred on the network. Examples of physical media include twisted pair, coaxial, and fiber optics.


11–2. SLIDE: NFS Read/Write Data Flow

Student Notes

As a prime example of how network performance can affect applications, let's look at how NFS works. The slide shows a high-level overview of the sequence of events that occur when an NFS client attempts to access data on an NFS server:

1. A user process issues the read() system call against an NFS-mounted file system. The user process goes into a wait state, waiting for the system call to return.

2. Upon checking the buffer cache for the requested data (assume the data is not in the buffer cache), the biod daemon immediately follows the original read with a read-ahead request. This is done by biod so that subsequent I/O requests have a better chance of being satisfied through the buffer cache.

3. The NFS subsystem within the kernel on the client issues an RPC read request on behalf of the process (and a second on behalf of biod) to the NFS server.

4. The NFS server receives the request and schedules an nfsd process to handle it.

NFS Read/Write Data Flow

[Slide diagram: an NFS client (user process, biod, kernel NFS, buffer cache) and an NFS server (nfsd, kernel NFS, buffer cache, exported file system) joined by mount server:/data /data, with numbered arrows 1-8 matching the steps in the notes.]


5. The nfsd daemon performs the file system read, and the data is returned to the nfsd daemon through the server's buffer cache.

6. The NFS subsystem within the kernel on the server schedules a reply to the client containing the requested data.

7. The data is returned to the client process through the buffer cache on the client. The data, plus the data read ahead by the biod daemon, is stored in both the client's and server's buffer caches to allow future I/O requests to be satisfied from the buffer caches.

8. The read system call is returned (along with the data) to the client process.

As you can see, NFS initiates a fair amount of traffic over the network. Other services, such as telnet and ftp, have their own performance profiles. Some are interactive and response time is important. Others are task-oriented and rely mostly on throughput.


11–3. SLIDE: NFS on HP-UX with UDP

Student Notes

NFS packets come into the NFS server through the UDP receive queue (port 2049). The size of this queue is 256 KB, and the NFS packets are processed sequentially, first in, first out. Upon receipt of an NFS packet, an nfsd daemon is awakened; it removes the request from the queue and processes it. If requests come into the server faster than the daemons can process them, the UDP queue quickly begins to back up with requests. If the UDP queue is full when a new request arrives, the new request is dropped off the back of the queue. This is known as a UDP socket overflow. To prevent this, always have a sufficient number of daemons running.

Regardless of how many nfsd daemons are running, only one is awakened for each incoming request, so extra idle daemons do not degrade performance. This allows a site to configure for the demands of its peak workload without suffering performance problems during periods of light demand, and it means NFS tuning can focus on file system and network performance rather than CPU performance.

NFS on HP-UX with UDP

• NFS packets arrive in the UDP socket (port 2049).

• The UDP socket is a 256-KB FIFO queue.

• The UDP socket is emptied by the nfsds.

• Not enough nfsds causes NFS packets to back up in the queue.

[Slide diagram: on the server, NFS requests queue on the UDP socket at port 2049 and are drained by the pool of nfsd daemons, which read the exported file system through kernel NFS.]


11–4. SLIDE: NFS on HP-UX with TCP

Student Notes

Network File System (NFS) is now supported over the connection-oriented protocol TCP/IP for NFS versions 2 and 3, in addition to running over the User Datagram Protocol (UDP). TCP transport increases dependability on wide-area networks (WANs); packets are generally delivered more consistently because TCP provides congestion control and error recovery. As a result, NFS is now supported over WANs: as long as TCP is supported on the WAN, NFS is supported also.

The mount_nfs command now supports a proto= option on the command line, where the value for proto can be either UDP or TCP. (In the past, this option was ignored.) This change allows the administrator to specify which transport protocol to use when mounting a remote file system. If the proto= option is not specified, NFS attempts a TCP connection by default; if that fails, it then tries a UDP connection. Thus, by default, you will begin using TCP instead of UDP for NFS traffic when you begin using the 11i version of HP-UX. This should have little impact on you. You do, however, have the option to specify either UDP or TCP connections.

NFS on HP-UX with TCP

• 16 nfsd processes are started by default.

• Multiple nfsds respond to the UDP queue.

• A single multithreaded nfsktcpd process is dedicated to TCP.

• The client establishes the UDP or TCP method at mount time.

[Slide diagram: on the server, multiple nfsd daemons service the UDP socket on port 2049 while a single nfsktcpd services the TCP socket; both paths read the exported file system through kernel NFS.]


If you specify a proto= option, only the specified protocol is attempted. If the server does not support the specified protocol, the mount fails.

nfsd now opens TCP transport endpoints to receive incoming TCP requests. For TCP, nfsktcpd is multithreaded; for UDP, nfsd is still multiprocessed. Kernel TCP threads execute under the nfsktcpd process.

When counting the number of nfsd processes, keep in mind the following algorithm: NUM_NFSDS nfsds that support UDP are created per processor, and only one nfsd that supports TCP is created. In the case of a four-way machine with NUM_NFSDS=4 (set in /etc/rc.config.d/nfsconf), 17 nfsds are created: 16 for UDP (4 per processor) and 1 for TCP.

nfsstat now reports TCP RPC statistics for both client and server. The TCP statistics appear under the connection-oriented tag and the UDP statistics under the connectionless-oriented tag.
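The counting rule can be sketched with POSIX shell arithmetic (NUM_NFSDS and the processor count are the four-way example's assumed values, not read from a live system):

```shell
#!/usr/bin/sh
# nfsd daemon count as described above: NUM_NFSDS UDP nfsds per processor,
# plus one TCP nfsd (backed by the multithreaded nfsktcpd).
NUM_NFSDS=4          # from /etc/rc.config.d/nfsconf (assumed value)
PROCESSORS=4         # a four-way machine (assumed value)
UDP_NFSDS=$((NUM_NFSDS * PROCESSORS))
TOTAL_NFSDS=$((UDP_NFSDS + 1))
echo "UDP nfsds: $UDP_NFSDS  total nfsds: $TOTAL_NFSDS"
```

With these values the sketch reproduces the 16-plus-1 count from the example.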


11–5. SLIDE: biod on Client

Student Notes

The biod daemons help an NFS client maintain the illusion that its NFS-mounted file systems are on local disks. They improve NFS client performance by performing read-aheads and write-behinds for the client processes.

Read-Ahead Requests

The biod daemons help read performance on NFS clients by reading ahead (that is, prefetching) data into the buffer cache so that when the client needs the data, it is already in its buffer cache. When an NFS client initiates a read request and the data is not in its local buffer cache, the process performs the RPC read itself. To prefetch data for the buffer cache, the kernel has the biod daemons send additional RPC read requests to the NFS server, just as if the NFS client process had requested this data. Subsequent read requests by the client (especially if reading sequentially) will find the data already in the buffer cache.

biod on Client

[Slide diagram: two views of client memory. In the read case, the process's read() (1) goes through kernel NFS and the buffer cache while a biod daemon issues the read-ahead request (2). In the write case, the process's write() calls (1) land in the buffer cache and biod daemons carry the write requests to the server (2).]


Write-Behind Requests

The biod daemons assist in write performance by allowing the NFS client process performing the write() call to return immediately rather than waiting for the write() call to complete. When an NFS client performs a write() call, the data is written to the client's buffer cache. Once the data is in the buffer cache, the kernel schedules an RPC write to occur. If there are available biod daemons, the kernel can assign the write to a biod daemon rather than to the NFS client process. This allows the client process to continue its execution without having to wait for the write() call to return; the biod daemon waits for the write call instead.

NOTE: Without any biod daemons on the client, NFS still works. The difference is that no read-aheads are done, causing NFS read performance to suffer, and all NFS clients performing writes are forced to wait for the RPC write requests to return, causing NFS write performance to suffer.
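As a loose analogy (plain shell, not NFS code), a background helper process lets the caller return immediately while the helper absorbs the wait, which is essentially what biod does for an RPC write:

```shell
#!/usr/bin/sh
# "Write-behind" analogy: hand the slow write to a background helper so
# the caller can continue at once, then synchronize only when needed.
slow_write() { sleep 1; echo "write complete" > /tmp/wb_result; }

slow_write &                 # the helper (like biod) waits on the write
HELPER=$!
echo "caller continues immediately"

wait $HELPER                 # synchronize with the helper
RESULT=$(cat /tmp/wb_result)
echo "$RESULT"
rm -f /tmp/wb_result
```

The "caller continues immediately" line prints roughly a second before the write completes, mirroring how the client process runs ahead of the RPC write.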


11–6. SLIDE: TELNET

Student Notes

Telnet also uses sockets. A socket is simply a system and port pair; a connection is a pair of sockets. On the client (when the user enters the telnet command), a port is assigned to the process from a pool of available ports, forming a socket on the client. A connection is established between that port and port 23 on the server (used exclusively to handle incoming telnet requests). On the server, as a result of the connection, a telnetd daemon is spawned and linked to port 23.

Now the telnet process running on the client (1) issues a request to execute some command on the server. The command is placed in a packet and sent through the socket on the client (2) to the socket on the server (3). The command is removed from the packet and given to the telnetd daemon (4) to execute. The telnetd daemon executes the command and places the result in a packet. That packet is sent through the socket on the server (5) to the socket on the client (6). The results are removed from the packet and sent to the telnet process (7).

TELNET

[Slide diagram: telnet on the client (1) connected through a client-side port (2) to port 23 and telnetd on the server (3-4), with the reply returning through the server socket (5) to the client socket (6) and the telnet process (7).]


By default, telnet uses TCP for its transfers, since it needs to establish a firm connection between the client process and the server daemon.


11–7. SLIDE: FTP

Student Notes

FTP also uses sockets. It uses a pair of connections to perform all its operations: one connection passes commands and their results back and forth, while the other passes file data. On the client (when the user enters the ftp command), a port is assigned to the process from a pool of available ports, forming a socket on the client. A connection is established between that port and port 21 on the server (used exclusively to handle incoming ftp requests). On the server, as a result of the connection, an ftpd daemon is spawned and linked to port 21.

Now the ftp process running on the client (1) issues a request to execute some command on the server. The command is placed in a packet and sent through the socket on the client (2) to the socket on the server (3). The command is removed from the packet and given to the ftpd daemon (4) to execute.

FTP

[Slide diagram: ftp on the client holding two connections to ftpd on the server: a control connection to port 21 and a data connection from port 20, with numbered arrows 1-14 matching the steps in the notes.]


The ftpd daemon executes the command and places the result in a packet. That packet is sent through the socket on the server (5) to the socket on the client (6). The results are removed from the packet and sent to the ftp process (7). If the command involves the transfer of some file data, the ftp process on the client (or the ftpd daemon on the server) initiates the transfer of the data from one socket to the other – using port 20 on the server and another available port on the client. For example, let’s say that the user entered the ftp command:

get /etc/hosts /tmp/hosts

When the command arrives at the ftpd daemon, it triggers a read of the /etc/hosts file from the server's file system into the server's buffer cache. Once there, the daemon (8) places the contents of the file into one or more packets (as necessary) and sends them to port 20 (9). The packets arrive at the socket on the client (10) and are reassembled into an image of the file in a network buffer, which is then copied into the client's buffer cache. The ftp process (11) acknowledges receipt of the file by sending a packet through its socket (12) across the network to the socket on the server (13), where it is extracted and passed on to the daemon (14).

By default, ftp uses TCP for its transfers, since it needs to establish two firm connections between the client process and the server daemon.


11–8. SLIDE: Metrics to Monitor — NFS

Student Notes

Number of nfsd daemons

Too few nfsd daemons can hinder performance on the NFS server. If all the nfsd daemons are busy when new NFS requests come in, then the requests have to wait until one of the daemons becomes free.

Monitor Ratio between calls and nfsd daemons

It is important to monitor the total RPC traffic (represented by the RPC calls field) relative to the NFS traffic (represented by the nfsdrun or NFS calls field). This can be especially helpful when there are multiple RPC-based applications (for example, NIS) running on the same system.

NOTE: The nfsdrun field is no longer present in the HP-UX 11.00 release, due to differences in how nfsd daemons are run relative to the 10.x release. To monitor the ratio of RPC to NFS traffic on HP-UX 11.00, use the calls fields.

Metrics to Monitor — NFS

• Number of nfsd daemons:
  – Monitor ratio between calls
  – Monitor CPU time used by all nfsd daemons

• Number of biod daemons:
  – Monitor number of waits due to no biods available
  – Monitor CPU time used by all biod daemons

• Number of badcalls

• Number of read and write NFS calls

• Number of UDP socket overflows, timeouts, retransmissions, and late responses


Monitor CPU Time Used by All nfsd and biod Daemons

Distribution of NFS requests is spread over all nfsd daemons. This means each daemon is scheduled sequentially as NFS requests arrive. A sample CPU distribution on HP-UX would look something like:

CPU Utilization in minutes:

10 |
 9 |
 8 |
 7 |
 6 |
 5 |     X         X
 4 |     X         X         X         X
 3 |     X         X         X         X
 2 |     X         X         X         X
 1 |     X         X         X         X
 0 |_____X_________X_________X_________X__
NFSDs    1         2         3         4

While the scheduling algorithm evenly balances the NFS call load across all nfsd daemons, it makes it difficult to determine if enough nfsd daemons are running on the server.

The number of biod daemons

Too many biod daemons can cause NFS servers to be flooded with requests, since each biod daemon can have an outstanding NFS request pending. Increasing the number of biod daemons on a client increases the number of NFS requests the client can have pending. Too few biod daemons could mean an NFS request has to be performed by the client process itself (which means it has to wait) because no biod daemons are available. When the client process performs an RPC, the wait field (from the nfsstat -c command) is incremented by one. Too many waits indicate not enough biod daemons.

Number of Read/Write NFS Calls

The NFS read and NFS write RPC calls are the most resource-intensive of the NFS RPCs. Monitoring the percentage and quantities of these calls helps to give an indication of the total load these calls are placing on the NFS server.

Number of nullrecv

If the nfsd daemons are not being kept busy, this counter is incremented. If this counter is incrementing, try reducing the number of nfsd daemons on the system until nullrecv stops increasing.

Use netstat -p udp to View the Number of UDP Socket Overflows

UDP socket overflows can occur when too many NFS clients are sending requests to the NFS server, and too few nfsd daemons are running to handle the requests. When all the nfsd daemons are servicing RPC requests, none of them can read a new request from the UDP socket. Incoming RPCs are queued until the UDP socket structure becomes full. If the socket queue is full when a new request arrives, a UDP socket overflow condition occurs.


Number of badcalls

Bad calls indicate that the NFS server cannot process RPC requests. This could be due to authentication problems caused by having a user in too many groups, attempts to access exported file systems as root, or an improper secure RPC configuration. This can also be due to the server being down, or soft-mounted NFS file systems timing out.

Number of Time-Outs, Retransmissions, and Late Responses

A time-out indicates that the RPC call did not complete within the expected time period. The late responses (also known as badxid) refer to the NFS server responding to the client after the time period has expired. If time-outs and late responses are approximately equal, it indicates a healthy network, but an overloaded NFS server. If time-outs are high and late responses are low (or zero), it indicates packets are never making it to the server, and the network components (interface cards, cables, hubs) need to be examined.


11–9. SLIDE: Metrics to Monitor — Network

Student Notes

Some key metrics to monitor from an overall network perspective include:

• Amount of Traffic. The amount of network traffic should be monitored across the entire LAN. However, unless network probes are available, this is very difficult to do. At a minimum, the amount of traffic into and out of the servers should be monitored with the netstat command.

When monitoring network traffic, it is important to know the maximum packets per second on the LAN. In the case of a 10-Mbit/s Ethernet, this would be:

10 Mbits / 8 bits_per_byte          ≈ 1.2 MB per second   (total MB per second)
1.2 MB / 1 KB_average_packet_size   = 1,200 packets       (total packets per second)
1,200 * 30%_saturation_point        = 360 packets         (max packets per second with minimal collisions)

Metrics to Monitor — Network

• Amount of NFS traffic:
  – Monitor number of collisions
  – Monitor server workload
  – Monitor client workload

• Type of network topology and hardware

• Number of subnets

• Number of routers


• Type of Network Topology: Each network topology has different limitations. Ethernet is the most common, but it is the slowest. More recent Ethernet technologies are faster, offering 100 Mbits/sec or even 1000 Mbits/sec. FDDI is the fastest, but it is somewhat expensive. Token Ring has no collision issues (since it is token based), but it is not as pervasive.

• Number of Subnets: Subnetting is a method for localizing traffic to help reduce packet congestion. If too much traffic exists on a network, it may need to be split into multiple subnets.

• Number of Routers: Routers are another possible solution to help segment network traffic. In addition, routers can help with network security issues and routing of diverse packet types.


11–10. SLIDE: Determining the NFS Workload

Student Notes

The NFS workload on a server is defined as the total number of NFS packets received and processed. The NFS workload on a client is defined as the total number of NFS requests initiated from the client. It is important to establish a baseline for the NFS workload being placed on an NFS machine. This allows the system administrator to determine periods when the NFS workload is particularly high or low.

Sample Procedure for Calculating the NFS Workload

1. On Monday morning at 8:00 AM, run nfsstat -z. This zeroes out all the NFS counters.

2. On Friday evening at 5:00 PM, run nfsstat -rs (on the server) or nfsstat -rc (on the client). This shows the total number of RPC calls. Sample outputs are:

Determining the NFS Workload

• Run nfsstat -s (server) or nfsstat -c (client) to view total RPC calls

• Each week, zero counters; at end of week, divide total RPC calls by 5 days, 8 hours per day, 60 mins per hour, 60 seconds per minute

• Write a script to automate the data collection, calculation, and notification of the system administrator

• Set up a cron job to execute the script at the necessary times


# nfsstat -rs
Server rpc:
calls      badcalls   nullrecv   badlen     xdrcall    nfsdrun
171792344  0          0          0          0          549734423

# nfsstat -rc
Client rpc:
Connection oriented: N/A
Connectionless oriented:
calls      badcalls   retrans    badxid     timeouts   waits      newcreds
17547240   0          0          240        360        0          0
badverfs   timers     toobig     nomem      cantsend   buflocks
0          7          0          0          0          0

3. Calculate the average number of NFS calls per second by dividing the total RPC calls by 5 days, 8 hours per day, 60 minutes per hour, and 60 seconds per minute:

((((171792344 calls/ 5 days) / 8 hours) / 60 min) / 60 sec) = 1193 RPC calls/sec
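The same arithmetic can be done directly in the shell. A minimal sketch; the call total is the sample figure from the nfsstat -rs output above, and integer arithmetic replaces bc:

```shell
# Average NFS calls/sec over a 5-day, 8-hour-per-day week.
# CALLS is the sample "calls" figure from the nfsstat -rs output above.
CALLS=171792344
DAYS=5
HOURS=8
AVG=$(( CALLS / DAYS / HOURS / 3600 ))      # 3600 seconds per hour
echo "average: $AVG RPC calls/sec"          # prints: average: 1193 RPC calls/sec
```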

Sample Script for Calculating an NFS Workload

A simple script for gathering and calculating the average number of NFS calls/second at the end of a week is shown below. Such a script could be called /usr/local/bin/create_nfs_report.

#!/usr/bin/sh
/usr/bin/nfsstat -rs | tail -1 | read calls trash
NFS_CALLS_PER_SEC=$(echo "$calls / 5 / 8 / 60 / 60" | bc)
HOST=$(hostname)
echo "The average NFS server calls (inbound) received for $(date +%x) \
was $NFS_CALLS_PER_SEC on $HOST" | mailx -s "NFS Report" root

#!/usr/bin/sh
/usr/bin/nfsstat -rc | tail -3 | read calls trash
NFS_CALLS_PER_SEC=$(echo "$calls / 5 / 8 / 60 / 60" | bc)
HOST=$(hostname)
echo "The average NFS client calls (outbound) initiated for $(date +%x) \
was $NFS_CALLS_PER_SEC on $HOST" | mailx -s "NFS Report" root

This report can be mailed to the performance analysis station or to the NFS machine.


Sample cron Entry for Automating the Procedure

To automate the process so that this happens every week without user intervention, the following two entries can be placed in root's crontab file:

0 8 * * 1  /usr/sbin/nfsstat -z
0 17 * * 5 /usr/local/bin/create_nfs_report

This is a very simplistic form of data collection. Much more involved scripts can be developed, for example, scripts that take into account the time of day when demand is heaviest, so that peak demand and demand patterns can be observed. Doing this on all NFS clients of an NFS server is also key. Establishing a baseline of the NFS workload initiated from all NFS clients allows the system administrator to determine periods when the NFS workload is particularly high or low.


11–11. SLIDE: NFS Monitoring — nfsstat Output

Student Notes

The nfsstat -s report shows NFS statistics on an NFS server. The report shows overall RPC statistics and detailed NFS type packets received.

Fields of Interest in This Report

calls (RPC) This is the total RPC calls received. This should be compared to the total NFS calls received. Analyze the ratio of RPC calls to NFS calls to determine the percentage of RPC calls that are NFS related.

nullrecv This is the number of times an NFS daemon (or other RPC daemon) was scheduled to run only to find nothing in the UDP queue. This was very common on a 9.x system, since every time an NFS packet was placed in the UDP queue, all the nfsd daemons were awakened. The first nfsd daemon would take the NFS packet and the other daemons would find no packets in the UDP queue (incrementing the nullrecv field).

The example on the slide shows all the RPC packets received are NFS related. The six nullrecvs explain the difference between the RPC calls and NFS calls.

# nfsstat –s

Connection oriented:
calls    badcalls  nullrecv  badlen   xdrcall  dupchecks  dupreqs
0        0         0         0        0        0          0
Connectionless oriented:
calls    badcalls  nullrecv  badlen   xdrcall  dupchecks  dupreqs
428      0         6         0        0        0          0

NFS Monitoring — nfsstat Output

# nfsstat -c

Client rpc:
Connection oriented:
calls     badcalls  badxids   timeouts  newcreds  badverfs  timers
0         0         0         0         0         0         0
cantconn  nomem     interrupts
0         0         0
Connectionless oriented:
calls     badcalls  retrans   badxids   timeouts  waits     newcreds
25345     304       1109      49        1410      0         0
badverfs  timers    toobig    nomem     cantsend  bufulocks
0         16        0         0         0         0


The reason for the nullrecv may be due to a client retransmission duplicate request. For example, if a client sends an NFS read request and does not receive a response within its time-out period, it will re-send the same request, which causes a duplicate entry to be in the server's UDP queue. When the first nfsd daemon removes the first NFS read request, it will also remove the duplicate request. This causes the second nfsd daemon to find an empty UDP queue when it executes.

nfsstat -s (Full Output)

# nfsstat -s
Server rpc:
Connection oriented:
calls      badcalls   nullrecv
0          0          0
badlen     xdrcall    dupchecks
0          0          0
dupreqs
0
Connectionless oriented:
calls      badcalls   nullrecv
0          0          0
badlen     xdrcall    dupchecks
0          0          0
dupreqs
0

Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr
0 0%       0 0%       0 0%
root       lookup     readlink
0 0%       0 0%       0 0%
read       wrcache    write
0 0%       0 0%       0 0%
create     remove     rename
0 0%       0 0%       0 0%
link       symlink    mkdir
0 0%       0 0%       0 0%
rmdir      readdir    statfs
0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr
0 0%       0 0%       0 0%
lookup     access     readlink
0 0%       0 0%       0 0%
read       write      create
0 0%       0 0%       0 0%
mkdir      symlink    mknod
0 0%       0 0%       0 0%
remove     rmdir      rename
0 0%       0 0%       0 0%
link       readdir    readdir+


0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%
commit
0 0%

The nfsstat -c report shows NFS statistics on an NFS client. The report shows the amount of RPC calls generated by the client, as well as the specific NFS calls.

Fields of Interest in This Report

calls (RPC) This is the total RPC calls generated by the client. This should be monitored relative to the total NFS calls generated. Analyze the ratio of RPC calls to NFS calls generated to determine the percentage of RPC calls that are NFS related.

waits This is the number of times an NFS client process is put into a wait state due to no biod daemons being available. An example of this would be during an NFS write. Normally, an NFS write is performed by a biod daemon, and the biod daemon waits for an acknowledgment to be returned by the NFS server. When no biod daemons are available, then the actual client process itself performs the NFS write and the client has to wait for the acknowledgment to be returned.

timeouts This is the number of times an NFS request was sent to the NFS server and no response was returned within the timeout period. A timeout can occur for two reasons: the NFS server machine is too busy and cannot get back to the client within the timeout period, or the network is having problems (collisions, bad interface card, bad hub) and the NFS request is never making it to the NFS server.

badxids This indicates a bad or duplicate transfer ID number was returned from the NFS server. When a client sends a request to the server, a corresponding transfer ID is sent with the request. When the NFS server responds, it specifies which transfer ID request it is responding to.

For example, if a client sends an NFS request (with a corresponding transfer ID) and does not hear back from the NFS server, it then transmits the request a second time. If the NFS server returns the first and second requests after the client had timed out the first time, the client will view the second response as a duplicate transfer ID.

NOTE: The ratio of timeouts to badxids is an excellent way to determine if timeouts are occurring due to a slow NFS server or due to a failed network component. If the badxids are approximately the same as timeouts, then the NFS server is slow and the timeout period should be increased. If there are a lot of timeouts with few to no badxids, then the NFS requests are not making it to the server and there is most likely a failed LAN component.
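The note above can be turned into a quick check. A sketch, using the sample counters from the nfsstat -c output earlier in this module; the 50% split point is an assumption for illustration, not an HP rule:

```shell
# Classify the likely cause of NFS timeouts from nfsstat -c counters.
TIMEOUTS=1410   # sample value from the nfsstat -c output above
BADXIDS=49      # sample value from the nfsstat -c output above
if [ "$TIMEOUTS" -eq 0 ]; then
    DIAG="no timeouts observed"
elif [ $(( BADXIDS * 100 / TIMEOUTS )) -ge 50 ]; then
    # server answered, just late: raise the timeout period
    DIAG="slow NFS server"
else
    # requests never reached the server: suspect a failed LAN component
    DIAG="failed LAN component"
fi
echo "$DIAG"
```

With 1410 timeouts and only 49 badxids, the check reports a failed LAN component, matching the rule of thumb above.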


retrans This indicates the number of NFS requests retransmitted due to timeouts. Keep in mind, not every timeout causes a retransmission, as most clients error out after two to three retries.

badcalls This indicates an NFS request has reached its retry count and has returned an error. This is most often due to the NFS client not being able to reach the NFS server (either because the NFS server is down, or the network link between the client and server is down).

nfsstat -c (Full Output)

# nfsstat -c
Client rpc:
Connection oriented:
calls      badcalls   badxids
0          0          0
timeouts   newcreds   badverfs
0          0          0
timers     cantconn   nomem
0          0          0
interrupts
0
Connectionless oriented:
calls      badcalls   retrans
55         0          0
badxids    timeouts   waits
0          0          0
newcreds   badverfs   timers
0          0          16
toobig     nomem      cantsend
0          0          0
bufulocks
0

Client nfs:
calls      badcalls   clgets
55         0          55
cltoomany
0
Version 2: (55 calls)
null       getattr    setattr
0 0%       50 90%     0 0%
root       lookup     readlink
0 0%       3 5%       0 0%
read       wrcache    write
0 0%       0 0%       0 0%
create     remove     rename
0 0%       0 0%       0 0%
link       symlink    mkdir
0 0%       0 0%       0 0%
rmdir      readdir    statfs
0 0%       1 1%       1 1%
Version 3: (0 calls)
null       getattr    setattr
0 0%       0 0%       0 0%


lookup     access     readlink
0 0%       0 0%       0 0%
read       write      create
0 0%       0 0%       0 0%
mkdir      symlink    mknod
0 0%       0 0%       0 0%
remove     rmdir      rename
0 0%       0 0%       0 0%
link       readdir    readdir+
0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%
commit
0 0%


11–12. SLIDE: Network Monitoring — lanadmin Output

Student Notes

The lanadmin command displays general network packet transmission statistics for a single system.

Fields of Interest in This Report

Collision Frames These fields indicate the number of collisions detected by the system. Collisions slow NFS performance, as the network has to subside before any packets can be sent following a collision.

Inbound/Outbound Packets This is the total of all packet types being sent and received from the system. Compare this to the total number of daemon-related packets transmitted/received to obtain a ratio of total network traffic relative to the specific traffic.

The primary metric for determining whether you have a network bottleneck is the ratio of collisions to outbound packets. In this example, you would take the total number of collisions (6221 + 10151 = 16372) and divide it by the total number of outbound packets (2626023 + 1454 = 2627477) to get the percentage of collisions per outbound packet (16372 / 2627477 = 0.6%). The commonly used threshold is 5%. Any network experiencing greater than a 5% collision

Network Monitoring — lanadmin Output

Network Management ID         = 4
Description                   = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                  = ethernet-csmacd(6)
MTU Size                      = 1500
Speed                         = 10000000
Station Address               = 0x800097bfb43
Administration Status (value) = up(1)
Operation Status (value)      = up(1)
Last Change                   = 4834
Inbound Octets                = 426550151
Inbound Unicast Packets       = 3380123
Inbound Non-Unicast Packets   = 1992200
Inbound Discards              = 0
Inbound Errors                = 1277
Inbound Unknown Protocols     = 53618
Outbound Octets               = 1653363768
Outbound Unicast Packets      = 2626023
Outbound Non-Unicast Packets  = 1454
Outbound Discards             = 1
Outbound Errors               = 0
Outbound Queue Length         = 0
Specific                      = 655367

Press <Return> to continue

Ethernet-like Statistics Group

Index                         = 4
Alignment Errors              = 0
FCS Errors                    = 0
Single Collision Frames       = 6221
Multiple Collision Frames     = 10151
Deferred Transmissions        = 116267
Late Collisions               = 0
Excessive Collisions          = 0
Internal MAC Transmit Errors  = 0
Carrier Sense Errors          = 0
Frames Too Long               = 0
Internal MAC Receive Errors   = 0

LAN Interface test mode. LAN Interface Net Mgmt ID = 4


rate is said to have a bottleneck. This system is well below that threshold. Of course, this metric only works on networks that experience collisions. Standard Ethernet does. Token rings do not.

The procedure for producing this report is:

1. Execute the lanadmin command.
2. From the main menu, select lan.
3. From the lan menu, select display.

Following is a complete output from this tool:

# lanadmin
          LOCAL AREA NETWORK ONLINE ADMINISTRATION, Version 1.0
                      Thu, Mar 25,2004  11:22:51
         Copyright 1994 Hewlett Packard Company. All rights are reserved.

Test Selection mode.

     lan      = LAN Interface Administration
     menu     = Display this menu
     quit     = Terminate the Administration
     terse    = Do not display command menu
     verbose  = Display command menu

Enter command: lan

LAN Interface test mode. LAN Interface PPA Number = 0

     clear    = Clear statistics registers
     display  = Display LAN Interface status and statistics registers
     end      = End LAN Interface Administration, return to Test Selection
     menu     = Display this menu
     ppa      = PPA Number of the LAN Interface
     quit     = Terminate the Administration, return to shell
     reset    = Reset LAN Interface to execute its selftest
     specific = Go to Driver specific menu

Enter command: display

                      LAN INTERFACE STATUS DISPLAY
                      Thu, Mar 25,2004  11:23:02

PPA Number                    = 0
Description                   = lan0 HP PCI 10/100Base-TX Core [100BASE-TX,FD,AUTO,TT=1500]
Type (value)                  = ethernet-csmacd(6)
MTU Size                      = 1500
Speed                         = 100000000
Station Address               = 0x306e48c545
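The collision-rate arithmetic can be scripted. A sketch using the counters from the first lanadmin display in this topic, with integer math carrying one decimal place:

```shell
# Collision rate = (single + multiple collision frames) / outbound packets.
SINGLE=6221          # Single Collision Frames (from lanadmin)
MULTI=10151          # Multiple Collision Frames
OUT_UCAST=2626023    # Outbound Unicast Packets
OUT_NON=1454         # Outbound Non-Unicast Packets
COLLS=$(( SINGLE + MULTI ))
OPKTS=$(( OUT_UCAST + OUT_NON ))
PCT10=$(( COLLS * 1000 / OPKTS ))           # percent, scaled by 10
echo "collision rate: $(( PCT10 / 10 )).$(( PCT10 % 10 ))%"
if [ "$PCT10" -ge 50 ]; then
    echo "above the 5% threshold: likely network bottleneck"
fi
```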


Administration Status (value) = up(1)
Operation Status (value)      = up(1)
Last Change                   = 780
Inbound Octets                = 1144058672
Inbound Unicast Packets       = 3513729
Inbound Non-Unicast Packets   = 2575374
Inbound Discards              = 0
Inbound Errors                = 0
Inbound Unknown Protocols     = 13895
Outbound Octets               = 784916247
Outbound Unicast Packets      = 3600289
Outbound Non-Unicast Packets  = 379474
Outbound Discards             = 0
Outbound Errors               = 0
Outbound Queue Length         = 0
Specific                      = 655367

Press <Return> to continue <CR>

          Ethernet-like Statistics Group

Index                         = 1
Alignment Errors              = 0
FCS Errors                    = 0
Single Collision Frames       = 0
Multiple Collision Frames     = 0
Deferred Transmissions        = 0
Late Collisions               = 0
Excessive Collisions          = 0
Internal MAC Transmit Errors  = 0
Carrier Sense Errors          = 0
Frames Too Long               = 0
Internal MAC Receive Errors   = 0

LAN Interface test mode. LAN Interface PPA Number = 0

     clear    = Clear statistics registers
     display  = Display LAN Interface status and statistics registers
     end      = End LAN Interface Administration, return to Test Selection
     menu     = Display this menu
     ppa      = PPA Number of the LAN Interface
     quit     = Terminate the Administration, return to shell
     reset    = Reset LAN Interface to execute its selftest
     specific = Go to Driver specific menu

Enter command: quit
#


11–13. SLIDE: Network Monitoring — netstat –i Output

Student Notes

The netstat command can be used to monitor total collisions and total packet traffic in and out of a LAN card, as well as any UDP socket overflows.

The -i option monitors input packets, input errors, output packets, output errors, and collisions for every LAN card on the system. Some versions of this tool did not show the Input Errors, Output Errors, and Collisions columns. The output of this tool can also be used to calculate the collision/outbound-packet ratio described in the previous topic.

The -p udp option monitors overflows related to the UDP socket queue. If there are not enough nfsd daemons, the volume of incoming client NFS requests can exceed the server's ability to drain these requests from the UDP socket queue. When the socket queue becomes full and new NFS requests are received, the NFS request falls off the queue and a UDP socket overflow occurs.

Network Monitoring — netstat -i Output

# netstat -i
Name Mtu   Network        Address                    Ipkts    Ierrs  Opkts    Oerrs  Coll
lan0 1500  156.153.208.0  r265c75.cup.edunet.hp.com  4546682  0      4138618  0      0
lo0  4136  loopback       localhost                  1178171  0      1178171  0      0

# netstat -p udp
udp:
        0 incomplete headers
        0 bad data length fields    (Deleted from later versions)
        0 bad checksums
        0 socket overflows
        0 data discards             (Deleted from later versions)
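Socket overflows lend themselves to automated checking. A sketch: here a sample report is inlined as a here-document (with made-up numbers), but on a live system you would pipe netstat -p udp in directly:

```shell
# Flag any UDP socket overflows (an indicator of too few nfsd daemons).
sample_report() {
cat <<'EOF'
udp:
        0 incomplete headers
        0 bad checksums
        4 socket overflows
EOF
}
OVERFLOWS=$(sample_report | awk '/socket overflows/ { print $1 }')
if [ "$OVERFLOWS" -gt 0 ]; then
    echo "UDP socket overflows: $OVERFLOWS -- consider more nfsd daemons"
fi
```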


11–14. SLIDE: glance — NFS Report

Student Notes

The glance NFS report (the n key) monitors total inbound requests for NFS servers and total outbound requests for NFS clients.

For NFS servers, the total number of inbound read/write requests received from each client is shown, along with the average amount of time for the server to service each request. For NFS client systems, the total number of outbound read/write requests sent to each NFS server is shown, along with the average amount of time, from the client perspective, for the requests to be serviced.

For a detailed inspection of the types of NFS requests being sent (client) or received (server), the specific client or server can be selected with the S key.

glance — NFS Report

B3692A GlancePlus B.10.12     10:47:57   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util  |100%    100%   100%
Disk Util | 83%     22%    84%
Mem Util  | 94%     95%    96%
Swap Util | 21%     21%    22%
--------------------------------------------------------------------------------
                        NFS BY SYSTEM                  Users=   13
                    Server (inbound)           Client (outbound)
Idx System      ReadRt WriteRt  SvcTm     ReadRt WriteRt  SvcTm  NetwkTm
--------------------------------------------------------------------------------
  1 e2403roc       0.0     0.0   0.00        0.0     0.0   0.00     0.00
  2 e2403sto       0.0     0.0   0.00        0.0     0.0   0.00     0.00
  3 e2403alf       0.0     0.0   0.00        0.0     0.0   0.00     0.00

S - Select a System    C - cum/interval toggle               Page 1 of 1



11–15. SLIDE: glance — NFS System Report

Student Notes

The glance NFS system report (the N key) displays the activity of NFS packets being received by an NFS server, or being sent as an NFS client. If a system is both a client and a server, separate columns are maintained for each. The fields of most interest in this report are the read and write rates, as these typically put the greatest load on a system.

Note that this is page one of three. On the following two pages, the individual RPCs are broken down by type and counted. There are Version 2 and Version 3 counts to accommodate earlier and later versions of NFS.

glance — NFS System Report

B3692A GlancePlus B.10.12     10:45:26   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util  |100%    100%   100%
Disk Util | 83%     22%    84%
Mem Util  | 94%     95%    96%
Swap Util | 21%     21%    22%
--------------------------------------------------------------------------------
NFS OPERATIONS for: e2403sto   Address = 15.19.83.75   PID = 1275
                     NFS GLOBAL ACTIVITY               Users=    1
                    Server (inbound)           Client (outbound)
                    Current     Cum            Current     Cum
--------------------------------------------------------------------------------
Read Rate               0.0     0.0                0.0     0.0
Write Rate              0.0     0.0                0.0     0.0
Read Byte Rate          0.0     0.0                0.0     0.0
Write Byte Rate         0.0     0.0                0.0     0.0
NFS Call Count            0       0                  0       0
Bad Call Count            0       0                  0       0
Service Time           0.00    0.00               0.00    0.00
Network Time             na      na               0.00    0.00
Read/Write Qlen          na      na                  0       0
Idle biods               na      na                 16      na

                                                          Page 1 of 3



11–16. SLIDE: glance — Network by Interface Report

Student Notes

The glance Network by Interface report (the l key) displays the activity of inbound and outbound packets. The fields of most interest in this report are the inbound and outbound packet rates, as well as the KB transferred in and out by each network card. The lo0 interface is the internal loopback interface, used for diagnostics.

B3692A GlancePlus B.10.12     10:47:57   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util  |100%    100%   100%
Disk Util | 83%     22%    84%
Mem Util  | 94%     95%    96%
Swap Util | 21%     21%    22%
--------------------------------------------------------------------------------
Interval: 5          NETWORK BY INTERFACE              Users=    2
    Network          In Packet   Out Packet      In KB       Out KB
Idx Interface  Type     Rate        Rate          Rate        Rate
--------------------------------------------------------------------------------
  1 lan0       Lan   30.9/ 96.6  31.1/ 98.3   2.3/ 23.1   2.1/ 45.1
  2 lo0        Loop    na/   na    na/   na    na/   na    na/   na

S - Select an Interface                                   Page 1 of 1


glance — Network by Interface Report


11–17. SLIDE: Tuning NFS

Student Notes There are a number of NFS tuning solutions that can help to improve performance on NFS servers:

• Tune number of nfsd daemons: The default number of nfsd daemons in HP-UX 11.00 and earlier was four. This most likely is too small. The best recommendation for performance is to have two nfsd daemons for each simultaneous disk operation that can be performed. This allows one request to be received while another is awaiting disk service. For example, on a system with four SCSI controllers and NFS-exported file systems spanning disks on these controllers, schedule eight nfsd daemons. In 11i, the default number of nfsd daemons was raised to 16, which is a more reasonable number.

  The best indicator of too few nfsd daemons is UDP socket overflows. Increase the number of nfsd daemons if even one UDP socket overflow occurs. The size of the UDP socket queue can be viewed with the netstat -an | grep udp | grep 2049 command. Another indicator of too few nfsd server daemons is a high total of badxids being returned to NFS clients. Remember, only UDP requires the number of nfsds to be tuned; TCP uses multiple threads in the same daemon.
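The two-nfsds-per-disk-operation rule reduces to simple arithmetic. A sketch, assuming (as in the example above) one simultaneous disk operation per SCSI controller backing exported file systems; the controller count is a placeholder for your own configuration:

```shell
# Two nfsd daemons per simultaneous disk operation; one operation
# per SCSI controller backing exported file systems is assumed here.
CONTROLLERS=4
NFSDS=$(( CONTROLLERS * 2 ))
echo "suggested nfsd count: $NFSDS"      # prints: suggested nfsd count: 8
```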

Tuning NFS

• Tune number of nfsd daemons

• Turn on “sticky bit” for exported executables

• Export file systems with asynchronous write option

• Avoid using symbolic links on exported file systems

• Tune number of biod daemons

• Tune mount options when mounting NFS file system:
  – rsize and wsize options
  – retry and timeout options


• Turn on “sticky bit” for exported executables: By default, text segments are not paged to swap, as their pages already exist on the file system. In the case of an executable program being loaded from an NFS server across the network to an NFS client, it is desirable to page the text locally, rather than return to the NFS server when the text page is needed again. This behavior can be achieved by setting the sticky bit to ON for the executable program. Below is an example of setting the sticky bit to ON for an executable:

# chmod 1555 prgm
# ls -l prgm
-r-xr-xr-t   1 root   bin   411089 Feb  3  1997 /opt/PGMS/bin/prgm

This also requires modifying the following tunable kernel parameter on the client:

page_text_to_local = 1
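A quick way to confirm the bit is set is to test the mode string. A minimal sketch, using a throwaway file in place of the real executable:

```shell
# Create a scratch file, set the sticky bit as shown above, and verify it.
PRGM=/tmp/sticky_demo.$$
: > "$PRGM"
chmod 1555 "$PRGM"
STICKY=$(ls -l "$PRGM" | cut -c10)   # 10th mode character is 't' when set
rm -f "$PRGM"
[ "$STICKY" = "t" ] && echo "sticky bit set"
```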

There are a number of NFS tuning solutions that can help to improve performance on NFS clients:

• Tune number of biod daemons: The default number of biod daemons in HP-UX 11.00 and earlier was four. This most likely is too small. The best recommendation is to have a minimum of two biod daemons for every client process performing I/O to and from the NFS file system. Each biod daemon has, at most, one NFS request outstanding at any time, so as the number of biod daemons increases, the client can send more requests. If the client has x processes performing file system I/O and y biod daemons, then the client could have x+y RPC requests outstanding at one time: one for each of the biod daemons, and one for each of the client processes. In 11i, the default number of biod daemons was raised to 16, which is a more reasonable number. The best indicator of too few biod daemons is the number of waits shown in the nfsstat -c command.
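The x+y arithmetic above in shell form; the process count is a placeholder, and the biod count is the 11i default:

```shell
# Maximum outstanding RPC requests: one per biod plus one per I/O process.
IO_PROCS=4      # placeholder: processes doing NFS file system I/O
BIODS=16        # 11i default biod count
MAX_OUT=$(( IO_PROCS + BIODS ))
echo "max outstanding RPC requests: $MAX_OUT"
```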

• Tune the NFS Mount Options: There are a number of NFS mount options that can affect client performance, among them the NFS read and write buffer sizes. The NFS buffer size (specified with the rsize and wsize mount options) determines the increment in which data is transferred to and from the NFS file system. For example, if the file system block size is 8192 bytes and the NFS buffer size is 8500 bytes, two file system I/Os would be required before any NFS packet could be sent. The recommendation is to match the NFS buffer size to the file system block size. The default NFS buffer size is 8192 bytes, and this matches the default file system block size on HFS. For JFS, try to match the buffer size to the size of a typical extent.
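As an illustration only (the server name and mount point below are placeholders), a client-side mount matching an 8-KB block size might look like:

```shell
# Hypothetical NFS mount with explicit read/write buffer sizes.
mount -o rsize=8192,wsize=8192 server_hostname:/vxfs /vxfs
```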


11–18. SLIDE: Tuning the Network

Student Notes

Subnetting a network is an effective way to reduce congestion on a LAN. Using routers (as compared to bridges, Ethernet switches, and Ethernet-to-FDDI concentrators) provides a great deal of flexibility in the form of security, network segmentation, and routing of diverse types of packets. Routers usually provide good throughput and performance at a relatively low cost. Using an existing computer system as a gateway for traffic between NFS clients and the file server is often inefficient and limits the performance of the NFS clients.

By making sure the maximum transmission unit (MTU) is the same on the client system, the file server, and all routers in between them, the overhead on the routers caused by packet fragmentation and re-assembly can be avoided.
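An MTU consistency check can be sketched as below. A sample netstat -i report is inlined as a here-document with placeholder hosts; on a live system you would pipe the real output in from each machine on the path:

```shell
# Report whether all configured interfaces agree on one MTU.
sample_netstat_i() {
cat <<'EOF'
Name Mtu  Network      Address    Ipkts Ierrs Opkts Oerrs Coll
lan0 1500 156.153.192  pr1w1      100   0     90    0     0
lan1 1500 156.153.193  pr1w2      100   0     90    0     0
EOF
}
DISTINCT=$(sample_netstat_i | awk 'NR > 1 { print $2 }' | sort -u | wc -l | tr -d ' ')
if [ "$DISTINCT" -eq 1 ]; then
    echo "MTUs match: no fragmentation expected"
else
    echo "MTU mismatch: routers may fragment and reassemble packets"
fi
```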

Tuning the Network

Ways to Reduce LAN Congestions

• Subnet the network
• Use routers, not a computer for IP gateway
• Use higher speed LAN technology:
  – Fast Ethernet, FDDI, ATM
• Increase the number of LAN adapters on the server
• Put the server on an FDDI network and use routers to segment client traffic
• Put the server on an FDDI network and use switches to fan out to clients


Example

# netstat -i
Name  Mtu   Network      Address    Ipkts    Ierrs  Opkts    Oerrs  Coll
ni0*  0     none         none       0        0      0        0      0
ni1*  0     none         none       0        0      0        0      0
lo0   4608  loopback     localhost  6055     0      6055     0      0
lan0  1500  156.153.192  pr1w1      3724729  0      1705240  10     34739
lan1* 1500  none         none       0        0      0        0      0

The route command (-p option) can be used to set the Path Mtu size for a host route only.


11–19. SLIDE: Tuning the Network (Continued)

Student Notes

If the average client demand on an NFS server is measured to be greater than the network bandwidth, the network itself is the bottleneck. Assuming 100 clients each demand 10 NFS requests per second, a single 10-Mbit Ethernet segment (with a calculated maximum of 360 packets per second) could not handle this workload, even though the server itself may be able to (from a processing standpoint). To allow this client workload to be processed by the single NFS server, the following network configurations can be implemented:

1. Use at least three network interface cards, one for each segment, distributing 33-34 clients on a segment.

2. Use one or more high-speed network connections, which connect to multiple lower bandwidth LAN segments.

In the first example, we have added multiple LAN interfaces to our NFS server. In the second example, we have a 100-Mbit/second FDDI card on the NFS file server. We also have a router on the same segment as the server that has an FDDI interface, as well as

Tuning the Network (Continued)

[Slide diagram: three alternatives are illustrated: (1) add LAN interfaces to the server; (2) add subnets, with a router linking a 100 Mb/s backbone to several 10 Mb/s segments; (3) use Ethernet switches, with a 100 Mb/s switch fanning out to 10 Mb/s client segments.]


several regular 10-Mbit/second Ethernet interfaces. Here, the issue of the router's ability to do packet fragmentation and reassembly efficiently may become important. In our last example, we have a 100-Mbit/second FDDI card on the NFS file server and a 100-Mbit/second translating Ethernet switch on the same FDDI segment. Since this is not routed, the file server and clients share the same subnet address. There are many other possible network topologies.
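The sizing argument in these student notes reduces to a few lines of arithmetic; the demand and per-segment figures are the assumed ones from the text:

```shell
# How many 10-Mbit segments does the assumed client demand require?
CLIENTS=100
REQS_PER_CLIENT=10
SEGMENT_MAX=360                     # assumed packets/sec for one segment
DEMAND=$(( CLIENTS * REQS_PER_CLIENT ))
SEGMENTS=$(( (DEMAND + SEGMENT_MAX - 1) / SEGMENT_MAX ))   # round up
echo "demand: $DEMAND req/s needs $SEGMENTS segments"
```

With these assumptions the answer is three segments, matching the "at least three network interface cards" recommendation above.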


11–20. LAB: Network Performance

Directions

The following two labs investigate network read and write performance. The labs use NFS and are performed against the JFS file system created in the JFS module.

Lab 1 Network Read Performance

To perform this lab, two systems are needed: an NFS server and an NFS client. Pair up with another student in the class for this lab.

1. Make sure the JFS file system on the NFS server contains the make_files program. Execute the make_files program to create files for the client to access.

# mount /dev/vg00/vxfs /vxfs
# cp /home/h4262*/disk/lab1/make_files /vxfs
# cd /vxfs
# ./make_files

2. Export the JFS file system so the client can mount it.

# exportfs -i -o root=client_hostname /vxfs
# exportfs

3. From the client system, mount the NFS file system.

# mount server_hostname:/vxfs /vxfs

4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

Moral: Try to have a buffer cache on the client system big enough for a lot of data to be cached. Also, biod daemons help by prefetching data.


6. Test to see if fewer biod daemons will change the initial performance.

# cd /
# umount /vxfs
# kill $(ps -e | grep biod | cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

7. Once finished, remove the files and unmount the file system.

# rm /vxfs/file*
# umount /vxfs


Lab 2 Network Write Performance

The following lab has the client perform many writes to an NFS file system. The following parameters will be investigated:

• Number of biod daemons

• NFS version 2 versus NFS version 3

• TCP versus UDP

During this lab, the monitoring tools shown below should be used on the client and server:

CLIENT                                 SERVER
# nfsstat -c                           # nfsstat -s
# glance NFS report (n key)            # glance NFS report (n key)
# glance Global Process (g key)        # glance Global Process (g key)
  - monitor biod daemons                 - monitor nfsd daemons
                                       # glance Disk report (d key)
                                         - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

# mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

# kill $(ps -e |grep biod|cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. (The first command below reads the file into the local buffer cache first, so the timed copy is not slowed by reading it from disk.) Record the results:

# cat /stand/vmunix > /dev/null
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________

4. Now, start up the biod daemons, and retry timing the copy. Record the results:

# /usr/sbin/biod 4
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________


5. Change the mount options to version 3 and retime the transfer:

# cd /
# umount /vxfs
# mount -o vers=3 server_hostname:/vxfs /vxfs
# cd /
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

# ftp server_hostname
ftp> put /stand/vmunix /vxfs/vmunix.ftp

How long did the FTP transfer take? _________

Explain the difference in performance.

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

# umount /vxfs
# mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

Perform the copy test again and compare the results with the TCP version 3 mount data recorded in step 5. Is UDP quicker than TCP?

# timex cp /stand/vmunix /vxfs
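To make the UDP/TCP comparison concrete, each recorded real time can be converted into throughput. A minimal sketch with hypothetical figures (a 20480 KB vmunix and a 4-second real time; substitute your own measurements):

```shell
# Hypothetical figures: replace size_kb and real_s with your measurements.
size_kb=20480   # size of /stand/vmunix in KB (assumed)
real_s=4        # "real" seconds reported by timex (assumed)
echo "throughput: $((size_kb / real_s)) KB/s"
```

Comparing KB/s rather than raw seconds makes it easier to line up runs against the different file sizes used elsewhere in the lab.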


Module 12 Tunable Kernel Parameters

Objectives

Upon completion of this module, you will be able to do the following:

• Identify which tunable parameters belong to which category

• Identify tunable kernel parameters that could impact performance

• Tune both static and dynamic tunable parameters


12–1. SLIDE: Kernel Parameter Classes

Student Notes

There are a number of tunable parameters within the kernel that can have a big impact on performance. Changing these parameters may require that a new kernel be compiled. As of 11i v1, about 12 parameters were converted to dynamically tunable parameters; that is, their values could be changed without rebuilding the kernel and without rebooting the system. As of 11i v2, there are around 36 dynamically tunable parameters, plus a few traditional parameters that are now tuned automatically by the kernel, so no manual tuning of them needs to be done at all.

Static kernel parameters have been around since UNIX was first designed. To change one of these parameters, it was necessary to alter the contents of a system configuration file (system), rebuild the kernel using this altered configuration file, move the new kernel into place, and reboot the system to activate the new kernel. This tended to be time consuming and forced the system to become unavailable for a time.

More recently, with HP-UX 11i v1, a few kernel parameters were converted to dynamic tuning. These parameters could be altered, using SAM or kmtune, and the changes would become effective immediately; there was no longer a need to rebuild the kernel or reboot the system. However, this applied only to those few kernel parameters. The vast majority of kernel parameters were still static.

Kernel Parameter Classes

• Static – requires a kernel rebuild and a reboot

• Dynamic – changes take place immediately; changes survive a reboot

• Automatic – constantly being tuned by the kernel; can be set manually to a fixed value

The dozen parameters that were made dynamically tunable were ones that tended to be tuned by system administrators more frequently, but were relatively easy to convert to dynamic. With HP-UX 11i v2, several more parameters were converted to dynamic tuning. These parameters were also tuned fairly frequently by system administrators, but were more difficult to convert to dynamic. At the same time, a new class of parameters was introduced – automatic. These parameters are tuned by the kernel, constantly, in response to changing conditions in the system. However, the system administrator can override the automatic handling by the kernel and force the parameter to some fixed value, if needed. At HP-UX 11i v1, the following kernel parameters became dynamic:

core_addshmem_read core_addshmem_write maxfiles_lim maxtsiz maxtsiz_64bit maxuprc msgmax msgmnb scsi_max_qdepth semmsl shmmax shmseg

At HP-UX 11i v2, the following additional kernel parameters became dynamic:

aio_listio_max aio_max_ops aio_monitor_run_sec aio_prio_delta_max aio_proc_thread_pct aio_proc_threads aio_req_per_thread alloc_fs_swapmap alwaysdump dbc_max_pct dbc_min_pct dontdump fs_symlinks ksi_alloc_max max_acct_file_size max_thread_proc maxdsiz maxdsiz_64bit maxssiz maxssiz_64bit nfile nflocks nkthread


nproc nsysmap nsysmap64 physical_io_buffers shmmni vxfs_ifree_timelag

Also at HP-UX 11i v2, the following kernel parameters are obsolete or automatic:

bootspinlocks clicreservedmem maxswapchunks maxusers mesg ncallout netisr_priority nni ndilbuffers sema semmap shmem spread_UP_drivers


12–2. SLIDE: Tuning the Kernel

Student Notes Some general rules and notes regarding tuning and recompiling the kernel:

• View the existing, tunable parameters with the kctune command (HP-UX 11i v2), the kmtune command (HP-UX 11.00 and 11i v1) or the sysdef or system_prep commands (HP-UX 10.x). You can also use SAM with any version of HP-UX to view the current values. Examples of outputs are shown below.

• Use the System Administration Manager (SAM) to tune the kernel parameters and rebuild the kernel. SAM has the advantage of displaying all available tunable parameters, their current values, and a range of acceptable values. SAM also knows which parameters can be tuned dynamically and will make changes to them immediately. As of HP-UX 11i v2, SAM calls a separate utility to do the actual tuning.

• When tuning performance by modifying kernel parameters, modify only one value with each kernel rebuild. By changing several parameters at once, you may cloud the picture and make it much more difficult to determine what helped and what hurt the system’s performance.

Tuning the Kernel

• Use system_prep, kmtune, or kctune to view current values of tunable kernel parameters.

• Use SAM (or new km/kc commands) to tune kernel parameters.

• Tune only one parameter at a time.

• Do not make parameters unnecessarily large.

• Use glance to monitor system table sizes (ensure highest value is not equal to total table size).

• Some kernel parameters are dynamic (no reboot); see kmtune and kctune.


• Avoid setting the tunable parameters too large. Many of the parameters create in-core memory data structures whose size depends on the value of the tunable parameter (for example, nproc determines the size of the process table). Generally, it is a good rule of thumb to increase or decrease a parameter by no more than 20% at a time while trying to find the best setting for it. Of course, if you are changing a parameter's value to accommodate some new application you are installing, always follow the manufacturer's suggested changes.

• Use glance to monitor system table sizes. Ensure the system tables are not running out of entries. In general, there should be around 20% of unused entries in any table. This will ensure that you have enough entries to handle any high demand periods.
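The 20% headroom guideline can be turned into a concrete alert threshold. A minimal sketch, assuming nfile is 16384 (an assumed boot-time value; substitute your system's actual table size):

```shell
# Sketch of the ~20% headroom rule: start investigating when usage of a
# system table crosses 80% of its configured size.
nfile=16384                       # assumed table size
threshold=$(( nfile * 80 / 100 ))
echo "investigate when open files exceed ${threshold} of ${nfile}"
```

The same arithmetic applies to the process and inode tables monitored in glance.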

The step-by-step procedure for tuning and recompiling the kernel manually on HP-UX 11.X is shown below:

1. Log in as superuser.

2. Change directory.

cd /stand/build

3. Create a system file from your current kernel.

/usr/lbin/sysadm/system_prep -v -s system

4. Modify the /stand/build/system file as desired.

5. Build the kernel:

/usr/sbin/mk_kernel -s system

6. Save your old system and kernel files, just in case you want to go back.

cp /stand/system /stand/system.prev
cp /stand/vmunix /stand/vmunix.prev
cp -r /stand/dlkm /stand/dlkm_vmunix.prev

7. Schedule the kernel update on the next reboot.

kmupdate

8. Shut down and reboot from your new kernel.

/sbin/shutdown -ry 0

Understanding Dynamic Kernel Variables

kctune(1M), kmtune(1M) or sam can be used “on the fly” to modify some kernel variables. Any changes take place immediately without the need to reboot. In HP-UX 11i v2, kmtune still exists, but simply calls kctune.


Example using kmtune to set and then activate a new value for a dynamic kernel variable.

# kmtune -q shmseg
Parameter           Current  Dyn  Planned       Module       Version
=====================================================
shmseg              120      Y    120
# kmtune -s shmseg=155
# kmtune -l -q shmseg
Parameter:      shmseg
Current:        120
Planned:        155
Default:        120
Minimum:        -
Module:         -
Version:        -
Dynamic:        Yes
# kmtune -u shmseg
shmseg has been set to 155 (0x9b).
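Since kmtune simply calls kctune on 11i v2, the same session can use kctune directly. A sketch with hypothetical values (kctune is HP-UX-only, so the commands are shown as comments):

```shell
# Equivalent kctune session on HP-UX 11i v2 (hypothetical tunable values):
#   kctune shmseg            # query the current and planned value
#   kctune shmseg=155        # set a new value; a dynamic tunable applies at once
echo "on 11i v2, kctune changes dynamic tunables with no rebuild or reboot"
```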


12–3. SLIDE: Kernel Parameter Categories

Student Notes The next few slides will present the tunable kernel parameters in these categories.

Kernel Parameter Categories

• File system
• Message queues
• Semaphores
• Shared memory
• Process
• Swap
• LVM
• Networking
• Miscellaneous


12–4. SLIDE: File System Kernel Parameters

Student Notes

dbc_min_pct    dbc_min_pct specifies the minimum size that the system's buffer cache may shrink to, as a percentage of physical memory. It is now dynamic in 11i v2.

dbc_max_pct dbc_max_pct specifies the maximum size that the system's buffer cache may grow to as a percentage of physical memory. It is now dynamic in 11i v2.

nbuf nbuf is used to specify the number of file system buffer cache headers. Set nbuf to zero if you want to use the system's ability to grow and shrink this important table dynamically, based on demand. It is not yet obsolete, but expect it to be so in a future release.

bufpages bufpages specifies the number of 4-KB pages in memory that will be allocated for the file system buffer cache. Like nbuf, this parameter should be set to zero if you want to use the dynamic form of buffer cache allocation. If this value is non-zero, enough nbufs (one for every two bufpages) will be created as well, unless otherwise specified. It is not yet obsolete, but expect it to be so in a future release.

File System Kernel Parameters

Kernel Parameter   Default   Description
dbc_min_pct        5         Minimum size of dynamic buffer cache (DBC)
dbc_max_pct        50        Maximum size of dynamic buffer cache (DBC)
nbuf               0         Number of buffer headers (in 10.x and above, use DBC)
bufpages           0         Number of 4-KB buffer pages (in 10.x and above, use DBC)
fs_async           0         If on (1), forces all meta-data writes to disk to be asynchronous
maxfiles           60        Soft limit to the number of files a process can have open
maxfiles_lim       1024      Hard limit to the number of files a process can have open
nfile              formula   Size of file table in memory
ninode             formula   Size of inode table in memory
nflocks            200       Size of file-lock table in memory
vx_ncsize          1024      Size of VxFS directory name lookup cache (DNLC)


fs_async fs_async specifies that file system data structures may be posted to disk asynchronously. While this can speed file system performance for some applications, it increases the risk that a file system will be corrupted in the event of system power loss.

maxfiles maxfiles specifies the soft limit to the number of files that a single program may have open at one time. A program may exceed this soft limit up to the value of maxfiles_lim. In 11i v2, maxfiles is computed at boot and is set to 512, if memory is less than 1 GB. Otherwise it’s set to 2048.

maxfiles_lim maxfiles_lim is the hard limit to the number of files that a single program can open up at one time. This parameter was made dynamic in 11i v1 and the default value was set to 4096.

nfile nfile is the size of the file table in memory, and therefore defines the maximum number of files that may be open at any one time on the system. Every process uses at least three file descriptors. Be generous with this number, as the required memory is minimal. nfile depends on the parameters nproc, maxusers, and npty. This parameter was made dynamic in 11i v2 and was no longer dependent on maxusers. Its value is computed at boot time and is set to 16384 if memory is less than 1 GB; otherwise it’s set to 65536.

ninode ninode is the size of the HFS in-core inode table. By caching inodes in memory the amount of physical I/O is decreased when accessing files. Each unique HFS file open on the system has a unique inode. This table is hashed for performance. At boot time in 11i v2, it’s set to 4880, if memory is less than 1GB; otherwise it’s set to 8196.

nflocks nflocks is the number of file locks available on the system. File locks are a kernel service to enable applications to safely share files. Databases or other applications that make use of the lockf() system call can be large consumers of file locks. Note that one file may have several locks associated with it. This parameter was made dynamic in 11i v2 - at boot time, if memory is less than 1 GB, it’s set to 1200; otherwise it’s set to 4096.

vx_ncsize    Along with ninode, this parameter controls the size of the DNLC (directory name lookup cache). Recent directory path names are stored in memory to improve performance. This parameter is set in bytes. It has been obsoleted in 11i v2; VxFS 3.5 now uses its own internal DNLC.


12–5. SLIDE: Message Queue Kernel Parameters

Student Notes

Message queues are used by applications to transfer a small to medium amount of information from exactly one process to another process. This information could be in the form of a structure, a string, a numerical value, or any combination thereof. SVIPC message queues have been around for a long time. They are controlled by a number of tunable kernel parameters.

mesg mesg when set (mesg = 1) enables the message queue services in the kernel. This parameter is obsolete as of 11i v2.

msgmap msgmap specifies the size of the free-space map used in allocating message buffer segments for messages.

msgmax msgmax specifies the maximum size in bytes of an individual message. This parameter is dynamic at HP-UX 11i v1.

msgmnb msgmnb specifies the maximum total space consumed by all messages in a queue. This parameter is dynamic at HP-UX 11i v1.

msgmni    msgmni specifies the maximum number of message queue identifiers allowed on the system at one time. Each message queue has an associated message queue identifier stored in non-swappable kernel memory. In 11i v2, the default was raised to 512.

Message Queue Kernel Parameters

Kernel Parameter   Default   Description
mesg               1         Enable or disable IPC messaging (700 only)
msgmap             formula   Size of message free-space map
msgmax             8192      Maximum size in bytes of an individual message
msgmnb             16384     Maximum size in bytes of message queue space
msgmni             50        Maximum number of message queue identifiers
msgseg             2048      Number of segments in the system message buffer
msgssz             8         Size in bytes of segments to be allocated for messages
msgtql             40        Size of message header space (1 header per message)

msgseg msgseg is the number of segments in the system-wide message buffer. In 11i v2, the default was raised to 8192.

msgssz msgssz is the size in bytes of each message buffer segment. In 11i v2, the default was raised to 96.

msgtql    msgtql is the total number of messages that can reside on the system at any one time. In 11i v2, the default was raised to 1024.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough message queue resources available when needed. However, the msgssz and msgseg parameters also control the size of an in-memory message buffer that is shared by all SVIPC message queues. It needs to be large enough to handle all the messages that may be pending at any one time but, by the same token, should not be much larger than that, or it will take up far more memory than is necessary. The buffer is not dynamic; it is fixed in size. POSIX message queues also exist in HP-UX 11.x. There are no tunable parameters for them. POSIX message queues have been shown to consistently outperform SVIPC message queues.
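The size of that fixed, shared buffer can be computed directly from the two tunables. A minimal sketch using the default values listed above (msgssz=8, msgseg=2048):

```shell
# Fixed SVIPC message buffer size = msgssz * msgseg (bytes).
msgssz=8        # default segment size in bytes
msgseg=2048     # default number of segments
echo "message buffer: $((msgssz * msgseg)) bytes"
```

The same arithmetic with the 11i v2 defaults (msgssz=96, msgseg=8192) gives 786432 bytes, which shows how much more memory the raised defaults reserve.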


12–6. SLIDE: Semaphore Kernel Parameters

Student Notes

Semaphores are another form of interprocess communication. Semaphores are used mainly to keep processes properly synchronized and to prevent collisions when accessing shared data structures. A process typically increments or decrements a semaphore to block other processes while it performs a critical operation or uses a shared resource. When finished, it reverses the operation, allowing blocked processes to access the resource. Semaphores can be configured as binary semaphores, with only two values (0 and 1), or they can serve as general semaphores (counters), where one process increments/decrements the semaphore and one or more cooperating processes decrement/increment it. SVIPC semaphores have been around for a long time. They are controlled by several tunable parameters.

sema    sema (Series 700 only) enables or disables IPC semaphores at system boot time. This parameter is obsolete as of 11i v2.

semaem semaem is the maximum value by which a semaphore can be changed in a semaphore undo operation.

Semaphore Kernel Parameters

Kernel Parameter   Default   Description
sema               1         Enable or disable semaphore code (700 only)
semaem             16384     Maximum amount a semaphore can be changed by undo
semmap             formula   Size of free-space map used for allocating new semaphores
semmni             64        Maximum number of sets of semaphores
semmns             128       Maximum number of semaphores, system-wide
semmnu             30        Maximum number of processes that can have undo operations pending on a given semaphore
semume             10        Maximum number of semaphores that a given process can have undo operations pending on
semvmx             32767     Maximum value a semaphore is allowed to reach
semmsl             2048      Maximum number of semaphores in a given set


semmap    semmap is the size of the free-semaphores resource map for allocating requested sets of semaphores. This parameter is obsolete as of 11i v2.

semmni    semmni is the maximum number of sets of IPC semaphores allowed on the system at any given time. In 11i v2, the default was raised to 2048.

semmns semmns is the total system-wide number of individual IPC semaphores available to system users. In 11i v2, the default was raised to 4096.

semmnu semmnu is the maximum number of processes that can have undo operations pending on any given IPC semaphore on the system. In 11i v2, the default was raised to 256.

semume semume is the maximum number of IPC semaphores on which a given process can have undo operations pending. In 11i v2, the default was raised to 100.

semvmx    semvmx is the maximum value any given IPC semaphore is allowed to reach; it prevents undetected overflow conditions.

semmsl    Until 11i v2, semmsl was an untunable value in the kernel. It specifies the maximum number of semaphores that can be allocated to a specific semaphore set. In 10.x it was set to 500; in 11.00, to 2048. It is now a dynamic tunable.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough semaphore resources available when needed. POSIX semaphores also exist in HP-UX 11.x. There are no tunable parameters for them. POSIX semaphores have been shown to consistently outperform SVIPC semaphores.


12–7. SLIDE: Shared Memory Kernel Parameters

Student Notes

Shared memory is reserved memory space for storing data shared between or among cooperating processes. Sharing a common memory space eliminates the need to copy or move data to a separate location before it can be used by other processes, reducing processor time and overhead as well as memory consumption. Shared memory is allocated in swappable, shared memory space; the data structures for managing it are located in the kernel. Shared memory segments are much preferred by memory-intensive applications, such as databases, since they can be very large and can be accessed without using system calls. SVIPC shared memory uses the following tunable parameters.

shmem    shmem, when set to true, enables the shared memory subsystem at boot time. This parameter is obsolete in 11i v2.

shmmax    shmmax specifies the maximum shared memory segment size. Dynamic in 11i v1. In 11i v2, the default was raised to 1 GB.

shmmni    shmmni specifies the maximum number of shared memory segments allowed on the system at any one time. Dynamic in 11i v2, where the default was also raised to 400.

Shared Memory Kernel Parameters

Kernel Parameter   Default   Description
shmem              1         Enable or disable shared memory (700 only)
shmmax             64 MB     Maximum shared memory segment size
shmmni             200       Maximum number of total shared memory segments
shmseg             120       Maximum number of shared memory segments that a single process may attach


shmseg    shmseg specifies the maximum number of shared memory segments that can be simultaneously attached (shmat()) to a single process. Dynamic in 11i v1. In 11i v2, the default was raised to 300.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough shared memory resources available when needed. POSIX shared memory also exists in HP-UX 11.x. There are no tunable parameters for it. POSIX shared memory segments are implemented through the memory-mapped file architecture, so they can be affected by some of the file system tunable parameters described earlier.


12–8. SLIDE: Process-Related Kernel Parameters

Student Notes

These parameters manage the number of processes on the system and per user, keeping system resources effectively distributed among users for optimal overall system operation; the allocation of CPU time to competing processes at equal and different priority levels; and the allocation of virtual memory among processes, protecting the system and competing users against unreasonable demands from abusive or runaway processes.

maxdsiz    maxdsiz defines the maximum size of the static data storage segment of an executing 32-bit process. In 11i v2, this default has been raised to 1 GB.

maxdsiz_64bit    maxdsiz_64bit defines the maximum size of the static data storage segment of an executing 64-bit process. In 11i v2, this default has been raised to 4 GB.

maxssiz    maxssiz defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 32-bit process.

Process-Related Kernel Parameters

Kernel Parameter              Default   Description
maxdsiz, maxdsiz_64bit        256 MB    Maximum 32- and 64-bit process data segment size
maxssiz, maxssiz_64bit        8 MB      Maximum 32- and 64-bit process stack size
maxtsiz, maxtsiz_64bit        64 MB     Maximum 32- and 64-bit process text segment size
maxressiz, maxressiz_64bit    8 MB      Maximum 32- and 64-bit process RSE stack size (IA-64 only)
maxuprc                       50        Maximum number of concurrent processes per user ID
nproc                         formula   Maximum number of processes system-wide
timeslice                     8         Maximum time a process can have the CPU before yielding to the next thread at the same priority; set in "ticks" (10 ms)


maxssiz_64bit    maxssiz_64bit defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 64-bit process. In 11i v2, this default has been raised to 256 MB.

maxtsiz    maxtsiz defines the maximum size of the shared text segment (program storage space) of an executing process. Note maxtsiz_64bit for 64-bit HP-UX 11.

maxressiz    maxressiz defines the maximum size of the register stack engine (RSE) stack segment of an executing 32-bit process. This parameter is found only on an IA-64 kernel.

maxressiz_64bit    maxressiz_64bit defines the maximum size of the register stack engine (RSE) stack segment of an executing 64-bit process. This parameter is found only on an IA-64 kernel.

maxuprc    maxuprc establishes the maximum number of simultaneous processes available to each user (identified by user ID) on the system. The superuser is immune to this limit. In 11i v2, this default is now set to 256.

nproc    nproc specifies the maximum total number of processes that can exist simultaneously in the system. This parameter was made dynamic in 11i v2, and the new default setting is 4200.

timeslice    The timeslice interval is the amount of time one thread is allowed to accumulate before the CPU is given to the next thread at the same priority. The value of timeslice is specified in units of (10-millisecond) clock ticks.
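Because timeslice is expressed in 10-ms clock ticks, converting a setting to wall-clock time is simple arithmetic. A sketch with the default value:

```shell
# timeslice is in 10-ms clock ticks; the default of 8 ticks is 80 ms.
timeslice=8
echo "timeslice: $((timeslice * 10)) ms"
```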


12–9. SLIDE: Memory-Related Kernel Parameters

Student Notes

Configurable kernel parameters for memory paging enforce operating rules and limits related to virtual memory (swap space).

vps_ceiling    This parameter is provided as a means to minimize lost cycle time caused by TLB (translation look-aside buffer) misses on systems using newer processors, such as the PA-8000 and the Itanium family, which have smaller TLBs and may not have a hardware TLB walker.

If a user application does not use the chatr command to specify a page size for program text and data segments, the kernel selects a page size that, based on system configuration and object size, appears to be suitable. This is called transparent selection.

vps_chatr_ceiling    User applications can use the chatr command to specify a page size for program text and data segments, providing some flexibility for improving overall performance, depending on system configuration and object size. The specified size is then compared to the page-size limit defined by vps_chatr_ceiling, which is set in the kernel at system boot time. If the value specified is larger than vps_chatr_ceiling, vps_chatr_ceiling is used.

Memory-Related Kernel Parameters

Kernel Parameter     Default   Description
vps_ceiling          16        Maximum automatic page size (KB) the kernel selects
vps_chatr_ceiling    1048576   Maximum page size (KB) usable with chatr
vps_pagesize         4         Default page size used without chatr specification
swapmem_on           1         Enable or disable pseudo-swap
nswapdev             10        Maximum number of device swap areas
nswapfs              10        Maximum number of file system swap areas
swchunk              2048      Size in DEV_BSIZE (1-KB) units of swap space chunks
maxswapchunks        256       Maximum number of swchunk units
page_text_to_local   0         Enable or disable process text being swapped locally
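The ceiling behaves as a simple clamp on the chatr-requested size. A sketch of that rule, with a hypothetical request larger than the default ceiling:

```shell
# If the chatr-requested page size exceeds vps_chatr_ceiling, the ceiling wins.
requested_kb=2097152            # hypothetical chatr request (2 GB pages)
ceiling_kb=1048576              # vps_chatr_ceiling default (KB)
effective_kb=$(( requested_kb > ceiling_kb ? ceiling_kb : requested_kb ))
echo "effective page size: ${effective_kb} KB"
```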

vps_page_size Specifies the default user-page size (in Kbytes) that is used by

the kernel if the user application does not use the chatr command to specify a page size.

swapmem_on swapmem_on enables or disables the creation of pseudo-swap, which is swap space designed to increase the apparent total swap space, so that real swap can be used completely, or large memory systems don’t need corresponding swap space.

nswapdev nswapdev specifies an integer value equal to the number of

physical disk devices that can be configured for device swap up to the maximum limit of 25.

nwapfs nswapfs specifies an integer value equal to the number of file

systems that can be made available for file-system swap, up to the maximum limit of 25.

swchunk  swchunk defines the chunk size for swap and must be an integer power of two. When the system needs swap space, one swap chunk is obtained from a device or file system. When that chunk has been used and another is needed, a new chunk is obtained. If the swap space is full, or if there is another swap space at the same priority, the new chunk is taken from a different device or file system, thus distributing swap use over several devices.

maxswapchunks  maxswapchunks specifies the maximum amount of configurable swap space on the system. In 11i v2 this parameter is obsolete.
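Together, swchunk and maxswapchunks bound the total configurable swap. The helper below is a hypothetical illustration that simply multiplies the tunables out and checks the power-of-two rule for swchunk noted above:

```python
def max_configurable_swap_bytes(swchunk_kb=2048, maxswapchunks=256):
    """Upper bound on configurable swap implied by the two tunables.

    swchunk is expressed in DEV_BSIZE (1-KB) units and must be an
    integer power of two; maxswapchunks caps how many chunks the
    system may ever hand out.
    """
    if swchunk_kb & (swchunk_kb - 1) != 0:
        raise ValueError("swchunk must be an integer power of two")
    return swchunk_kb * 1024 * maxswapchunks

# With the defaults in the table: 2048 KB per chunk * 256 chunks = 512 MB.
print(max_configurable_swap_bytes() // (1024 * 1024), "MB")
```

This also makes clear why maxswapchunks occasionally had to be raised on pre-11i v2 systems with large swap areas.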

page_text_to_local  page_text_to_local allows NFS clients to write the text segment to local swap and retrieve it later. This eliminates two separate text-segment data transfers to and from the NFS server, thus improving NFS client program performance. This parameter does not seem to be defined in 11i v2, even though it has not been identified as an obsolete parameter.


12–10. SLIDE: LVM-Related Kernel Parameters

Student Notes

Two configurable kernel parameters relate to kernel interaction with the logical volume manager.

maxvgs  maxvgs defines the maximum number of volume groups configured by the logical volume manager on the system.

no_lvm_disks  The no_lvm_disks flag notifies the kernel when no logical volumes exist on the system, i.e. LVM is disabled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.

LVM-Related Kernel Parameters

Kernel Parameter    Default    Description
maxvgs              10         Maximum number of volume groups on the system
no_lvm_disks        0          Enable or disable system use of LVM (0 = false, LVM exists; 1 = true, no LVM disks exist)


12–11. SLIDE: Networking-Related Kernel Parameters

Student Notes

Two configurable kernel parameters relate to the kernel's interaction with the networking subsystems:

netisr_priority  netisr_priority sets the real-time interrupt priority for the networking interrupt service routine daemon. By default, it is set to 1 on uniprocessor systems and 100 on multiprocessor systems. This parameter is obsolete in 11i v2.

netmemmax  netmemmax specifies how much memory is reserved for use by networking for holding partial Internet protocol (IP) messages, which are typically held in memory for up to 30 seconds. When messages are transmitted using IP, they are sometimes broken into multiple "partial" messages (fragments). netmemmax simply establishes a maximum amount of memory that can be used for storing network-message fragments until they are reassembled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.

Networking-Related Kernel Parameters

Kernel Parameter    Default       Description
netisr_priority     1             Priority to assign to the network packet processing daemon (-1 means handle on an interrupt basis, for the best packet processing performance)
netmemmax           10% of mem    Amount of memory, in bytes, to be allocated for the IP packet fragmentation reassembly queue


12–12. SLIDE: Miscellaneous Kernel Parameters

Student Notes

The following parameters are more or less unrelated.

create_fastlinks  When create_fastlinks is non-zero, it causes the system to create HFS symbolic links in a manner that reduces the number of disk-block accesses by one for each symbolic link in a pathname lookup.

default_disk_ir  default_disk_ir enables or disables immediate reporting. With immediate reporting ON, a write() system call to a disk drive that has a data cache returns as soon as the data is cached, rather than after the data is written to the media. This sometimes enhances write performance, especially for sequential transfers. In 11i v2, this parameter is set to 0 by default.

maxusers  maxusers does not itself determine the size of any structures in the system; instead, the default values of other global system parameters depend on the value of maxusers. When other configurable parameter values are defined in terms of maxusers, the kernel is made smaller and more efficient by minimizing wasted space due to improperly balanced resource allocations. In 11i v2, the use of maxusers has been eliminated from the formula of every parameter that was dependent on it; changing its value has no effect on 11i v2.

Miscellaneous Kernel Parameters

Kernel Parameter    Default    Description
create_fastlinks    0          Enable or disable creation of fast symbolic links
default_disk_ir     1          Enable or disable immediate reporting on all disks
maxusers            32         Maximum number of simultaneous users expected
ncallout            formula    Maximum number of timeouts (for example, alarms) pending
npty                60         Maximum number of concurrent pseudo-tty connections
rtsched_numpri      32         Number of distinct POSIX real-time priorities
unlockable_mem      0          Minimum amount of memory reserved for use by the paging system

ncallout ncallout specifies the maximum number of timeouts that can be scheduled by the kernel at any given time. A general rule is that one callout per process should be allowed unless you have processes that use multiple callouts. In 11i v2 this parameter is obsolete.

npty npty specifies the maximum number of pseudo-tty data structures available on the system.

rtsched_numpri rtsched_numpri specifies the number of distinct priorities that can be set for POSIX real-time processes running under the real-time scheduler.

unlockable_mem unlockable_mem defines the minimum amount of memory that always remains available for virtual memory management and system overhead.
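As a loose, non-HP-UX illustration of the real-time priority range that rtsched_numpri governs, a POSIX scheduling range can be queried from Python's standard library on systems that expose the sched_* interfaces. The bounds vary by operating system; this is a sketch, not HP-UX behavior:

```python
import os

# Query the priority range for the POSIX SCHED_FIFO real-time policy.
# On HP-UX the count of distinct real-time priorities is tuned with
# rtsched_numpri; here we simply ask the running kernel for its range.
lo = os.sched_get_priority_min(os.SCHED_FIFO)
hi = os.sched_get_priority_max(os.SCHED_FIFO)
print(f"SCHED_FIFO priorities: {lo}..{hi} ({hi - lo + 1} distinct)")
```

The printed count plays the same role rtsched_numpri plays on HP-UX: it is the number of distinct real-time priority levels the scheduler will honor.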


Module 13 Putting It All Together

Objectives

Upon completion of this module, you will be able to do the following:

• Identify and characterize some network performance problems.

• List some useful tools for measuring network performance problems and state how they might be applied.

• Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.


13–1. SLIDE: Review of Bottleneck Characteristics

Student Notes The above slide recaps the characteristics related to the three main performance bottlenecks.

CPU Bottlenecks

CPU bottlenecks often exhibit the following characteristics:

• High CPU usage due to lots of processes competing for the CPU

• Large number of processes in the CPU run queue

• No disk bottleneck problems; disk utilization is low, few to no I/O requests in the disk queues

• No memory bottleneck problems; vhand not needing much, no paging to swap devices

Disk Bottlenecks

Disk bottlenecks often exhibit the following characteristics:

• High CPU usage due to the disk device drivers constantly executing to perform the I/O and user/system processes continually running to submit the I/O requests

• High disk utilization due to lots of I/O requests being continually submitted

• No memory bottleneck problems; vhand not needing much, no paging to swap devices

Review of Bottleneck Characteristics (slide summary)

CPU bottleneck:     high CPU utilization
Disk bottleneck:    high CPU utilization; high disk utilization
Memory bottleneck:  high CPU utilization; high disk utilization; high memory utilization (with swapping)

Memory Bottlenecks

Memory bottlenecks often exhibit the following characteristics:

• High CPU usage (system) due to vhand constantly running to free memory pages, the kernel spending lots of time in the memory management subsystem, and the device drivers for the disk writing memory pages to and from swap

• High disk utilization due to memory pages being constantly written to and from the swap devices

• High memory utilization (with swapping) due to free memory falling below LOTSFREE, DESFREE, and MINFREE

Given the above recap, in what order should the three main bottlenecks be checked? When arriving on the scene of an unknown system, where do you start? It would be wise to look for the bottleneck with the most specific symptoms, first. Since the memory bottleneck is the only one to show signs of memory pressure, look for it first. Once you have eliminated that, look for disk bottlenecks. Finally, look for CPU bottlenecks.


13–2. SLIDE: Performance Monitoring Flowchart

Student Notes

The above performance monitoring flow chart assumes glance is being used as the performance-monitoring tool. If glance is not available, the same information can be obtained from a variety of other tools, such as sar and vmstat.

The flow chart starts by first looking for symptoms of a memory bottleneck:

• Is memory utilization high?
• Is there activity to the swap device?

Memory bottlenecks are checked for first, since memory bottlenecks often exhibit symptoms of high disk and CPU utilization, which could initially be mistaken for disk or CPU bottlenecks. If the system is not bottlenecked on memory, the second bottleneck checked for through the flow chart is a disk bottleneck:

• Is disk utilization high?
• Are there disk I/O requests in the disk queue?

Performance Monitoring Flowchart

1. Start glance.
2. Look at the memory utilization bar graph. Is memory utilization > 95? If yes: is there activity on the swap device? If yes to both, there is a potential memory bottleneck.
3. Otherwise, look at the disk utilization bar graph. Is disk utilization > 50? If yes: are there disk I/O requests in the queue? If yes to both, there is a potential disk bottleneck.
4. Otherwise, look at the CPU utilization bar graph. Is CPU utilization > 90? If yes: are there requests in the CPU run queue? If yes to both, there is a potential CPU bottleneck.
5. Otherwise, look for other kinds of bottlenecks, e.g. network.
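The flowchart's check order can be sketched as a small decision function. The function name is hypothetical and the thresholds are the ones from the slide; in practice the inputs would come from glance, sar, or vmstat:

```python
def classify_bottleneck(mem_util, swap_active, disk_util, disk_queue,
                        cpu_util, run_queue):
    """Follow the flowchart's order: memory first, then disk,
    then CPU, else 'other' (e.g. network).

    mem_util, disk_util, cpu_util are percentages; swap_active is a
    boolean; disk_queue and run_queue are queue lengths.
    """
    if mem_util > 95 and swap_active:
        return "memory"
    if disk_util > 50 and disk_queue > 0:
        return "disk"
    if cpu_util > 90 and run_queue > 0:
        return "cpu"
    return "other"
```

Note that a memory-starved system also shows high disk and CPU numbers, which is exactly why the memory test must come first in this function.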


Disk bottlenecks are checked for second, as disk bottlenecks often exhibit symptoms of high CPU utilization, but not high memory utilization. If the system is not bottlenecked on disk, the final bottleneck to check for is a CPU bottleneck:

• Is CPU utilization high?
• Are there processes in the CPU run queue?

CPU bottlenecks are checked for after memory and disk bottlenecks, as CPU bottlenecks do not exhibit high memory or disk utilization.

If none of these situations appears to exist, then it is time to check the less common bottlenecks. Networks would be a good possibility, but don't neglect other hardware or even software resources, such as file locks and semaphores.


13–3. SLIDE: Review — Memory Bottlenecks

Student Notes

The primary symptoms of a memory bottleneck include high memory utilization and activity to the swap device. The glance reports that show activity on the swap device include:

(m) Memory Report      - shows the current number of VM reads/writes
(d) Disk Report        - shows VirtMem I/O
(v) I/O by log. volume - shows I/O to the swap logical volumes
(w) Swap Space Report  - shows currently used swap space

Also look at vhand and swapper as processes. Are they accumulating any CPU time? Look at the output of vmstat –S. Are pages being paged out? Are processes being swapped out?

Review — Memory Bottlenecks (slide)

Look at the memory utilization bar graph. Is memory utilization > 95?
If yes, is there activity on the swap device?
    (m) Mem Report  – Look at VM writes.
    (d) Disk Report – Look at Virt Memory.
    (v) I/O by LV   – Look at swap devices.
    (w) Swap Space  – Look at Used (ignore pseudo).
If yes to both: potential memory bottleneck.


13–4. SLIDE: Correcting Memory Bottlenecks

Student Notes The above slide reviews some of the ways to correct a memory bottleneck:

• Limit the maximum size of the dynamic buffer cache. This can help to prevent unnecessary paging during periods when the dynamic buffer cache needs to shrink.

• Identify programs (and users) taking up large amounts of memory, and investigate whether the memory usage is warranted or whether the process has memory leaks.

• Consider using the serialize command to keep several memory intensive programs from competing with each other.

• Consider using the Process Resource Manager (PRM) or Work Load Manager (WLM) to favor memory allocation to important processes.

• Adding more physical memory will always help a memory-constrained system.

Correcting Memory Bottlenecks

• Reduce maximum size of dynamic buffer cache.

• Identify programs with large resident set size (RSS).

• Use the serialize command to reduce thrashing.

• Use PRM or WLM to prioritize memory allocations.

• Add more physical memory.


13–5. SLIDE: Review — Disk Bottlenecks

Student Notes

The primary symptoms of a disk bottleneck include high disk utilization and multiple I/O requests in the disk queue. The glance reports that show disk I/O related activity include:

(u) I/O by Phys. Disk - shows the current number of reads/writes
(B) Global Waits      - shows the percentage of processes blocked on disk I/O
(d) Disk Report       - shows Logical I/O and Physical I/O activity

Also check the output of sar –u (%wio), sar –d, and sar –b (for read cache hit rate and write cache hit rate).

Review — Disk Bottlenecks (slide)

Look at the disk utilization bar graph. Is disk utilization > 50?
If yes, are there disk I/O requests in the queue?
    (u) I/O by Disk  – Look at File System activity.
    (B) Global Waits – Look at Blocked on Disk I/O.
    (d) Disk Report  – Look at Logical I/O to Physical I/O ratio.
If yes to both: potential disk bottleneck.


13–6. SLIDE: Correcting Disk Bottlenecks

Student Notes The above slide reviews some of the ways to correct a disk bottleneck:

• Spread the I/O activity, as evenly as possible, over the disk drives and disk controllers.

• Consider using asynchronous I/O so applications do not have to wait for a physical I/O to complete. The trade-off here is a greater exposure to data loss in the event of a system failure.

• For HFS file systems, increase the fragment and file system block size if large files are being accessed in a sequential manner. For VxFS file systems, increase the block size to improve read-ahead and write-behind. Consider using a fixed extent size.

• Look at customizing file system mount options (especially for VxFS file systems). Recall that, by default, VxFS is mounted to favor integrity, and HFS is mounted to favor performance.

• Consider using vxtunefs to tune the performance of VxFS. Match preferred IO size and read ahead to physical stripe depth.

Correcting Disk Bottlenecks

• Load balance across disk drives and disk controllers.

• Consider asynchronous instead of synchronous I/O.

• Tune file system block and fragment/extent size.

• Tune file system (vxfs and hfs) mount options.

• Tune vxfs file systems with vxtunefs.

• Tune buffer cache for better hit ratios.

• Add additional and faster disk drives and controllers.


• Verify (and tune) the hit ratio on the file system buffer cache. The ratio of logical reads to physical reads should be a minimum of 10 to 1. The ratio of logical writes to physical writes should be a minimum of 3 to 1.

• Add bigger, better, faster disks and disk controllers.
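The 10:1 read and 3:1 write guidelines above are easy to turn into a quick check. The helper below is hypothetical (its name and thresholds come from the text, not from any HP tool); the counters would come from sar -b or the glance Disk Report:

```python
def buffer_cache_ok(logical_reads, physical_reads,
                    logical_writes, physical_writes):
    """Apply the rule-of-thumb ratios from the text:
    logical:physical reads should be at least 10:1 and
    logical:physical writes at least 3:1."""
    read_ok = logical_reads >= 10 * physical_reads
    write_ok = logical_writes >= 3 * physical_writes
    return read_ok and write_ok

# A cache-friendly workload passes; a cache-starved one does not.
print(buffer_cache_ok(10000, 800, 3000, 900))   # True
print(buffer_cache_ok(10000, 2000, 3000, 900))  # False
```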


13–7. SLIDE: Review — CPU Bottlenecks

Student Notes

The primary symptoms of a CPU bottleneck include high CPU utilization and multiple processes in the CPU run queue. The glance reports that show CPU activity include:

(a) CPU by Processor - shows the CPU load average over the last 1, 5, and 15 minutes
(c) CPU Report       - shows CPU activities
(g) Process Report   - shows CPU hogs in order (see note)

Note Make sure you are looking at processes in CPU order. Use the Thresholds Page (o) of glance and set “CPU” as the sort criteria.

Also check sar –u and sar –q. Use the –M option if you have a multiprocessor.

Review — CPU Bottlenecks (slide)

Look at the CPU utilization bar graph. Is CPU utilization > 90?
If yes, are there processes in the CPU run queue?
    (a) CPU by Proc   – Look at Load Average.
    (g) Global Report – Look at Processes Blocked on priority.
If yes to both: potential CPU bottleneck.


13–8. SLIDE: Correcting CPU Bottlenecks

Student Notes The above slide reviews some of the ways to correct a CPU bottleneck:

• Use the nice or renice commands on lower priority processes (set nice value to 21-39). As a rule of thumb, favor I/O bound programs over CPU-bound programs. I/O-bound programs will block frequently, allowing the CPU-bound programs to run.

• Use the nice or renice command on higher priority processes (set nice value to 0-19).

• Use the rtprio or rtsched commands on highest priority processes. BE CAREFUL! A poorly written process could take over your system and render it useless.

• Schedule large batch jobs, long compiles, and other CPU intensive activity for non-peak hours.

• Add an additional CPU or a faster CPU to the system.
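The nice/renice idea above can be sketched from Python's standard library on any POSIX system. This is illustrative, not HP-UX-specific: HP-UX nice values run 0–39, while the portable os.nice() call works with relative increments:

```python
import os

# Query the current nice value without changing it, then lower this
# process's scheduling priority by raising its niceness by 10.
# Unprivileged processes may only increase niceness, never decrease it.
current = os.nice(0)
lowered = os.nice(10)
print(f"niceness went from {current} to {lowered}")
```

From the shell, the equivalent would be starting a batch job with nice, or adjusting a running one with renice, exactly as the bullets describe.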

Correcting CPU Bottlenecks

• Use nice to reduce priority of less important processes.

• Use nice to improve priority of more important processes.

• Use rtprio or rtsched on most important processes.

• Run batch jobs during non-peak hours.

• Add another (or faster) processor.


13–9. SLIDE: Final Review — Major Symptoms

Student Notes

Let's summarize the major bottlenecks and their symptoms:

Memory Bottleneck: You know that you have a memory bottleneck if both vhand and swapper are active. This indicates severe memory pressure!

Disk Bottleneck: A disk bottleneck will be characterized by disk utilization of at least 50% and at least 3 requests waiting in the request queue. If a controller is the bottleneck, you will see multiple disks with lengthy queues on that controller. Their utilization may not be 50%! The queues are more important than the utilization.

CPU Bottleneck: If all of your CPUs are at least 90% busy and they each have run queues with 3 or more processes in them, you have a CPU bottleneck. If one or more of the processors has empty (or mostly empty) queues, either you are at the limit of your CPU resource, or something is unbalancing the loads on your processors.

Network Bottleneck: If the ratio between your collisions/sec and your packets-out/sec is greater than 5%, you have a network bottleneck.

Final Review – Major Symptoms (slide)

Memory Bottleneck:   Both vhand and swapper active
Disk Bottleneck:     Disk utilization > 50%; request queues > 3
CPU Bottleneck:      CPU utilization > 90%; run queues > 3 per processor
Network Bottleneck:  Collisions/out-bound packets > 5%

All conditions sustained over time!


As with any bottleneck symptom, it must be a constant condition – sustained over time to be considered a true bottleneck. Otherwise, it’s a momentary spike which we will keep an eye on, but otherwise ignore.
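The 5% collision rule of thumb can be expressed directly. This helper is hypothetical and simply encodes the threshold from the slide; on a modern switched network collisions are rare, so treat it as a shared-Ethernet-era heuristic fed from counters such as those netstat reports:

```python
def network_bottleneck(collisions_per_sec, packets_out_per_sec,
                       threshold=0.05):
    """Flag a network bottleneck when collisions exceed 5% of
    outbound packets, per the rule of thumb in the slide."""
    if packets_out_per_sec == 0:
        return False
    return collisions_per_sec / packets_out_per_sec > threshold

print(network_bottleneck(8, 100))   # True: 8% collision ratio
print(network_bottleneck(2, 100))   # False: 2% collision ratio
```

As with the other symptoms, the ratio must stay above the threshold over a sustained interval before it counts as a bottleneck.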


Appendix A — Applying GlancePlus Data

This module is an optional self-study for students.

Objectives

Upon completion of this module, you will be able to do the following:

• Use case studies to demonstrate how GlancePlus screens can be used to analyze system performance.

• Observe how a performance specialist approaches a tuning task.


A–1. TEXT PAGE: Case Studies—Using GlancePlus

The case studies stylized in this module come from the logbooks of HP-UX Performance Specialists and are presented for your consideration. The goal is to help you prepare for your own tasks and adventures. The examples show you possibilities and are not intended to be exact recommendations or solutions to situations that you may encounter. These examples may cause you to think up new questions, in addition to answering some of the classic tuning scenarios. As in most endeavors, there is often much to be gained from reviewing someone else's actions and trying to reverse-engineer their solutions.

An Approach to Monitoring System Behavior

The best approach to monitoring your system's performance is to become familiar with how your system usually behaves. This helps you recognize whether a sudden shift in activity is normal or a sign of a potential problem.

The first screen that appears when you start GlancePlus in character mode summarizes system-wide activity and lists all processes that exceed the usage thresholds set for the system. The information on this screen tells you if a resource is being used excessively or a process is monopolizing available resources. The Global screen is the usual starting point for any review of system activity and performance. You can use the statistics on the Global screen to monitor system-wide activity, or you may need to refer to the detailed data screens to focus on specific areas of system usage. The examples in this chapter highlight the use of all GlancePlus screens.

GlancePlus provides you with valuable information, but optimal use of this information depends on how well you understand your system's operation and what is the normal or usual behavior for that system. As you use GlancePlus to review your system's performance, you will learn to recognize patterns that differ from this norm—patterns that may indicate a problem.

Bottlenecks

A bottleneck is the most common type of problem on any system. It occurs whenever a hardware or software resource cannot meet the demands placed on it, and processes must wait until the resource becomes available. This results in blocks and long queues.

Your system handles processes much like a freeway system handles traffic. During normal hours, the freeway adequately carries the traffic load, and cars can travel at optimum speed. But, during rush hour, when too many cars try to access the freeway, the lanes become clogged and traffic can slow to a halt. The freeway becomes bottlenecked.

Similarly, a bottleneck can occur on your system if the processes you are running need more CPU time than is available or more memory than is configured for the system. A bottleneck also can occur if there isn't enough disk I/O bandwidth to move data, or if swap space isn't configured optimally.

A bottleneck can be a temporary problem that is easily fixed. The solution may be to rearrange workloads, such as rescheduling batch programs to run late at night. Solving a disk bottleneck may require only spreading disk loads among all the available disks.


A recurring bottleneck, however, can indicate a long-term situation that is worsening. Perhaps the system was configured to serve fewer users than are now using it, or workloads have gradually increased beyond the system's capacity. The only solution may be a hardware upgrade, but how do you know? If you can identify a bottleneck correctly, you can avoid randomly tuning the system (which can worsen the problem), and you can avoid adding extra hardware that doesn't help performance. You can also avoid expending resources solving a corollary bottleneck—one that is caused by the primary bottleneck.

Characteristics of Bottlenecks

Common system bottlenecks have several general characteristics or symptoms. By comparing these symptoms with the statistics on your GlancePlus screens, you can analyze the performance of your system and detect potential or existing bottlenecks. Although a single symptom may not indicate a problem, a combination of symptoms generally reflects a bottleneck situation.

Symptoms of a CPU Bottleneck

• Long run queue without available idle time
• High activity in user mode
• Reasonable activity in system mode (high activity may indicate other bottlenecks as well)
• Many processes frequently blocked on priority

Symptoms of a Memory Bottleneck

• High swapping activity
• High paging activity
• Very little free memory available
• High disk activity on swap devices
• High CPU usage in system mode

Symptoms of a Disk Bottleneck

• High disk activity
• CPU is idle, waiting for I/O requests to complete
• High rate of physical reads/writes
• Long disk queues

Symptoms of Other I/O Bottlenecks

• High LAN activity
• Low I/O throughputs

You may discover that solving one bottleneck uncovers or creates another. It is possible to have more than one bottleneck on a system. In fact, changing workloads are constantly reflected in changing system performance. The goal is not to seek a final solution, but to seek optimal performance at any given time.


Evaluating System Activity

One afternoon Doug noticed that system response had slowed. He ran GlancePlus and looked at the Global screen to view system-wide activity. He saw that the CPU usage was near 100%. Although this is not necessarily a problem, he decided to check it out.

Doug then looked at the process summary section of the Global screen, which lists all processes that exceed the usage thresholds set for the system. He noted that a single process accounted for a majority of the near-100% CPU usage. Wanting more information on that particular process, he checked it on the Individual Process screen, which provides detailed information about a specific process. Reviewing that screen, Doug noticed the process was doing no I/O and was spending all its time in user code. This suggested that the process might be trapped in a CPU loop. After identifying the user's name, he telephoned the user to find out if the process could be killed.

In this situation, the CPU use for the system did not drop after the user terminated the looping process, because other processes took up the slack. However, response time improved because other processes did not need to wait as often to be given their share of CPU time.

Evaluating CPU Usage

Dean was checking the system one afternoon when he noticed a sudden slowdown in system response time. He ran GlancePlus and looked at the Global screen to view system-wide activity. He saw that the CPU usage was near 100%. The other system resources, such as Memory, Disk, and Swap, showed much less use. Further checking revealed that several processes were blocked due to another process using the CPU (PRI), which meant they were waiting for higher priority processes to finish executing.

Dean accessed the CPU Detail screen to see how CPU time was allocated to different states (activities). He discovered that real-time activities were using a much larger percentage of the CPU than other activities.

Dean returned to the Global screen to check priorities. One user was running with a priority of 127, an RTPRIO real-time processing priority. Dean knew that this particular program is CPU-intensive and that running it at such a high priority would keep other processes from executing. Already it was causing system performance to degrade. He reset the priority for that process to a lower timeshare priority by using the GlancePlus renice command. This allowed other processes more consistent access to the CPU, and system response time improved.

Evaluating Wait States

Jose's system was running fine until he installed a new application. Now, every time the application runs, response time degrades. Since the application is the only change to the system, Jose starts by checking how it is using the system.

Looking at the Glance Individual Process screen, he sees that CPU utilization is about 7 percent, so that isn't the problem. Next, he checks overall CPU utilization on the system; it's averaging about 48 percent, which means there is sufficient CPU resource to accommodate the new application. Jose checks disk I/Os and notices the application is processing about 5 I/Os per second, most of which is virtual memory I/O. That looks slow to Jose, so he looks at the Wait States screen to find out what the process is waiting on.

Jose learns that the process is spending about 7 percent of its time utilizing the CPU (executing), 27 percent of its time waiting for terminal input, and 66 percent of its time waiting on virtual memory. That's a significant amount of time. Jose checks other processes on the system and discovers that they are experiencing similar waits for virtual memory. He realizes that the new application overloads the system's memory. He makes copies of the relevant screens so he can explain the situation to his manager.

Evaluating Disk Usage

Vivian's company often runs processes that tax available memory. She keeps track of the situation by checking the Disk Detail screen, which displays both logical and physical I/O requests for all mounted disk devices. It also categorizes the physical requests as User, Virtual Memory, System, and Raw requests. This screen shows her when large numbers of physical read and write requests are occurring, a situation that results from excessive page faults by processes.

Vivian also checks the virtual memory request rate, since that also will be high when system demand is taxing its physical memory capacity. By paying attention to which processes are active when the virtual memory activity is high, Vivian can make intelligent decisions about redistributing activities to balance the system load. This helps increase overall throughput for the system.

Evaluating Memory Usage

Terri's system was experiencing a slowdown in response time. She checked the Global screen to get an overall picture. All four system resources (CPU, Disk, Memory, and Swap) were near 100%. A large portion of the disk bar showed virtual memory activity, and the swapper system daemon appeared to be running continuously. Terri realized that this indicated a possible memory bottleneck.

She checked the Memory Detail screen, which provides statistics on memory management events such as page faults, the number of pages paged in and paged out, and the number of accesses to virtual memory. The screen showed that Free Memory was 0.0 MB, indicating a lack of usable memory, and that Swap In/Outs were occurring at a rate above 1 per second. Concluding that the problem was a memory bottleneck, Terri returned to the Global screen to study the active processes. She knew that a memory bottleneck can be relieved by adding more memory or by reducing the memory demands of active processes. In this case, she suspected that the high swap rates were caused by the large Resident Set Sizes (RSS) of the most active processes.

One test program showed a large RSS that appeared to grow at a constant rate. Examining the situation more closely, Terri discovered the program had a "memory leak": it allocated memory with malloc() but never released it with free(). The process's memory allocation increased steadily, causing memory pressure on the system. She talked with the developer, who studied the program code and found the memory leak. The test program was changed and recompiled to use far less memory, alleviating the memory bottleneck and improving system response time.


Evaluating I/O by File System

Ingrid noticed that system performance degraded drastically when the system was swapping. Looking at the Global screen, she observed that the swapper process was running and that virtual memory use accounted for a high percentage of the disk utilization. She checked the Disk I/O by File System screen to verify which disk was busiest. This screen provides details of the I/O rates for each file system or mounted disk partition, information that is useful for balancing disk loads. Ingrid saw that one disk, a swap disk, was being utilized more than all the other disks on the system. She decided to add additional swap areas to the system to relieve the load on that one disk. She also might have considered allocating dynamic swap areas on existing under-utilized file systems.

Evaluating Disk Queue Lengths

Ray had already determined that his system had a disk I/O bottleneck. From the Global screen, he noticed that disk utilization was almost always at 100%. He had checked the Disk I/O by File System screen, which showed that several disks were heavily utilized. What Ray wanted was a way to ease the situation. He studied the Disk Queue Lengths screen, which shows how well disks are able to process I/O requests, to determine which of the busy disks had the longest delays for service. He knew that "busy" disks did not necessarily have long queue lengths: high disk utilization is not a problem unless processes must wait to use the disk, just as using a high percentage of the lines on a telephone system is not a problem unless calls cannot get through. Ray also knew that a long queue length meant several disk requests had to wait while the drive serviced other requests, just as incoming calls must wait to connect when all phone lines are busy. Once he had a clear picture of the situation, Ray reduced the large queue lengths by moving several files to different file systems to distribute the workload more evenly.

Evaluating NFS Activity

Paul works on a system that serves as a network file system (NFS) server. One local disk is NFS-mounted by several different nodes on the LAN. One afternoon, Paul noticed poor response time on the system; the file system mounted by the remote systems was very active. Paul reviewed the NFS Detail screen, which provides current statistics on inbound and outbound NFS activity for the local system, to determine which remote system was using the disk the most. He observed a large inbound read rate from one system. This led him to examine that remote system to find out why it was over-utilizing the NFS-mounted disk. His examination pinpointed a single user on the remote system, who was making repeated, unnecessary greps through files on the NFS-mounted disk. Paul explained the problem to the user and worked with her to lessen the heavy disk use. This reduced the load


on the NFS server and improved overall response time.

Evaluating LAN Activity

Lee noticed slow response time for applications using datacom services to access data across the local area network. He checked the LAN Detail screen to see what was causing the problem. The LAN Detail screen describes four functions for each LAN card configured on the system; on networked systems, this information can reveal potential bottlenecks caused by heavy LAN activity. Lee noticed that the collision and error rates were higher than usual. This led him to investigate whether processes were competing for LAN resources or overloading the LAN software or hardware. In this case, an improperly written application using netipc() was causing a bottleneck. Once this program was stopped, other programs using the LAN were able to improve their response time.

Evaluating System Table Utilization

When Debbie ran a program on the system, it failed with this error message:

sh: fork failed - too many processes

To decide whether to reconfigure the value of nproc in her kernel, Debbie needed to find out how much room she had in the process table. She referred to the System Table Utilization Detail screen, which provides information on the use and configured sizes of several important internal system tables. This information shows how well the kernel configuration is tuned to the average workload on the system. Debbie confirmed that she had indeed run out of room in the proc table. She knew that the system buffer cache is usually fully utilized, and that other tables can be monitored proactively so the appropriate kernel variable can be reconfigured before a limit is reached.

Evaluating an Individual Process

Cliff noticed that one process seemed to be running quite slowly. He ran GlancePlus and examined the statistics on the Single Process Detail screen, which provides detailed information about a specific process. Cliff knew that if the process were running slowly because of a memory shortage, he would see an increase in context switches and fault counts. Instead, he noticed that the I/O read and write counts were large and that the process was doing a lot of I/O. He checked what the process had been blocking on and saw a high percentage for disk I/O blocks. He suspected that the process was slowed by competition for disk throughput capacity.


Had the process shown a high percentage of time blocked on priority, it would have meant the process was ready to run but unable to do so because the CPU was being used by processes with higher dispatching priorities.

Evaluating Open Files

Kathryn is developing an application for communicating with remote systems. When a request is received, the application opens a socket and sends the specified data. However, when Kathryn tests the application, no data is received by the remote system. To find out what happened, she checks the Glance Open Files screen. When she looks for the opened socket, she discovers that it was never opened. She returns to her application to look for the coding error.

Evaluating Memory Regions

One day while reviewing Glance's Global Summary screen, Nancy notices that several processes have very large resident set sizes. Could this mean a problem with the applications' memory usage? She wonders if she should begin planning to increase physical memory to accommodate additional users in the future. She knows current system performance is fine and memory size seems adequate, but she wants to prevent any future degradation in performance. Before making any decisions, she reviews Glance's Memory Regions screen to analyze the situation more closely. She discovers that all of the affected processes have a shared memory region of about 200 KB which, when added to the private DATA and TEXT regions, accounts for the large resident set sizes. By checking the virtual address of the shmem region, she determines that the same shared memory region is being used by all the processes. Because it is a shared region, it is physically in memory only once, but Glance displays it for each process attached to it. Nancy smiles when she sees this, because it means no problem exists: by using the shared memory region, the processes are using far less memory than it appears.

Evaluating All CPUs Statistics

Rosalie works on a multiprocessor system. While checking the All CPUs screen one day, she noticed that one CPU seemed to be consistently busier than the others. Realizing that overall system throughput would improve if the load were balanced among the processors, she decided to investigate. As she studied the All CPUs screen, she noticed that one PID always seemed to be the last PID executing on CPU 1, her busiest CPU. When Rosalie checked mpctl(), she saw that the process had been assigned to CPU 1. Using mpctl -f, she reassigned the process as a floater, so that the system could determine which processor should run it. Rosalie then checked for other processes that had been assigned to CPU 1 and reassigned them as floating processes as well. Afterward, she rechecked the All CPUs screen and observed that the load appeared more even among all the processors, alleviating a potential bottleneck on any single CPU.

Evaluating Activity on Logical Volumes

Lately, when Yuki uses GlancePlus to check his system, he notices that the Global Disk Utilization bar (displayed in the top portion of every Glance screen) is often close to 100%. Yuki's system has multiple disk drives, but he knows that the global disk utilization figure reflects the activity of the busiest disk.


Yuki would like to spread the disk I/O more evenly among the drives to avoid potential I/O bottlenecks. With that goal in mind, he first checks the Disk Detail screen, which shows that logical disk activity is high. For details, he goes to the Logical Volumes screen, where he notices high write activity on logical volume /dev/vg00/lvol12. Dropping out of Glance into the UNIX shell, he types vgdisplay -v /dev/vg00 to ascertain the physical disk names associated with the volume. Back in Glance, Yuki views the Disk Queue Lengths screen to determine the busiest disks in the volume. Then he checks the Disk Detail screen to find out whether the disk activity is caused by system or user activity. Yuki notices that the Virtual Memory physical accesses are low, indicating application rather than system activity. He checks the Open Files screen to find out which application is creating so many writes to the disk. Voila! Fred is running his baseball pool again! Yuki pays a visit to Fred. After discussing Fred's I/O needs, Yuki returns to his console and balances the I/O load, using LVM commands to rearrange the logical volumes. Now it's time to grab your toolbox, pop the hood, and take a look.

Good Luck!


Solutions


1–11. LAB: Establishing a Baseline

Directions

The following lab exercise establishes baselines for three CPU-bound applications and one disk-bound application. The objective is to time how long these applications take when there is no other activity on the system. The same applications will be executed later in the course while other bottleneck activity is present, and the impact of those bottlenecks on user response time will be measured through them.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Compile the three C programs long, med, and short by running the BUILD script.

# ./BUILD

3. Time the execution of the long program. Make sure there is no activity on the system.

# timex ./long

Record the execution time: real: _____ user: _____ sys: _____

Answer:

Varies with system configuration, on the order of tens of seconds to minutes. Example output from an rp2430 server:

# timex ./long
The last prime number is : 49999

real 3:37.89
user 3:35.68
sys 0.12

Example output from an rx2600 server:

# timex ./long
The last prime number is : 99991

real 2:53.24
user 2:51.74
sys 0.06


4. Time the execution of the med program. Make sure there is no activity on the system.

# timex ./med

Record the execution time: real: _____ user: _____ sys: _____

Answer:

Varies with system configuration; should be about half of long. Example output from an rp2430 server:

# timex ./med
The last prime number is : 49999

real 1:52.68
user 1:51.55
sys 0.08

Example output from an rx2600 server:

# timex ./med
The last prime number is : 99991

real 1:33.71
user 1:33.02
sys 0.04

5. Time the execution of the short program. Make sure there is no activity on the system.

# timex ./short

Record the execution time: real: _____ user: _____ sys: _____

Answer:

Varies with system configuration; should be about one eighth to one tenth of med. Example output from an rp2430 server:

# timex ./short
The last prime number is : 49999

real 10.88
user 10.70
sys 0.05

Example output from an rx2600 server:


# timex ./short
The last prime number is : 99991

real 8.56
user 8.49
sys 0.03

6. Time the execution of the diskread program.

# timex ./diskread

Record the execution time: real: _____ user: _____ sys: _____

Answer:

Varies with system configuration, on the order of tens of seconds. Example output from an rp2430 server:

# timex ./diskread
DiskRead: System : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real 28.01
user 0.02
sys 0.53

Example output from an rx2600 server:

# timex ./diskread
DiskRead: System : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real 28.69
user 0.01
sys 0.13

7. For the long, med, and short programs, the real time is (approximately) the sum of the user and sys times. This is not the case for diskread. Explain why.

Answer:

We first assume that there is no other load on the system. A classic number-crunching CPU hog (which long, med, and short all are) makes no system calls except for the final terminal output, so the program needs CPU time only in user mode.


As there is only one process, there is no waiting; this is shown by the real time being very close to the sum of the sys and user times for the process. Because long, med, and short only do calculations and make no calls on kernel resources during their execution, the user time is very high compared to the sys time.

This is not the case for diskread. The program makes very little demand on the CPU, shown by the sum of user and sys being quite small compared to the real (wall clock) time. The huge difference between real time and user+sys time shows that the program is waiting on disk I/O most of the time. Also note that sys is much higher than user, meaning that when the program does execute, it is bound on system calls (disk I/O) rather than computation.


1–12. LAB: Verifying the Performance Queuing Theory

Directions

The performance queuing theory states that as the number of jobs in a queue increases, so does the response time of the jobs waiting to use that resource. (This lab uses the short program compiled from /home/h4262/baseline/prime_short.c.) The example figures below are from rp2430 and rx2600 servers.

1. In terminal window 1, monitor the CPU queue with the sar command.

# sar -q 5 200

2. In a second terminal window, time how long it takes for the short program to execute.

# timex ./short &

Answer:

rp2430:

# timex ./short &
[1] 10050
# The last prime number is : 49999

real 10.85
user 10.70
sys 0.05

rx2600:

# timex ./short &
[1] 6486
root@r265c145:/home/h4262/baseline
# The last prime number is : 99991

real 8.59
user 8.50
sys 0.03

How long did the program take to execute? 8 to 11 seconds.
How does this compare to the baseline measurement from earlier? A little longer, due to the overhead of sar.


3. Time how long it takes for three short programs to execute.

# timex ./short & timex ./short & timex ./short &

How long did the slowest program take to execute? _____________________
How did the CPU queue size change in the first window? __________________

Answer:

rp2430:

# timex ./short & timex ./short & timex ./short &
[1] 10203
[2] 10205
[3] 10206
# The last prime number is : 49999

real 29.86
user 10.68
sys 0.01

The last prime number is : 49999

real 32.07
user 10.67
sys 0.01

The last prime number is : 49999

real 32.35
user 10.67
sys 0.01

rx2600:

# timex ./short & timex ./short & timex ./short &
[1] 6690
[2] 6692
[3] 6694
# The last prime number is : 99991

real 25.08
user 8.48
sys 0.00

The last prime number is : 99991

real 25.56
user 8.48
sys 0.00


The last prime number is : 99991

real 25.60
user 8.48
sys 0.00

How long did the slowest program take to execute? 25 to 34 seconds, around three times longer than a single occurrence of the program. If you have a multiprocessor, the work is distributed over the processors, with the lower limit being the time a single process takes. For example, on a two-processor system the slowest process would complete in about half the time it would take on a single-processor system. Since only three processes are running here (not counting sar), three processors and more than three processors would show the same results.

How did the CPU queue size change in the first window? sar -q shows that the average CPU queue length (the first field) increases by about three when three programs run concurrently.

4. Time how long it takes for five short programs to execute.

# timex ./short & timex ./short & timex ./short & \ timex ./short & timex ./short &

How long did the slowest program take to execute? _________
How did the CPU queue size change in the first window? ________

Answer:

rp2430:

# timex ./short & timex ./short & timex ./short & \
timex ./short & timex ./short &
[1] 10212
[2] 10214
[3] 10216
[4] 10218
[5] 10220
# The last prime number is : 49999

real 53.98
user 10.68
sys 0.01

The last prime number is : 49999

real 54.08
user 10.68
sys 0.01


The last prime number is : 49999

real 54.08
user 10.68
sys 0.01

The last prime number is : 49999

real 54.08
user 10.67
sys 0.01

The last prime number is : 49999

real 54.15
user 10.68
sys 0.01

rx2600:

# timex ./short & timex ./short & timex ./short & \
timex ./short & timex ./short &
[1] 6737
[2] 6739
[3] 6741
[4] 6743
[5] 6745
# The last prime number is : 99991

real 42.52
user 8.49
sys 0.00

The last prime number is : 99991

real 42.56
user 8.48
sys 0.00

The last prime number is : 99991

real 42.59
user 8.48
sys 0.00

The last prime number is : 99991

real 42.67
user 8.48


sys 0.00

The last prime number is : 99991

real 42.75
user 8.48
sys 0.00

How long did the slowest program take to execute? 43 to 54 seconds. If you have a multiprocessor, the work is distributed over the processors, with the lower limit being the time a single process takes. For example, on a two-processor system the slowest process would complete in about half the time it would take on a single-processor system. Since only five processes are running here (not counting sar), five processors and more than five processors would show the same results.

How did the CPU queue size change in the first window? It increased by five while the test was running.

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

Answer:

Yes, very much so. The programs in the last case (where five are running) take about five times longer than a single program. You can draw a graph and go to ten programs if you are unsure! Typing the command with more than ten occurrences gets a little tedious, but you will find a linear relationship in any case.

6. Comment on the overhead of switching from one process to another.

Answer:

The overhead of task switching is very low; if it were not, the relationship in the above tests would not be linear. Whatever overhead exists, we are unlikely to see it unless hundreds of processes are being switched.


2–68. LAB: Performance Tools Lab

The goal of this lab is to gain familiarity with performance tools. A secondary goal is to become familiar with the metrics the tools report, although these will be explored in depth over the next few days.

Directions

Set up: Change directories:

# cd /home/h4262/tools

Execute the setup script:

# ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many tools as possible, and include the appropriate option or screen that gives the requested information. Specific numbers are not the goal of this lab; the goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to work through this lab with the solutions from the back of this book for more guidance and discussion. These results were obtained on a C200 workstation running 11i. Remember, the absolute numbers are not important here, but you should draw similar conclusions.

1. How many processes are running on the system?

Which tools can you use to determine this?

Answer:

top     Gives the number of running processes in the summary portion of the screen: 119 processes: 96 sleeping, 17 running, 6 zombies

ps      ps -e | wc -l, then subtract 1 for the header and 1 each for ps and wc.

glance  Look at the table screen (t page) and see the current size of the proc table.

sar     sar -v 2 10; look at the proc-sz field.

gpm     Gives the count at the top of the Process List report.

2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

Answer:

syncer, midaemon, lab_proc2, and sometimes swapper. ttisr and prm3d will also be seen on 11.0/11i systems, running at priority -32. This is the POSIX real-time range, which is even higher than the normal UNIX real-time priorities.

glance  Global screen, PRI column (turn off all filters)

top     PRI column


gpm     Use the filters to filter priorities <128 (Process List/Configure/Filters)

ps -el  PRI column. (Try this command: # ps -el | grep -v PRI | sort -k 7,7n | more. The highest-priority processes will be listed at the top.)

Remember, a real-time priority is anything less than 128.

3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?

Answer:

glance  Go through each single process screen. (The default is 20; 21 to 39 is nice, and 0 to 19 is "nasty," i.e. anti-nice.)

gpm     Process List; select a process by double-clicking.

top     NI column

ps -el  NI column. (Try this command: # ps -el | grep -v NI | sort -k 8,8n | more. The nasty processes will be listed at the top and the niced processes at the bottom.)

On 11i the following were "nasty": diagmond, diaglogd, psmctd, memlogd, krsd. The following were "nice": all six <defunct> zombie processes (see below) and lab_proc4.

4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?

Answer:

A zombie is a terminated process whose parent is still running but has not yet called wait() for the child.

Zombies whose parent has terminated are eventually adopted by the init process, which issues a wait() on them. Therefore, a zombie whose parent has terminated should eventually disappear.

What resources do zombies consume? Memory (<= 20 pages) and process table entries.


top             The number of zombie processes is shown in the summary portion of the screen: 119 processes: 96 sleeping, 17 running, 6 zombies

glance and gpm  By design, they do not currently report zombies, unless the process entered the zombie state during the interval.

ps -el          Z in the S(tate) column and <defunct> in the COMMAND column

5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?

Answer:

glance  CPU screen (c), page 2, shows a RUNNING LOAD AVERAGE, which older versions mislabeled as the run queue. The All CPUs screen (a) shows the 1-, 5-, and 15-minute load averages, whereas the CPU screen shows the interval load average.

gpm     The CPU button or Reports/CPU Graph shows the interval load average. Reports/CPU Report shows the interval load average. Reports/CPU by Processor shows the 1-, 5-, and 15-minute load averages.

uptime  1-, 5-, and 15-minute load averages

top     Load averages: 5.39, 5.27, 5.20 and the interval load average

sar -q  Average run queue size over the interval

vmstat  The r column is the run queue size over the interval.

xload   10-second load average over time

The run queue length on the test system was around 5 no matter how it was measured. There is also a hardware-dependent approach that can be used on servers using the console Hex display code….

The HEX display (front panel or console) shows the size of the run queue in its second digit: F31F means there are three processes in the run queue and one CPU; FA1F means 10 or more in the run queue. MPE uses this as a percent-utilization number. The run queue is an instantaneous value and can never be a fractional number; the load average is based on the run queue but includes short sleepers (discussed in the CPU section).

6. How many system processes are running? What tools can you use to determine this?

NOTE: A system process is defined as a process whose data space is the kernel's data space (i.e. swapper, vhand, statdaemon, unhashdaemon, supsched, etc.). ps reports their size as zero; other tools report the sizes shown below.

There are three ways this can be determined. If you get stuck on this question, move on; don't spend more than a few minutes trying to answer it.


Answer:

top     PA-RISC: RES = 16K (32-bit kernel) or 32K (64-bit kernel) per thread. IA-64: RES = 80K per thread.

glance  PA-RISC: 16K (32-bit kernel) or 32K (64-bit kernel) per thread on the global screen. IA-64: 80K per thread on the global screen.

ps -el  The second bit in the F column value indicates a system process (see the ps man page): F column = 3, PPID column = 0, and SZ column = 0. (Try this command: # ps -el | grep " 3 " | more. This will list all the system processes. No, technically, init is NOT a system process.)

This amounts to 17 processes on the test 11i system.

7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?

Answer:

glance   Bar graph; the CPU screen (c) displays detailed CPU state information, and the per-process screen (S) details per-process CPU utilization. (The display can be toggled between cumulative/interval (C) and percent/absolute (%).)

gpm      Main window, Reports/CPU or Reports/CPU by Processor

sar      user/system/waiting for I/O/idle

top      user/nice/system/idle/block/swait/intr/ssys (context switch). See /usr/include/sys/dk.h for CPUSTATE (CP_USER, etc.). block is the spinlock percentage (MP systems only; obsolete at 11i). swait is the alpha semaphore percentage (MP systems only; obsolete at 11i).

vmstat   user/system/idle

iostat -t us/ni/sy/id

NOTE: Always watch the first line of vmstat or iostat output; it is the average since bootup. Use vmstat -z to clear the sum structures for vmstat. There is no similar option for iostat.

8. What is the size of memory? What is the size of free memory? What tools can you use to determine this?

Answer:

glance  M(emory) screen: Free Memory, Phys Memory, Avail Memory, Total VM, Active VM, Buf Cache Size

gpm     Reports/Memory Report

vmstat  free (in pages); avm (active virtual memory, in pages) includes on-disk pages


top        Real (real active), virtual (virtual active), and free, in KB

/etc/dmesg Amount of physical and available memory

The memory stats from top are misleading. The values in brackets are figures for processes that are regarded as "busy," whatever top means by that; in most utilities, "busy" or "active" means that the process is in the RUN state or has executed within the last 20 seconds. The "real" figures are a summation of the resident set sizes of all processes (the sum of the RES field); this is not the amount of physical memory in the system.

The only way to get the true physical memory in the system is through glance/gpm or dmesg. The boot information in dmesg containing the physical memory figure will be lost if general console messages (e.g. file system full) have overwritten the limited buffer space.

vmstat figures are generally accurate, with the free field agreeing well with glance/gpm and top. Remember that top reports memory in 1 KB units while vmstat reports memory in 4 KB pages, so multiply the vmstat figures by 4 to compare them to top.

9. What is the size of the swap area(s)? What is the percentage of swap utilization? What tools can you use to determine this?

Answer:

glance   Bar graph (reserved/used); w(swap): %used (device and filesys), MB reserved, MB available and MB used by swap device

gpm      Reports/Swap Space (glance w)
         Reports/System Table Info/System Tables Graph Report

NOTE: The graph shows the high water mark (nice!).

sar -w       N/A (only shows size of swap queue and swapping rates)
vmstat -S    N/A (only shows paging and swapping rates)
top          N/A
swapinfo -t  KB avail/used/free/%used by swap device
bdf -b       File system swap space used/avail

swapinfo can be misleading unless you know what you are looking at. To remove the confusing issue (pseudo swap), run swapinfo -t, which correctly calculates and includes pseudo swap; take the "total" figures. This will be explained in detail in the module on swap space management.

# swapinfo -t
             Kb       Kb       Kb   PCT  START/      Kb
TYPE      AVAIL     USED     FREE  USED   LIMIT RESERVE  PRI  NAME
dev      524288    11276   513012    2%       0       -    1  /dev/vg00/lvol2
reserve       -   204360  -204360
memory   180940    84488    96452   47%
total    705228   300124   405104   43%       -       0    -
#

Note that we have used 43% of our swap space, not 2%.
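The arithmetic behind the "total" line can be checked by hand; a minimal sketch using the sample figures above:

```shell
# Sketch: overall swap utilization from the swapinfo -t "total" line
# (use the totals, not the per-device 2% figure).
total_avail=705228
total_used=300124
# integer percentage, rounded to the nearest whole number
pct=$(( (total_used * 100 + total_avail / 2) / total_avail ))
echo "swap utilization: ${pct}%"
```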


10. What is the size of the kernel's incore inode table? How much of the inode table is utilized? What tools can you use to determine this?

Answer:

glance   t(able) screen, page two
gpm      Reports/System Table Info/System Tables Report (NOT Graph)
sar -v   Used/size/overflows

11. Are there any CPU bound processes running (processes using a “lot” of CPU)?

If so, what is the name of the process? What steps did you take to determine this?

Answer:

glance   Global screen and single process screen
gpm      Can sort by CPU utilization and can filter by > 0
top      Automatically lists the processes by CPU utilization
ps -el   CPU hogs often have large C counts (<=255)
         (Try this command: # ps -el | grep -v " C " | sort -k 6,6n
         This will list the most active processes at the end.)

lab_proc5 and lab_proc3 are the main CPU users. They are consuming close to 100% of the CPU between them. This is not normal behavior!

12. Are there any processes running which are using a "lot" of memory? (A "lot" is relative, i.e. a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?

Answer:

glance   Global screen: RSS; Per-process screen: RSS and VSS sizes
gpm      Reports/Process List (glance g)
top      SIZE (KB: text/data/stack), RES (KB: resident size)
ps -el   SZ column (size, in pages, of the core image including only text + data + stack)
         (Try this command: # ps -el | grep -v SZ | sort -k 10,10n

Remember that the inode cache may contain entries for files that are already closed; an entry is flushed only when its slot is needed to open a new file. Its size is the maximum number of unique files that can be open system-wide.


This will list the largest processes at the end.) lab_proc1 has a much larger SZ (ps -el output) than most other processes. This program is 8 MB in core and could be regarded as a memory hog. Remember that SZ is in pages; multiply by 4 KB to get the actual figure.

13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk bound processes? What files are open by this (these) process(es)?

NOTE: No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.

Answer:

glance   i screen will periodically show lab_proc3 as the largest disk user
         s(ingle) process screen, open files, will show the actual open files and offset, which MIGHT be indicative of the amount of I/O
gpm      Reports/Process List
sar -d   Reports physical disk I/O for the system overall
sar -b   Reports and compares logical I/O to physical I/O

Notice sar -b reporting very high logical read I/Os. The lab_proc3 process is very busy with disk reads, but the system has cached all the data in the buffer cache, preventing physical disk I/O.

# sar -b 2 2
HP-UX workstn B.11.11 U 9000/782    01/22/01

15:39:38  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
15:39:40        0    19646      100        1        2       60        0        0
15:39:42        0    21454      100        0        3       83        0        0
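The %rcache column is derived from the block and logical read counts; a minimal sketch with the sample values above:

```shell
# Sketch: read cache hit ratio = (logical reads - physical block
# reads) / logical reads, expressed as a percentage.
lread=19646    # logical reads per second (lread/s)
bread=0        # physical block reads per second (bread/s)
hit=$(( (lread - bread) * 100 / lread ))
echo "read cache hit ratio: ${hit}%"
```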

14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?

Answer:

sar -m   The ONLY tool to show message and semaphore ops/sec
glance   The single process screen shows messages sent/received

Semaphore and message usage was effectively zero in the lab test, as none of the test programs manipulate semaphores or messages. These resources will be covered in a later module. Relational databases (Oracle, Informix, Sybase, etc.) are big users of such resources.

# sar -m 2 2


HP-UX workstn B.11.11 U 9000/782    01/22/01

15:41:58   msg/s  sema/s
15:42:00    0.00    3.98
15:42:02    0.00    2.00

15. Is there any paging or swapping occurring? What tools can you use to determine this?

Answer:

glance   m(emory) screen: page faults, paging requests, KB paged in, KB paged out, deactivations/reactivations, KB swapped in, KB swapped out, VM reads, VM writes

gpm      Reports/Memory Graph (or Memory button): page OUTS, swap OUTS
         Reports/Memory Report: in/out/etc.

sar -w      Swapping only
vmstat      Paging (pi/po)
vmstat -S   Swapping (si/so) and Paging (pi/po)

In terms of simple UNIX commands, vmstat is the way to go. The sar command does not understand paging (more in the module on memory management!) and measures the swap rate only. See the pi and po fields below from vmstat. This system is not paging at all, so we can be confident that there is no swapping activity.

# vmstat 2 3
procs         memory             page                         faults        cpu
 r  b  w     avm   free  re at  pi  po  fr  de  sr   in    sy   cs  us sy id
 4  0  0   78478   1593   6  1   0   0   0   0   0  108  2096  171   3  2 94
 4  0  0   78478   1552   2  0   0   0   0   0   0  117

glance/gpm do give good paging and swapping detail on the “m” screen and the data should tie in with vmstat.
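Scanning for nonzero pi/po can be mechanized; a sketch over two made-up vmstat-style data lines (the column positions assume the layout shown above):

```shell
# Sketch: flag any vmstat interval with nonzero page-in (pi, field 8)
# or page-out (po, field 9) activity. The two input lines are sample
# data standing in for real vmstat output.
printf '4 0 0 78478 1593 6 1 0 0\n4 0 0 78400 1500 2 0 3 1\n' |
awk '$8 > 0 || $9 > 0 {print "paging: pi=" $8 " po=" $9}'
```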

16. What is the system call rate? What tools can you use to determine this?

Answer:

glance   CPU screen, page 2; single process screen (s), then (L): reads/writes/opens/closes/ioctls/forks/vforks/messages sent and received
gpm      Reports/CPU Report; Reports/Process List, select for single process screen
sar -c   Total/reads/writes/forks/execs
vmstat   First sy column (under "faults")

sar and vmstat give good data that should agree here. The system call rate can be used as an indication of how busy your system is once you have established the normal range for its value. There is no absolute “good” or “bad” figure as this depends on:

Page 549: 81650923 HP UX Performance and Tuning H4262S

Solutions

http://education.hp.com H4262S C.00 2004 Hewlett-Packard Development Company, L.P.

Solutions-19

a) How powerful your system is.
b) How many CPUs you have.
c) What processes you are running.

Example live data from the test system (C200 running 11i):

# sar -c 2 10
HP-UX workstn B.11.11 U 9000/782    01/22/01

16:03:24   scall/s  sread/s  swrit/s  fork/s  exec/s    rchar/s  wchar/s
16:03:26      4417     1205     1188    0.00    0.00   85899336     2038
16:03:28      4623     1252     1249    0.00    0.00   88524288     4096

Note the system is doing over 4,000 system calls per second (scall/s), over half of which can be attributed to reads (sread/s) and writes (swrit/s). See how dramatically this number can be changed by adding a simple extra process (you might like to try this while monitoring sar -c).

# dd if=/stand/vmunix of=/dev/null bs=64 &
# sar -c 2 10
HP-UX workstn B.11.11 U 9000/782    01/22/01

16:10:37   scall/s  sread/s  swrit/s  fork/s  exec/s     rchar/s  wchar/s
16:10:39     21528    10461     9586    4.98    4.98   172712128    16302
16:10:41     19882     7790     6878    5.00    5.00   148155392     7168

The glance/gpm tools become invaluable when you want to know which processes are hitting the system with so many system calls. More on this later.
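The same effect can be reproduced safely on any system; a sketch using /dev/zero instead of the kernel file (the block count is arbitrary):

```shell
# Sketch: a 64-byte block size forces dd to issue one read and one
# write system call per block, inflating the system call rate. Watch
# sar -c (or vmstat's sy column) in another window while it runs.
dd if=/dev/zero of=/dev/null bs=64 count=10000 2>/dev/null
echo "dd issued roughly $((10000 * 2)) read/write system calls"
```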

17. What is the buffer cache hit ratio? What tools can you use to determine this?

Answer:

sar -b   Read and write hit ratios
glance   Disk screen (d), page 2 shows both ratios
gpm      Reports/Disk Report shows both ratios

See answer 13 for example output of sar -b.

18. What is the tty I/O rate? What tools can you use to determine this?

Answer:

sar -y
iostat -t

The quickest tool to use here is iostat -t. In general, modern system administrators care less and less about terminal I/O, as almost all users connect to application servers and services over LAN networks. An exception to this rule would be the case of modems. A system with multiple modems may experience a modem "storm", with meaningless data being fired at the host by a bad modem line. iostat -t will catch this problem as a high tin (tty characters read per second) value.

# iostat -t
      tty          cpu
 tin  tout     us ni sy id
   0     5      3  1  3 94

19. Are there any traps (interrupts) occurring? What tools can you use to determine this?

Answer:

vmstat –s Traps since bootup (should probably zero it out first)

NOTE: Examples of traps include: page faults, overflow/underflow (integer and floating point), HPMC/LPMC, floating point emulation traps.

When a trap occurs, the normal flow of a program is interrupted and work is done to take care of a problem before normal program instructions can continue. For example, trying to access a data page which is not in memory but out on disk results in a "page fault", causing the execution of the program to stop while the required data comes in from disk. As with system call rates (see 16), there is no "good figure", but you are advised to monitor the trap rate as a "sanity" reference. Clear the vmstat counters with vmstat -z first. Note that the numbers seen have been generated in the time between the two vmstat commands! Example data from our C200 running 11i (the output has been reduced):

# vmstat -z
# vmstat -s
    12 swap ins
    12 swap outs
     0 pages swapped in
     0 pages swapped out
  8636 total address trans. faults taken
  2633 page ins
     0 page outs
    20 pages paged in
     0 pages paged out
  6594 cpu context switches
  7640 device interrupts
 11335 traps
153724 system calls
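Because the vmstat -s counters are cumulative, a rate requires two snapshots; a sketch with hypothetical counter values:

```shell
# Sketch: trap rate computed from two cumulative "traps" counts taken
# N seconds apart (the counts below are made up for illustration).
traps1=11335
traps2=13567
interval=10    # seconds between the two vmstat -s snapshots
echo "trap rate: $(( (traps2 - traps1) / interval )) per second"
```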


20. What information can you collect about network traffic? What tools can you use to determine this?

Answer:

glance   l(an) screen: packets in/out, collisions, errors
         NFS global screen (N): rd/wr rates, calls, response time, etc.
         NFS by system (n): reads/writes/response time by system, for both client and server requests

gpm Reports/LAN Graph (or NW button): packets in/out per second, Reports/Network by LAN, Reports/NFS Global Activity, Reports/NFS by system, Reports/NFS by operation

netstat   Sockets in use
  -m      memory buffers in use (NOTE: no longer works in HP-UX 11.X)
  -i      packets in/out, errors in/out, collisions by interface
  -s      packets, bytes, retransmissions, duplicates, acks, checksum errors, timeouts, etc.
  -rs     routing statistics

nfsstat   Server RPC and NFS stats; client RPC and NFS stats

A later module will cover networking performance issues in more detail. The most important performance metric in a CSMA/CD (Ethernet or 802.3) network is the collision rate. This is available in glance/gpm, and in the Network module we will learn how to extract this data using lanadmin.
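Collision rate is usually expressed as collisions per output packet; a sketch with hypothetical netstat -i style counters:

```shell
# Sketch: collisions as a percentage of output packets. The counter
# values are made up; on a real system take the Opkts and Coll
# columns from netstat -i for the interface of interest.
opkts=250000   # output packets (hypothetical)
colls=3750     # collisions (hypothetical)
awk -v o="$opkts" -v c="$colls" 'BEGIN {printf "collision rate: %.1f%%\n", c * 100 / o}'
```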

21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?

Answer:

glance        a(ll CPU) screen: detailed utilization and load averages by CPU
gpm           Reports/CPU Info/CPU by Processor: util, ld avg, CS rate, fork rate, last PID
sar -Mu/-Mq   Utilization/queue lengths by CPU (with -u or -q)
top           CPU number on which a process is assigned, and utilization per CPU

sar –M output has changed at 11i. The output of sar –M will look identical to sar –u (or –q) if the system only has one cpu. For MP systems you are presented with the sar ( -u or –q) data on a per CPU basis. This becomes helpful in measuring the balance of processes across processors. top displays its MP information by default, giving the cpu reference for each process. This information is hidden if there is only one processor or if the –h option is used. The a page of glance also measures the balance per CPU and indicates the last process to run on any given CPU. Very useful.


22. What information can be gathered on Logical Volumes? What tools can you use to determine this?

Answer:

glance       v(LVM) screen: reads/writes/MWC hits and misses by LV or VG
gpm          Reports/Disk Info/I/O by LV
vgdisplay,   General information on volume groups, logical volumes and
lvdisplay,   physical volumes; use -v for details
pvdisplay
bdf, mount   Information on file systems on logical volumes

Physical disk layout (the positioning of data on disk) is important for performance. The lvdisplay -v and pvdisplay -v commands are the best way of finding out the precise layout of logical volumes on physical disks. In a later module we will look in detail at the mirroring and striping techniques used to manipulate physical disk layout to our advantage.

23. What information can be gathered on Disk I/O? What tools can you use to determine this?

Answer:

glance   d(isk): logical/physical reads/writes, user/VM/system/raw/NFS
         i(o): by file system, logical/phys/VM
         u(queue): queue length and utilization by spindle
         v(LV): see above
gpm      press disk bottleneck button (queue)
         Reports/Disk Info/Disk Report (glance d)
         I/O by disk (glance u + type [phys, logl, VM, FS, System, RAW])
         I/O by FS (glance i + blocksize, util, logl, sys, VM)
         I/O by LV (glance v)

iostat   KB/sec, seeks/sec, millisec/seek by spindle (NOTE: millisec/seek is no longer reported; it permanently reads 1.0)

sar -d   %busy, average queue, I/Os per sec, blocks/sec, average wait time, average service time

iostat is a redundant tool because its data is not as useful or as accurate as that obtained from sar -d. The most important place to start looking for disk I/O info is the disks themselves. sar cannot understand LVM layouts and only "sees" the disk as a whole. Use glance/gpm on the individual disks once you have identified them with sar -d. Below is some example data collected at the start of the tools lab. Stop the lab with ./KILLIT and start it again with ./RUN to see some disk I/O.

# ./KILLIT
Killing the lab procs
Removing the files
# ./RUN
cc -wall +DAportable cpu_hog.c -o cpu_hog
cc -wall +DAportable vm_bnd.c -o vm_bnd
cc -wall +DAportable io_bnd.c -o io_bnd


cc -wall +DAportable zombie.c -o zombie

# sar -d 2 4
HP-UX workstn B.11.11 U 9000/782    01/22/01

17:22:54   device   %busy   avque   r+w/s  blks/s  avwait  avserv
17:22:56   c0t6d0    3.50    0.50       6     160    3.55    9.24
17:22:58   c0t6d0    2.50    0.50       4      84    1.90   10.75
17:23:00   c0t6d0    3.00    0.50       5      84    2.57    8.75
17:23:02   c0t6d0    5.50    0.50       6     159    3.76   14.07
Average    c0t6d0    3.62    0.50       5     122    3.04   10.90

We would not consider 3-5% busy as being a bottleneck here. We will see much higher disk loads later!
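Picking out hot spindles from sar -d style data can be mechanized; a sketch over two made-up device lines (the 50% threshold is an arbitrary choice, not an HP recommendation):

```shell
# Sketch: flag devices whose %busy exceeds a threshold. Input here is
# simplified "device %busy" pairs; real sar -d output has more columns.
printf 'c0t6d0 3.62\nc2t5d0 87.10\n' |
awk '$2 > 50 {print $1 " may be a bottleneck (" $2 "% busy)"}'
```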

Shut down the simulation by entering:

# ./KILLIT


3-18. LAB: gpm and glance Walk-Through

Directions

The following lab is intended to familiarize the student with gpm and glance. To achieve this, the lab walks the student through a number of windows and tasks in both the X-Windows version (gpm) and the ASCII version (glance).

The Graphical Version GlancePlus

1. Log in. If you have not already done so, please log into the system with the user name and password provided by your instructor.

2. Start GlancePlus. From a terminal window, invoke GlancePlus by entering gpm.

# gpm

In a few seconds gpm will come up. The first thing you will see is a license notification informing you that you are starting a trial version of GlancePlus, along with ordering and technical support information.

On the gpm Main screen, you will see four graphs: CPU, Memory, Disk, and Networking. By default, the graphs are in the resource history format. This means that for each interval (configurable) there will be a data point on the graph, up to the maximum number of intervals (also configurable).

3. Interval Customizations. Click on Configure in the menu bar, and select Measurement. Set the sample interval to 10 seconds and the number of graph points to 50. This will allow you to see up to 500 seconds of system history. Click on OK.

NOTE: This setting will be saved for you in your home directory, in a file called $HOME/.gpmhp-system_name. This means that all GlancePlus users will have their customizations saved.

Start a program from another window:

# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window. Below each graph within the GlancePlus Main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph.

To view the advisor symptoms from the main window, select:

Adviser -> Edit Adviser Syntax

This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.


View CPU details:

Click the CPU button.

To view a detailed report regarding the CPU, select:

Reports -> CPU Report

Select:

Reports -> CPU by Processor

This is a useful report, even on a single processor system.

5. On Line Help. One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ? .

Click on the column heading, NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including the SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms. A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own.

An alarm is simply a notification that a symptom has been detected. From the main window, select:

Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window. Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details. Close all windows except for the main window. Select:

Reports -> Process List

This shows the “interesting” processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

Configure -> Choose Metrics


This will display an astonishing number of metrics, which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related metrics available in GlancePlus. Note that the familiar ? button is also available from this window.

Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations. Most display windows can be customized to sort on any metric, and to arrange the metrics in any user-defined order. To define the sort fields, select

Configure -> Sort Fields

The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.

To sort on cumulative CPU percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved, so that the next time processes are viewed this will remain the sort order.

In a similar fashion, the order of the columns can also be arranged. To define the column order, select

Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted, and click on the column at that position. Arrange the first four columns in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved, so that the next time processes are viewed this will remain the display order.

9. More Customizations. It is possible to modify the definition of interesting processes by selecting:

Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed. Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

Change Enable Filter to ON


Change Filter Relation to >=
Change Filter Value to 3.0
Change Enable Highlight to ON
Change Highlight Relation to >=
Change Highlight Value to 3.0
Change Highlight Color to any LOUD color

Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities. There are two administrative capabilities with GlancePlus. If working as root, processes in the Process List screen can be killed or reniced.

In the Process List window, select the proc8 process. To access the Admintools, select:

Admin -> Renice

Use the slider to set the new nice value for this process to +19, then click OK. Note the impact on this process. Now, select the proc8 process again. Select:

Admin -> Kill

Click OK, and note that the process is no longer present.

11. Process Details. Detailed metrics can be obtained on a per process basis. To view process details, go to the Process List window and double click on any process.

Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated.

After surveying the information available through this window, close it and return to the Main window. There are many other features available in GlancePlus, with close to 1000 metrics. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open all previously open windows.

12. Exit GlancePlus. From the Main window, select:

File -> Exit GlancePlus


13. Glance, the ASCII version. From a terminal window, which has not been resized, type glance.

NOTE: Never run glance or gpm in the background.

If you are accessing the ASCII version of glance from an X terminal window, make sure you start an hpterm window to enable the full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider; however, making it longer is frequently of no use.

# hpterm &

In the new window:

# glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen. Type g to go to the Main Process Screen. This lists all interesting processes on the system.

Retrieve online help related to this window by typing h, which brings up a help menu. Select:

Current Screen Metrics

Use the cursor keys to select

CPU Util

NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the “Page Down” key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e. From the main Help menu, select:

Screen Summaries

Use the cursor keys to select Global Bars

From this help description, explain what R, S, U, N, and A mean in the CPU Util Bar. Exit the online help Global Bar description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e. At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.


15. Modify Interesting Process Definition. From the main Process List window (select g), view the interesting processes. What makes these processes interesting? Type o and select 1 (one) to view the process threshold screen.

Cursor down to the Sort Key field, and indicate that the processes should be sorted by CPU usage. While confirming that the other options are correct, note that any CPU usage (greater than zero) or any disk I/O will cause a process to be considered interesting. Run the KILLIT command to stop all lab loads.

16. Glance Reports. This is the free form part of the lab. Spend the rest of your lab time going through the various Glance screens and GlancePlus windows. Use the table below to produce the different performance reports.

Feel free to use this time to ask the instructor "How Do I . . .?" types of questions.

Glance   Function                          GlancePlus (gpm) Report
*a       All CPUs performance stats        CPU by Processor
b        Back one screen
*c       CPU utilization stats             CPU Report
*d       Disk I/O stats                    Disk Report
e        Exit
f        Forward one screen
*g       Global process stats              Process List
h        Help
*i       I/O by file system                I/O by Filesystem
j        Change update interval
*l       LAN stats                         Network by LAN
*m       Memory stats                      Memory Report
*n       NFS stats                         NFS Report
o        Change threshold options
p        Print current screen
q        Quit
r        Redraw screen
*s       Single process information        Process List, double-click process
*t       OS table utilization              System Table Report
*u       Disk queue length                 Disk Report, double-click disk
*v       Logical Volume Manager stats      I/O by Logical Volume
*w       Swap stats                        Swap Detail
y        Renice process                    Administrative capabilities
z        Zero all stats
!        Shell escape
?        Help with options
<CR>     Update screen data


4–16. LAB: Process Management

Directions

The following lab is designed to manage a group of processes. This includes observing the parent-child relationship and modifying process nice values (and thus, indirectly, priorities) with the nice/renice commands.

Modifying Process Priorities

This portion of the lab uses glance to monitor and modify nice values of competing processes.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Start seven long processes in the background.

# ./long & ./long & ./long & ./long & ./long & ./long & ./long &
[1] 15722
[2] 15723
[3] 15724
[4] 15725
[5] 15726
[6] 15727
[7] 15728

3. Start a glance session. Answer the following questions.

How much CPU time is each long process receiving? _________sec ________%

Answer:

Hint: Change the sample period to 10 secs (hit the j key). This will give you more time to think and makes “per second” calculations easier!

The CPU should be balanced between the seven processes, with each getting around 14% of the CPU (i.e. 5/7 seconds each for a 5-second interval, and 10/7 seconds each for a 10-second interval). This is seen in the CPU Util field of the main glance window. Notice that the programs all have similar priorities, around 248-249, which is towards the bottom of the pile. If you have a multiprocessor, the processes will quickly distribute themselves among all available processors; however, the overall metrics should stay the same, with the exception of the overall length of time that the processes take.
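The expected per-process share follows directly from the process count; a minimal sketch:

```shell
# Sketch: N equal-priority CPU hogs on a single CPU each get roughly
# 100/N percent of the processor (7 hogs -> ~14% each, as observed
# in this lab).
nprocs=7
awk -v n="$nprocs" 'BEGIN {printf "expected CPU share: %.1f%%\n", 100 / n}'
```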


How are the processes being context switched (forced or voluntary)? ______________

Answer:

Select one of the “long” processes using the glance s key. Make sure the PID being suggested is the right one or enter the correct PID.

In the first column of info you will find the “Forced CSwitch” and “Voluntary CSwitch” metrics.

You will notice, when you compare the two figures, that (almost!) all context switches are forced. This is normal for a CPU hog process: it never leaves the CPU of its own accord and is always told to leave by the scheduler. We saw 7.7-9.6 context switches per second for the period for each of the processes on an rp2430. All of the context switches were forced. On a multiprocessor, there would be the same number of context switches taking place, but fewer processes would be sharing the same processor.

How many times over the interval is the process being dispatched? ___________

Answer:

Again, we can look at the first column of the selected process's resource summary page. Find the "Dispatches" metric. This is a measure of how often the process gets onto the CPU, with the summation of "Forced CSwitch + Voluntary CSwitch" measuring how often the process gets switched out. On a multiprocessor, each processor would have fewer processes wanting its resource, so each process would be selected more often.

What is the ratio of system CPU time to user CPU time? ____________

Answer:

Look at the first column of the selected process info again and you will find the "System CPU" metric. This will be zero or close to zero on any system. By using the C (upper case) key we can switch between metrics for the last interval (10 seconds if you are following the solutions) and the total over the period of tracking. It makes no difference how you look at it: these processes do not make system calls. They are typical CPU hogs that crunch numbers and do nothing else. All the CPU is User/Nice/RT.

What are the processes being blocked on? __________________

Answer: PRI (priority)

The most frequent event that is blocking the process is shown by the "Wait Reason" metric at the bottom of the first column of the Process Resource info (the same page we have been looking at all along). In this case it is PRI, short for Priority.

Page 562: 81650923 HP UX Performance and Tuning H4262S

Solutions

H4262S C.00 http://education.hp.com 2004 Hewlett-Packard Development Company, L.P.

Solutions-32

The process has been blocked because it is timeslicing with all the other processes. Each time it is switched out, it is placed at the end of the queue in true round-robin fashion; it is then no longer the most eligible process to run and the scheduler chooses another. For more statistics, go to the Wait States page for this process (softkey F2, or press W) and notice that the process is blocked on Priority for 80-90% (6/7) of the time; the rest of the time it is on the CPU. There are no other active wait states. The seven long processes are in a circular fight to get to the top of the pile(s).

What are the nice values for the processes? _______

Answer:

24. A Bourne-based shell (Bourne, Korn, POSIX, bash) always places background processes at a nice level 4 higher than the calling shell. The standard nice value of our shell is 20, so the child background jobs inherit a nice value of 24. One exception is the C shell, which runs background processes at the same nice value as the shell.
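The +4 background offset can be illustrated with the nice(1) command itself; a minimal sketch (not HP-UX-specific, and assuming nice(1) with no arguments prints the current niceness, as it does on most modern systems):

```shell
# Sketch: nice(1) with no arguments prints the shell's own niceness;
# starting a child via 'nice -n 4' mimics the +4 offset a Bourne-family
# shell applies to background jobs.
base=$(nice)
child=$(nice -n 4 nice)
echo "shell niceness: $base, background-style child: $child"
```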

4. Select one of the processes and favor it by giving it a more favorable nice value.

What is the PID of the process being favored? ____________

Answer:

To change the process's nice value, enter:

# renice -n -5 <PID of selected process>

Be careful! This forces a negative offset of 5 from 20 (the standard nice value), not from the current nice value (24). The nice value in this case will end up at 15, which is more favorable than the others, still at 24. Watch that process's percentage of the CPU over several display intervals with glance or top.

What effect did it have on the process? _____________________________

_______________________________________________________________________

Answer:

The process races away from the others, consuming approximately 50-60% of the CPU. This may take a little time to settle; give it several intervals to complete its adjustment.
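As a sanity check on the arithmetic above, a trivial sketch of how the -5 offset applied to the base value of 20 lands at 15:

```shell
# HP-UX renice -n applies the offset to the standard base nice value (20),
# not to the process's current value (24).
base=20
offset=-5
echo "resulting nice value: $(( base + offset ))"
```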

5. Select another long process and set the nice value to 30.

# renice -n 10 <PID of another selected process>

What effect did that have on that process? ___________________________________


Answer: This really turns the process into a loser! Its priority drops to 251-252, preventing it from getting much CPU. If you select the process and look in the first column of the Process Resource page, you will see that it is still being dispatched, but not very often. The process gets roughly 2% of the CPU, but not much more. Each of the other processes takes up the excess, with the majority going to the process with the nice value of 15.

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

# kill $(ps -el | grep long | cut -c18-22)


5-24. LAB: CPU Utilization, System Calls, and Context Switches

Directions

General Setup

Create a working data file in a separate file system (on a separate disk, if possible). If another disk is available:

# vgdisplay -v | grep Name            (note which disks are already in use by LVM)
# ioscan -fnC disk                    (note any disks not mentioned above; select one)
# pvcreate -f <raw disk device file>
# vgextend vg00 <block disk device file>

In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs <block disk device file>
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file <75% of main memory in bytes>
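The prealloc size argument is given in bytes. A quick sketch for computing 75% of main memory; the memory value here is an assumed example, so substitute your system's physical memory size:

```shell
# Compute 75% of main memory in bytes for the prealloc argument.
mem_mb=640                                # physical memory in MB (example value)
echo $(( mem_mb * 1024 * 1024 * 3 / 4 ))
```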

The lab programs are under /home/h4262/cpu/lab0

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise, results are unpredictable. If the executables are missing, generate them by typing:

# make all

CPU Utilization: System Call Overhead

Use the dd command to size the read and write operations; varying the block size changes the number of system calls used to transfer the same amount of data, exposing the overhead of the system-call interface. The first command loads the entire file into the buffer cache:

# timex dd if=/stand/vmunix of=/dev/null bs=64k

Now take the measurements:

# timex dd if=/stand/vmunix of=/dev/null bs=64k

real __________ user __________ system ____________


# timex dd if=/stand/vmunix of=/dev/null bs=2k

real __________ user __________ system ____________

# timex dd if=/stand/vmunix of=/dev/null bs=64

real __________ user __________ system ____________

Answer:

Results for an rp2430:

# timex dd if=/stand/vmunix of=/dev/null bs=64k
282+1 records in
282+1 records out

real       0.04
user       0.00
sys        0.03

# timex dd if=/stand/vmunix of=/dev/null bs=2k
9055+1 records in
9055+1 records out

real       0.15
user       0.02
sys        0.12

# timex dd if=/stand/vmunix of=/dev/null bs=64
289765+1 records in
289765+1 records out

real       3.82
user       0.56
sys        2.95

Results for an rx2600:

# timex dd if=/stand/vmunix of=/dev/null bs=64k
728+1 records in
728+1 records out

real       0.03
user       0.00
sys        0.03

# timex dd if=/stand/vmunix of=/dev/null bs=2k
23299+1 records in
23299+1 records out

real       0.18
user       0.02


sys        0.13

# timex dd if=/stand/vmunix of=/dev/null bs=64
745575+1 records in
745575+1 records out

real       4.57
user       0.54
sys        3.39

Notice that the last case is much slower due to the number of system calls being made. The block size is a factor of 1000 smaller than in the first case, causing 1000 times more calls to the read() and write() system calls. Run sar -c 2 10 in another window while the test is running to see this effect. None of these effects have anything to do with physical disk I/O, as the whole vmunix file is coming from the buffer cache; prove this to yourself with sar -b 2 10 while the test is running, and notice the 100% read cache hit rate.
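The record counts dd reports make this scaling easy to verify on any system; a small sketch on a 1 MB scratch file (the scratch-file approach is illustrative, not part of the lab):

```shell
# The record counts dd reports scale inversely with block size, which is
# why bs=64 makes roughly 1000x more read()/write() calls than bs=64k.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=1024 2>/dev/null    # 1 MB scratch file
big=$(dd if="$f" of=/dev/null bs=64k 2>&1 | awk -F'+' 'NR==1 {print $1}')
small=$(dd if="$f" of=/dev/null bs=64 2>&1 | awk -F'+' 'NR==1 {print $1}')
echo "bs=64k records: $big   bs=64 records: $small"
rm -f "$f"
```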

System Calls and Context Switches

This lab shows you the maximum system call and context switch rates that your system can sustain. Three programs are supplied:

• syscall loads the system with system calls of one type
• filestress (a shell script) generates file system-related system calls
• cs loads the system with context switches

1. What is the system call rate when your system is "idle"? ________________

Answer: Around 400-500 on our test systems.

# sar -c 2 2    (rp2430)
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:18:56  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:18:58      602        3        1    0.00    0.00   203272     8151
11:19:00      264        4        1    0.00    0.00     4096      512
Average       434        3        1    0.00    0.00   103741     4341

# sar -c 2 2    (rx2600)
HP-UX r265c145 B.11.23 U ia64    04/06/04

10:57:02  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
10:57:04      719        3        1    0.00    0.00   260840        0
10:57:06      434        3        1    0.00    0.00     4096     4096


Average       577        3        1    0.00    0.00   132668     2043

2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e.:

# sar -c 10 4

Answer: Around 20000-30000 on our test systems.

# sar -c 10 4    (rp2430)
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:19:43  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:19:53    17423     3112     1158  130.07  130.07  29710218   147104
11:20:03    12420     3577     2627   63.40   63.40  32159540     8192
11:20:13    23240     4227     1337  192.60  192.60  39581900    17818
11:20:23    26279     3884      700  212.10  212.00  40309248   134963
Average     19840     3700     1456  149.54  149.51  35438766    77037

# sar -c 10 4    (rx2600)
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:02:40  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:02:50    39624     4530     1619  290.51  290.51  92426384    77746
11:03:00    28069     5618     3883  171.70  171.60  69435392    80282
11:03:10    27178     5214     3320  189.40  189.40  67771592    62259
11:03:20    31592     5057     2814  222.70  222.60  72799840    91750
Average     31618     5105     2909  218.60  218.55  75612445    78009

• What system calls are generated by filestress?

Answer: read() and write(). Note also the high fork/s and exec/s rates in the sar output above: filestress is a shell script that spawns short-lived processes.
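Since the lab's filestress script itself is not shown here, the following is a crude, hypothetical analogue: a bounded loop of short-lived external commands that drives up the fork/exec and read/write rates sar -c reports.

```shell
# Hypothetical filestress analogue: each loop iteration is a fork() plus
# exec() of a short-lived command; watch the rates with sar -c elsewhere.
i=0
while [ "$i" -lt 50 ]; do
    date > /dev/null          # one fork() + exec() per iteration
    i=$(( i + 1 ))
done
echo "spawned $i short-lived commands"
```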

3. Terminate the filestress process by entering the following commands:

# kill $(ps -el | grep find | cut -c24-28)
# kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?

Answer: The system call rate is higher than with filestress. Non-blocking system calls produce rates up to 138,000 per second on an rp2430 and up to 290,000 on an rx2600.

# sar -c 10 4 (rp2430)


HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:36:11  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:36:21   137619        2        0    0.00    0.00    42863     3376
11:36:31   136788        2        0    0.00    0.00     4506     1946
11:36:41   137887        2        0    0.00    0.00     5734     3277
11:36:51   138224        2        0    0.00    0.00     3686     1229
Average    137629        2        0    0.00    0.00    14171     2457

# sar -c 10 4    (rx2600)
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:15:51  scall/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:16:01   287322       27        1    0.50    0.40    60560     4092
11:16:11   288439        7        1    0.00    0.00   233472    20480
11:16:21   289239        9        1    0.00    0.00    27853     4096
11:16:31   290331        4        0    0.00    0.00    14746     3277
Average    288832       12        1    0.12    0.10    84104     7985

The syscall program uses the open() and close() system calls and does no I/O as such. These calls do not block, so the process turns into a CPU hog, blocking only on Priority in the glance Wait States page. Kill the syscall program before proceeding:

# kill $(ps -el | grep syscall | cut -c18-22)
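The open()/close() pattern described above can be mimicked in plain shell; a bounded sketch (each ': < file' redirection is an open() plus a close() with no data transfer, whereas the real program simply loops far longer):

```shell
# Bounded sketch of the syscall program's pattern: repeated open()/close()
# with no data transfer and no blocking.
n=0
while [ "$n" -lt 10000 ]; do
    : < /dev/null             # open() and close() /dev/null, nothing else
    n=$(( n + 1 ))
done
echo "performed $n open/close pairs"
```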

5. Using cs, compare the number of context switches on an idle system and a loaded system.

Idle ________ Loaded ______________

Answer:

# sar -w 2 2    (rp2430)
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:39:27  swpin/s bswin/s swpot/s bswot/s pswch/s
11:39:29     0.00     0.0    0.00     0.0      86
11:39:31     0.00     0.0    0.00     0.0      83
Average      0.00     0.0    0.00     0.0      85

# ./cs &
# sar -w 2 2
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:41:43  swpin/s bswin/s swpot/s bswot/s pswch/s
11:41:45     0.00     0.0    0.00     0.0   47733


11:41:47     0.00     0.0    0.00     0.0   47471
Average      0.00     0.0    0.00     0.0   47602

# sar -w 2 2    (rx2600)
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:07  swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:09     0.00     0.0    0.00     0.0     150
11:22:11     0.00     0.0    0.00     0.0     177
Average      0.00     0.0    0.00     0.0     164

# ./cs &
# sar -w 2 2
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:57  swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:59     0.00     0.0    0.00     0.0   81912
11:23:01     0.00     0.0    0.00     0.0   82728
Average      0.00     0.0    0.00     0.0   82319

Notice that we go from an idle context switch rate (pswch/s) of approximately 100 per second up to 47000 or 82000! Additionally, you can look at the glance CPU Report (c) and note how much CPU time is spent doing context switching (about 15%).

6. Kill the cs program, remove the /vxfs/file, and dismount the /vxfs filesystem.

# kill $(ps -el | grep cs | cut -c18-22)
# rm -f /vxfs/file
# umount /vxfs


5–25. LAB: Identifying CPU Bottlenecks

Directions

The following labs are designed to show symptoms of a CPU bottleneck.

Lab 1

1. Change directory to /home/h4262/cpu/lab1

# cd /home/h4262/cpu/lab1

2. Start the processes running in the background.

# ./RUN

3. Start a glance session and answer the following questions.

What is the CPU utilization? _______

Answer: At or near 100%.

What are the nice values of the processes receiving the most CPU time? _______

Answer: 10.

What is the average number of jobs in the CPU run queue? ______

Answer: Varies with configuration; should be approximately 3-5.

# uptime
12:05pm up 4 days, 19:38, 7 users, load average: 4.73, 3.31, 2.26

4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU hogs? Memory hogs? Disk I/O hogs etc. Identify processes that you think are in pairs.

Glance global (g) page output (rp2430):

PROCESS LIST                               Users=    1
                                    User      CPU Util       Cum    Disk            Thd
Process Name     PID  PPID Pri Name      ( 100% max)         CPU  IO Rate     RSS   Cnt
--------------------------------------------------------------------------------
proc8          27425     1 215 root      50.1/49.4         138.6  0.0/ 0.0   168kb    1
proc3          27420     1 221 root      48.4/49.2         138.0  0.0/ 0.0   168kb    1
prm3d           1462     1 168 root       0.0/ 0.2        1125.1  0.0/ 0.0  26.6mb   19
proc5          27422     1 168 root       0.0/ 0.2           0.5  4.0/ 4.0   168kb    1


proc2          27419     1 168 root       0.0/ 0.2           0.5  3.8/ 3.9   168kb    1

Glance global (g) page output (rx2600):

PROCESS LIST                               Users=    1
                                    User      CPU Util       Cum    Disk            Thd
Process Name     PID  PPID Pri Name      ( 100% max)         CPU  IO Rate     RSS   Cnt
--------------------------------------------------------------------------------
proc3          26194     1 219 root      50.8/49.3          81.5  0.0/ 0.0   268kb    1
proc8          26199     1 216 root      48.5/49.4          81.6  0.0/ 0.0   268kb    1
scopeux         2105     1 127 root       0.0/ 0.0          13.3  0.0/ 0.0  20.7mb    1
prm3d           2139     1 168 root       0.0/ 0.1          77.5  0.0/ 0.0  49.5mb   19
ia64_corehw     2989     1 154 root       0.0/ 0.1          65.9  1.1/ 0.0   1.8mb    1
proc2          26193     1 168 root       0.0/ 0.1           0.2  7.6/ 7.7   256kb    1
proc5          26196     1 168 root       0.0/ 0.1           0.2  7.6/ 5.8   256kb    1

proc3 and proc8 are the main CPU hogs. They have been run with nice values of 10! The pair accounts for almost 100% of the CPU between them. With the same CPU rates and RSS (Resident Set Size), it is likely that these are identical programs. Selecting one of these processes in glance reveals no disk I/O and a context-switch profile that is always forced.

proc5 and proc2 also manage to execute, with 0.2% CPU utilization each. Again, these look like a pair. If you select one of them and look at the Process Resource page, you can see a small amount of write disk I/O, most of which is logical. The main Wait Reason for these processes is SLEEP. It would appear that they do a small amount of disk I/O and then call sleep() to pause intentionally for some time.

proc1 and proc7 are a pair. On selecting one of them, we see a nice value of 39! These processes find it nearly impossible to get CPU while the aggressive pair of proc3 and proc8 take all the CPU resource. If you watch the Dispatches metric on the Process Resource page, they can be seen to get one or two slices of CPU very infrequently. You should also see that for every Dispatch (these are rare), there is always an accompanying Forced CSwitch. You can conclude that these processes would be CPU hogs if they were not so crippled by their own high nice values and the aggression of proc3 and proc8.

proc4 and proc6 are the last pair. They have standard nice values of 20 and seem to do nothing but call the sleep() system call. They are dispatched slightly more frequently than proc1 and proc7, and they are always subject to a Voluntary CSwitch. These processes are not CPU hogs, and they do no disk I/O of any kind.

None of the above processes had any significant memory size.

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______

Answer:

# timex /home/h4262/baseline/short &    (rp2430)
The last prime number is : 49999


real      56.44
user      10.66
sys        0.01

# timex /home/h4262/baseline/short &    (rx2600)
The last prime number is : 99991

real    1:02.38
user       8.48
sys        0.00

6. Compare your results to the baseline established in the lab exercise in module 1, step 7.

Answer: Total execution time is over 5 times slower!

7. End the CPU load by executing the KILLIT script.

# ./KILLIT


Lab 2

1. Change directory to /home/h4262/cpu/lab2.

# cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

# ./RUN

3. In one terminal window, start glance.

In a second terminal window run:

# sar -u 5 200

Answer the following questions:

What does glance report for CPU utilization? _______

Answer: Should be greater than 50% (the more, the merrier!). Output of the rp2430 glance (g) page is below.

PROCESS LIST                               Users=    1
                                    User      CPU Util       Cum    Disk            Thd
Process Name     PID  PPID Pri Name      ( 100% max)         CPU  IO Rate     RSS   Cnt
--------------------------------------------------------------------------------
proc2          27761     1   1 root      92.0/92.3         723.2  0.0/ 0.0   168kb    1
prm3d           1462     1 168 root       0.0/ 0.2        1137.2  0.0/ 0.0  26.6mb   19

Output of rp2430 glance (a) page below

CPU BY PROCESSOR                           Users=    1
CPU  State   Util   LoadAvg(1/5/15 min)  CSwitch  Last Pid
--------------------------------------------------------------------------------
 0   Enable  93.2    0.5/ 0.6/ 1.7          1724     27761

Output of rx2600 glance (g) page below

PROCESS LIST                               Users=    1
                                    User      CPU Util       Cum    Disk            Thd
Process Name     PID  PPID Pri Name      ( 100% max)         CPU  IO Rate     RSS   Cnt
--------------------------------------------------------------------------------
proc2          26469     1   1 root      71.5/71.6          47.1  0.0/ 0.0   288kb    1
prm3d           1462     1 168 root       0.0/ 0.2        1137.2  0.0/ 0.0  26.6mb   19

Output of rx2600 glance (a) page below

CPU BY PROCESSOR                           Users=    1
CPU  State   Util   LoadAvg(1/5/15 min)  CSwitch  Last Pid
--------------------------------------------------------------------------------
 0   Enable  73.1    0.0/ 0.2/ 0.8          1432     26469

What does sar report for CPU utilization? ________


Answer: sar reports the CPU is mostly idle; utilization is less than 10%.

# sar -u 5 200
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

13:45:58    %usr    %sys    %wio   %idle
13:46:03       4       2       0      94
13:46:08       0       1       0      99
13:46:13       1       1       0      98
13:46:18       0       0       0     100
13:46:23       1       1       0      98

This is very strange; the tools totally disagree with each other. sar is reporting over 90% idle while glance reports over 80% busy! They cannot both be right. Which one do you trust? The output of top is also confused: it sees the busy process but still reports 90% idle.

Load averages: 0.50, 0.56, 1.41    (rp2430)
112 processes: 99 sleeping, 13 running
Cpu states:
LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
0.50   0.6%   0.0%   2.2%  97.2%   0.0%   0.0%   0.0%   0.0%

Memory: 91236K (64076K) real, 365020K (299140K) virtual, 30120K free  Page# 1/8

TTY     PID USERNAME PRI NI  SIZE   RES STATE    TIME %WCPU  %CPU COMMAND
pts/tb 27761 root      1 20 1664K  148K sleep   14:49 92.56 92.40 proc2

Load averages: 0.03, 0.12, 0.68    (rx2600)
128 processes: 107 sleeping, 20 running, 1 zombie
Cpu states:
LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
0.03   0.2%   0.0%   0.0%  99.8%   0.0%   0.0%   0.0%   0.0%

Memory: 197664K (154768K) real, 608492K (523032K) virtual, 23516K free  Page# 1/10

TTY     PID USERNAME PRI NI  SIZE   RES STATE    TIME %WCPU  %CPU COMMAND
tty1p0 26469 root      1 20 3304K  252K sleep    4:08 71.77 71.64 proc2

What is the priority of the process receiving the most CPU time? _______

Answer: The proc2 process is the culprit, running at the high UNIX real-time priority of 1.

How much time is the process spending in the sigpause system call? ______

Answer: Now this is where the clues start!


The Wait States for proc2 show that it is blocked on SLEEP when it is not running. This wait state is the result of the process putting itself to sleep. To see the system calls the process is making, press the F6 softkey or the L key once you have selected the process. glance will collect the data and present it after about 10-20 seconds.

rp2430:

System Calls PID: 27761, proc2 PPID: 1 euid: 0 User: root
                                        Elapsed                   Elapsed
System Call Name     ID   Count   Rate     Time   Cum Ct  CumRate  CumTime
--------------------------------------------------------------------------------
sigpause            111     449   99.7  0.35218     1497     74.1  1.17095
sigcleanup          139     450  100.0  0.00166     1500     74.2  0.00553

rx2600:

System Calls PID: 26469, proc2 PPID: 1 euid: 0 User: root
                                        Elapsed                   Elapsed
System Call Name     ID   Count   Rate     Time   Cum Ct  CumRate  CumTime
--------------------------------------------------------------------------------
sigpause            111     525  100.9  1.49255     1500     74.2  4.26847
sigcleanup          139     525  100.9  0.00143     1500     74.2  0.00408

The sigpause() call is causing the sleep blocks that we see in the Wait States page. The interesting thing is that the program calls sigpause() at a steady 100 times per second, i.e. 10 ms (milliseconds) between calls. How can a program be so coordinated with the wall clock, and what is it using to achieve this synchronization? Can you tell what it is yet?

How is the process being context switched (forced or voluntary)? ______

Answer: Review the Resource Summary page again for proc2 and you will see that all the context switches are Voluntary. This is not the expected case for a CPU hog. How can a process use so much CPU yet never be seen by the scheduler and thrown off the CPU?

The Bottom Line

If you examine the code of the lab, you will see that the process arms a trap waiting for the system hardware clock (the tick) to pop. When this occurs, the program wakes up and wastes CPU for an amount of time that your instructor has tuned to be just under 10 ms (see waste.c). The program then arms the trap again and voluntarily goes to sleep waiting for the next hardware tick. Remember, the UNIX scheduler samples system activity on the hardware tick intervals, and our program has done a good job of never being on the CPU at those times! It's a free lunch.


The standard UNIX tools (sar and top, for example) feed on the scheduler's internal statistics for their measurement data, and so they get the wrong story. glance, however, uses the midaemon, which recalculates performance statistics every time a process returns from a system call; and you cannot play this game without system calls.

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______

Answer:

# timex /home/h4262/baseline/short &    (rp2430)
The last prime number is : 49999

real    2:32.86
user      10.88
sys        0.07

# timex /home/h4262/baseline/short &    (rx2600)
The last prime number is : 99991

real      30.86
user       8.51
sys        0.01

Our old benchmark figure was around 10 seconds (real), so this is significantly slower. This program is running in the gaps that the proc2 process leaves. You could further modify waste.c to use more of the tick period.

5. End the CPU load by executing the KILLIT script.

# ./KILLIT


6–18. LAB: Memory Leaks

There are several performance issues related to memory management: memory leaks, swapping/paging, protection ID thrashing, and so on. Let's investigate a few of them.

1. Change directories to /home/h4262/memory/leak:

# cd /home/h4262/memory/leak

Memory leaks occur when a process requests memory (typically through the malloc() or shmget() calls) but doesn't free the memory once it finishes using it. The five processes in this directory all have memory leaks to different degrees. The following solution data came from an rp2430 server with 640MB of physical memory and 2GB of device swap, and an rx2600 server with 1012MB of physical memory and 2GB of device swap. The rp2430 was running HP-UX 11i v1 and the rx2600 was running 11i v2.

2. Before starting the background processes, look up the current value for maxdsiz using the kmtune command on 11i v1 and the kctune command on 11i v2. On the rp2430:

# kmtune -lq maxdsiz

Answer: Varies with configuration; probably 64MB if you are pre-11i and 256MB for 11i v1.

# kmtune -lq maxdsiz
Parameter:  maxdsiz
Current:    0x10000000
Pending:    0x10000000
Default:    0x10000000
Minimum:    -
Module:     -
Version:    -
Dynamic:    No
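The Current value kmtune reports is hexadecimal; shell arithmetic can do the conversion (a quick sketch):

```shell
# Convert the hexadecimal maxdsiz value reported by kmtune to bytes and MB.
val=$(( 0x10000000 ))
echo "$val bytes = $(( val / 1024 / 1024 )) MB"
```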

The number is in hex; converting to decimal gives 268435456 = 256MB.

On the rx2600:

# kctune -avq maxdsiz

Answer: Varies with configuration; probably 1GB for 11i v2.

# kctune -avq maxdsiz
Tunable             maxdsiz


Description         Maximum size of the data segment of a 32-bit process (bytes)
Module              vm
Current Value       1073741824 [Default]
Value at Next Boot  1073741824 [Default]
Value at Last Boot  1073741824
Default Value       1073741824
Constraints         maxdsiz >= 262144
                    maxdsiz <= 4294963200
Can Change          Immediately or at Next Boot

The number is in decimal: 1073741824 = 1GB. The default maxdsiz on 11i v2 is 1 GB, which will make proc1 very slow in reaching its limit. You can change maxdsiz to a more reasonable number for this lab exercise with:

# kctune maxdsiz=0x10000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
     ==> Do you wish to update it to contain the current configuration
         before making the requested change? n
NOTE:    The backup will not be updated.
       * The requested changes have been applied to the currently
         running system.
Tunable          Value       Expression  Changes
maxdsiz (before) 1073741824  Default     Immed
        (now)    0x10000000  0x10000000

Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have?

rp2430:

# vmstat 2 2
         procs           memory                   page                          faults       cpu
    r     b     w      avm    free   re   at    pi   po   fr  de   sr     in     sy    cs  us sy id
    3     0     0    75182   92519    3    0     0    0    0   0    0    104    408   138   1  0 99
    3     0     0    75182   92465    3    0     1    0    0   0    0    106    214    75   0  0 100

We have around 92000 free pages, which equates to 368MB.

rx2600:

# vmstat 2 2
         procs           memory                   page                          faults       cpu
    r     b     w      avm    free   re   at    pi   po   fr  de   sr     in     sy    cs  us sy id
    2     0     0   124095   97927  466  165   298    0    0   0    2   1134  47856   476  14 19 67
    2     0     0   124095   96427  137   26    69    0    0   0   26    536  21509   470   3  3 94

We have around 97000 free pages which equates to 388MB.
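vmstat reports free memory in 4 KB pages; a quick conversion sketch using the rp2430 figure above (note that the MB figures quoted in the text use 1 MB = 1000 KB, so binary megabytes come out slightly lower):

```shell
# Convert vmstat's free-page count (4 KB pages on HP-UX) to binary megabytes.
free_pages=92519            # 'free' column from the rp2430 vmstat output
echo "$(( free_pages * 4 / 1024 )) MB free"
```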

3. Use the RUN script to start the background processes:

# ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions — fairly quickly, before the memory leaks get too large.

Go for the m page of glance for the best info. You have to be quick off the mark after starting the leak programs!

MEMORY REPORT                              Users=    1
Event            Current  Cumulative  Current Rate  Cum Rate  High Rate
-------------------------------------------------------------------------------
Page Faults          588        1301         113.0     116.1      137.1
Page In                1          33           0.1       2.9        6.1
Page Out               0           0           0.0       0.0        0.0
KB Paged In          0kb        36kb           0.0       3.2        6.9
KB Paged Out         0kb         0kb           0.0       0.0        0.0
Reactivations          0           0           0.0       0.0        0.0
Deactivations          0           0           0.0       0.0        0.0
KB Deactivated       0kb         0kb           0.0       0.0        0.0
VM Reads               0           3           0.0       0.2        0.5
VM Writes              0           0           0.0       0.0        0.0

Total VM : 384.9mb   Sys Mem  : 182.3mb   User Mem: 96.9mb   Phys Mem: 640.0mb
Active VM: 342.1mb   Buf Cache:  32.4mb   Free Mem: 328.4mb

• What is the current amount of free memory?

Answer: Varies with configuration. Already this has dropped to 328.4MB.

• What is the size of the buffer cache?

Answer: Varies with configuration. In our case it is 32.4MB.

• Is there any paging to the swap space?

Answer: Varies with configuration. Not in the last sample; see KB Paged Out above.

• How much swap space is currently reserved?

Answer: Varies with configuration. Get this from swapinfo; again, you need to do this just after the programs start. In our case, around 249MB.

# swapinfo -tm


             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048       0    2048    0%       0       -   1  /dev/vg00/lvol2
reserve       -     379    -379
memory     1013     330     683   33%
total      3061     709    2352   23%       -       0   -

The total swap space "used" (used = really used + reserved) is the USED figure on the total line. More detail on swap management is in Module 7; for now, take that bottom-line figure above.

• Which process has the largest Resident Set Size (RSS)?

Answer: proc1. You can see that from the global process list in glance (the g key). As you watch it, it will grow until vhand kicks in and limits its RSS; however, the VSS will continue to grow. Select that process (with s) and observe the RSS/VSS figures.

PROCESS LIST                               Users=    1
                                    User      CPU Util       Cum    Disk            Thd
Process Name     PID  PPID Pri Name      ( 100% max)         CPU  IO Rate     RSS   Cnt
-------------------------------------------------------------------------------
proc1           3267     1 168 root       0.0/ 0.2           1.0  0.0/ 0.0 275.8mb    1
proc2           3268     1 168 root       0.0/ 0.1           0.4  0.0/ 0.0 114.6mb    1
proc3           3269     1 168 root       0.0/ 0.0           0.2  0.0/ 0.0  56.7mb    1
proc4           3270     1 168 root       0.0/ 0.0           0.1  0.0/ 0.0  27.7mb    1
alarmgen        3277  3276 168 root       0.0/ 0.0           0.1  1.3/ 0.1   1.6mb    6
vhand              2     0 128 root       0.4/ 0.2           2.0 81.7/44.2    64kb    1

Resources PID: 3267, proc1 PPID: 1 euid: 0 User: root
-------------------------------------------------------------------------------
CPU Usage (util):  0.0   Log Reads :    0   Wait Reason    :  SLEEP
User/Nice/RT CPU:  0.0   Log Writes:    0   Total RSS/VSS  : 275.7mb/479.1mb
System CPU      :  0.0   Phy Reads :    0   Traps / Vfaults:     0/   542
Interrupt CPU   :  0.0   Phy Writes:    0   Faults Mem/Disk:     0/     0
Cont Switch CPU :  0.0   FS Reads  :    0   Deactivations  :     0
Scheduler       : HPUX   FS Writes :    0   Forks & Vforks :     0
Priority        :  168   VM Reads  :    0   Signals Recd   :     0
Nice Value      :   20   VM Writes :    0   Mesg Sent/Recd :     0/     0
Dispatches      :    5   Sys Reads :    0   Other Log Rd/Wt:     0/     0
Forced CSwitch  :    0   Sys Writes:    0   Other Phy Rd/Wt:     0/     0
VoluntaryCSwitch:    5   Raw Reads :    0   Proc Start Time
Running CPU     :    0   Raw Writes:    0   Tue Apr  6 14:29:16 2004
CPU Switches    :    0   Bytes Xfer:  0kb

• What is the data segment size of the process with the largest RSS?

Answer: Select the Memory Regions page for proc1 with the M key.

Memory Regions PID: 3267, proc1 PPID: 1 euid: 0 User: root

Type           RefCt      RSS      VSS  Locked  File Name
-------------------------------------------------------------------------------
NULLDR/Shared     87      4kb      4kb     0kb  <nulldref>
TEXT  /Shared      2      4kb      4kb     0kb  /home/.../leak/proc1
DATA  /Priv        1  301.0mb  716.2mb     0kb  /home/.../leak/proc1
MEMMAP/Priv        1      0kb     16kb     0kb  /usr/lib/tztab


MEMMAP/Priv        1      4kb      4kb     0kb  <mmap>
MEMMAP/Priv        1      4kb      8kb     0kb  <mmap>
MEMMAP/Priv        1      0kb      8kb     0kb  <mmap>
MEMMAP/Priv        1     24kb     28kb     0kb  /usr/lib/hpux32/libc.so.
MEMMAP/Priv        1     40kb     40kb     0kb  <mmap>

Text RSS/VSS :   4kb/  4kb   Data RSS/VSS : 301mb/716mb   Stack RSS/VSS: 4kb/ 8kb
Shmem RSS/VSS:   0kb/  0kb   Other RSS/VSS: 1.6mb/3.2mb

The data segment size in this example is 301 MB RSS / 716 MB VSS, and growing!

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while. Please be patient. Observe the behavior of the system when this occurs.

• What happens when the process reaches its maximum data size?

Answer: This is going to take several minutes. The maxdsiz limit is probably either 256 MB or 1 GB on the test system. Be careful: maxdsiz is a limit on the VSS (Virtual Set Size), not the RSS (Resident Set Size). The system starts doing a LOT of disk I/O. Look for the large "F" bar in the Disc Util global meter.
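Outside of glance, the growth of a leaking process can be watched by polling ps. This is a minimal sketch, not part of the lab: the PID and loop count are placeholders, and on HP-UX the XPG4 -o options require UNIX95 to be set in the environment.

```shell
# Sketch: print VSS and RSS (in KB) of one process a few times.
# PID is a placeholder -- substitute the real proc1 PID from ps -ef.
# On HP-UX, invoke as: UNIX95=1 ps -o vsz= -o rss= -p <pid>
PID=$$                     # our own shell here, just so the sketch runs
n=0
while [ "$n" -lt 3 ]; do
    ps -o vsz= -o rss= -p "$PID"   # two columns: VSS KB, RSS KB
    n=$((n + 1))
    sleep 1
done
```

Watching the VSS column climb toward the maxdsiz value shows how close the process is to being aborted.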

• Why does disk utilization become so high at this point?

Answer: The kernel is dumping the core file of the user process in our case. You will probably run out of disk space in the /home file system; you may want to remove the /home/h4262/memory/leak/core file! Remember, it is not the process that is doing the disk I/O; it is the kernel, producing the core file.

6. As the other processes grow towards their maximum data segment size, continue to monitor the following:

• Free memory

# vmstat 2 2
     procs        memory             page                       faults       cpu
 r  b  w     avm    free  re  at   pi   po  fr  de   sr   in    sy    cs  us sy id
 2  0  0  321403   91118  54  19   79  285  16   0  359  548  4962   326   2  3 95
 2  0  0  321403   90413   1   0  115   12   0   0    0  397   552   191   0  0 100

Not a lot of free memory now. The system is under memory pressure and is paging out to stabilize the memory system.

• Swap space reserved

# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048     715    1333   35%       0       -   1  /dev/vg00/lvol2

reserve       -     341    -341
memory     1013     340     673   34%
total      3061    1396    1665   46%       -       0   -

Swap space is up to 46% utilization!

• The size of the processes' data segments

All the proc(n) processes continue to grow (see VSS), just as proc1 did, and they are aborted in the same way when they cross the maxdsiz limit.

• The RSS of the processes

The running memory hog processes compete for the limited real memory resource. We did not have much free memory at the start of the test, and the lab processes all want to grow to the maxdsiz limit. They cannot all fit together, so they fight. This is a classic memory thrash situation.

• The number of page-outs/page-ins to the swap space

This depends on when you look! These figures were taken while proc2 was still on the move and free memory was approaching its minimum.

# vmstat 2 10
     procs        memory             page                       faults       cpu
 r  b  w     avm    free  re  at   pi   po   fr  de    sr   in   sy    cs  us sy id
 2  0  0  166464    2692   0   0    0    0    0   0     0  103  173    82   0  0 100
 2  1  0  170444    1649   0   0   13    0    0   0     0  123  209    92   0  0 100
 2  1  0  170444    1028   0   0    8    5    4   0  1256  122  189    88   0  6  94
 2  1  0  170444    1146   8   0    6  101  109   0  9869  225  176   129   0  5  95
 2  1  0  170444    1392  12   0    5  263   69   0  9659  316  175   112   0  0 100
 2  1  0  170444    1366  12   0    5  312   44   0  8186  331  190   156   0  0 100
 1  0  0  169455    1090   9   0    5  304   28   0  6410  316  209   201   0  0 100
 1  0  0  169455    1112   6   0    3  351   31   0  5334  359  193   163   0  1  99
 1  0  0  169455    1048   3   0    2  332   19   0  3902  339  180   133   5  0  95
 1  0  0  169455    1600   5   0    0  396   12   0  2576  370  240   119   0  4  96

7. Run the two baseline programs, short and diskread.

# timex /home/h4262/baseline/short
# timex /home/h4262/baseline/diskread

rp2430:

# timex /home/h4262/baseline/short
The last prime number is : 49999

real        12.00
user        10.86
sys          0.02

# timex /home/h4262/baseline/diskread
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real        31.79
user         0.02
sys          0.53

rx2600:

# timex /home/h4262/baseline/short &
# The last prime number is : 99991

real         8.54
user         8.48
sys          0.00

# timex /home/h4262/baseline/diskread &
[1] 3841
root@r265c145:/home/h4262/memory/leak #
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real        29.60
user         0.01
sys          0.16

How does the performance of these programs compare to their earlier runs?

Answer: short takes a little longer. The CPU is not under much pressure at this time, so compute-bound processes are not much affected (unless they need memory!). It is a different story for diskread: in the first test case it took noticeably longer because of the disk load already in progress from the paging activity. It is not good to have swap space on your application disks!

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes.

• Exit glance.

• Execute the KILLIT script:

# ./KILLIT

• If you changed maxdsiz, change it back:

# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
    ==>  Do you wish to update it to contain the current configuration
         before making the requested change? n
NOTE:    The backup will not be updated.
       * The requested changes have been applied to the currently
         running system.
Tunable            Value       Expression  Changes
maxdsiz  (before)  0x10000000  0x10000000  Immed
         (now)     0x40000000  0x40000000

7–15. LAB: Monitoring Swap Space

Preliminary Steps

A portion of this lab requires you to interact with the ISL and boot menus, which can only be accomplished via a console login. If you are using remote lab equipment, access your system's console interface via the GSP/MP. You may get some "file system full" messages while you are shutting down the system. You can ignore these messages.

Directions

The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo -m command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

             MB Available    MB Used
   dev            512            0
   reserve          -          139
   memory          451           27

Answer: Varies with configuration; examples below.

# swapinfo -m    (rp2430)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     139    -139
memory      451      27     424    6%

# swapinfo -m    (rx2600)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      75    1973    4%       0       -   1  /dev/vg00/lvol2
reserve       -     189    -189
memory     1013     339     674   33%

2. To see total swap space “available” and total swap space “reserved”, enter:

# swapinfo -mt

What is the total swap space "available" (including pseudo swap)?

Answer: Varies with configuration; in our case it is 963 MB (rp2430) or 3 GB (rx2600), as seen in the totals below.

# swapinfo -tm    (rp2430)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     139    -139
memory      451      27     424    6%
total       963     166     797   17%       -       0   -

# swapinfo -mt    (rx2600)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      74    1974    4%       0       -   1  /dev/vg00/lvol2
reserve       -     190    -190
memory     1013     339     674   33%
total      3061     603    2458   20%       -       0   -

What is the total space "reserved"?

Answer: Varies with configuration. Swap space is first reserved, and then it may (or may not) actually be used by the process that reserved it. The bottom line is that reserved swap space is no more available than used swap space, so the only figures that really matter here are the totals (166 MB and 603 MB). This space is unavailable to any other process.
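The committed total can also be derived from a captured swapinfo report with a little awk. A sketch, not part of the lab, using the rp2430 sample figures above; the field positions assume the swapinfo -m layout shown in this exercise:

```shell
# Sum the USED column (field 3) of the dev, reserve and memory lines
# to get total committed swap -- the space no new process can claim.
awk '$1 == "dev" || $1 == "reserve" || $1 == "memory" { mb += $3 }
     END { print mb " MB committed" }' <<'EOF'
dev        512    0   512   0%   0  -  1  /dev/vg00/lvol2
reserve      -  139  -139
memory     451   27   424   6%
EOF
# prints: 166 MB committed
```

The 166 MB result matches the rp2430 total in the output above.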

3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case the difference is going to be pretty small, so do not use the -m option.

Upon verification, exit the shell. Is the swap space returned upon exiting the shell process?

Answer: It should, and it does. But you have to be careful when you look.

It is easy for some other activity on the system to "spoil" the results. You may want to try it two or three times to see whether your results change. What SHOULD happen is that the "reserve USED" entries increase and then decrease by exactly the same amount.

rp2430:

# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144444 -144444
memory   462248   28384  433864    6%
# sh
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144768 -144768

memory   462248   28384  433864    6%
# exit
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144444 -144444
memory   462248   28388  433860    6%

rx2600:

# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152   75652 2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -  194900 -194900
memory  1037064  346740  690324   33%
# sh
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152   75652 2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -  195540 -195540
memory  1037064  346740  690324   33%
# exit
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152   75652 2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -  194900 -194900
memory  1037064  346740  690324   33%

If you see that some swap was reserved and not released, then there is something else going on in the background that is skewing the figures.

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Util. percentage increases in glance. Type:

# /home/h4262/memory/paging/mem256 &

This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete by observing the incremental increases in Current Swap Util. in glance. The system will get slower and slower as you start more mem256 processes.

What was the maximum number of mem256 processes that could be started?

Answer: Varies with configuration; it depends on your swap space.

On the rp2430, after 12 copies of mem256 the test system swap space was almost gone. Below is what happened when the 13th process was introduced.

# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512     461      51   90%       0       -   1  /dev/vg00/lvol2
reserve       -      51     -51
memory      451     399      52   88%
total       963     911      52   95%       -       0   -
# /home/h4262/memory/paging/mem256&
[13] 2864
# exec(2): insufficient swap or memory available.
[13] + Done(9)  /home/h4262/memory/paging/mem256&

On the rx2600, after 37 copies of mem256 the test system swap space was almost gone. Below is what happened when the 38th process was introduced.

# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048    1978      70   97%       0       -   1  /dev/vg00/lvol2
reserve       -      70     -70
memory     1013     991      22   98%
total      3061    3039      22   99%       -       0   -
# ./mem256&
[38] 4159
exec(2): insufficient swap or memory available.

What prevented an additional mem256 process from being started?

Answer: "Insufficient swap or memory available."

Kill all mem256 processes to restore performance.

5. Recompile the kernel, disabling pseudo-swap. Use the following procedure:

11i v1 and earlier:

# cd /stand/build
# /usr/lbin/sysadm/system_prep -s system
# echo "swapmem_on 0" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0

11i v2 and later:

# cd /
# kctune swapmem_on=0
NOTE:  The configuration being loaded contains the following change(s)
       that cannot be applied immediately and which will be held for
       the next boot:

       -- The tunable swapmem_on cannot be changed in a dynamic fashion.
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
    ==>  Do you wish to update it to contain the current configuration
         before making the requested change? no
NOTE:    The backup will not be updated.
       * The requested changes have been saved, and will take effect at
         next boot.
Tunable                  Value  Expression
swapmem_on  (now)            1  Default
            (next boot)      0  0
# shutdown -ry 0

6. Reboot from the new kernel.

rp2430:

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

rx2600:

(Nothing special needs to be done.)

7. Once the system reboots, login and execute swapinfo.

Is there a memory entry? Why or why not?

Answer: No. Pseudo-swap has been disabled.

Will the same number of mem256 processes be able to execute as earlier?

Answer: No.

How many mem256 processes can be started now?

Answer: Varies with configuration. On the rp2430, only 6 processes could be started successfully. On the rx2600, only 27.

Kill all mem256 processes to restore performance.

8. If you have a two-disk system, add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.

If you did not add the second disk earlier:

# vgdisplay -v | grep Name     (note the physical disks used by vg00)
# ioscan -fnC disk             (note which disk is unused by LVM)
# pvcreate -f <raw_dev_file_of_second_disk>
# vgextend /dev/vg00 <block_dev_file_of_second_disk>

To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 <dev_file_of_second_disk>

Note: In our case the primary swap was 512 MB. Check swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap. Check your work.

# swapon -p 1 /dev/vg00/swap1

Answer:

# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     130    -130
total       512     130     382   25%       -       0   -

# swapon -p 1 /dev/vg00/swap1
swapon: Device /dev/vg00/swap1 contains a file system.  Use -e to page
        after the end of the file system, or -f to overwrite the file
        system with paging.

Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:

# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be
        increased to add paging on device /dev/vg00/swap1.

Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space, and it must be increased followed by a reboot. If you have this problem, use sam to double maxswapchunks. In 11i v2, maxswapchunks has been obsoleted and will not have to be modified.

Recompile the kernel (if necessary) to increase maxswapchunks. Use the following procedure:

11i v1 and earlier (ONLY!):

# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0
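The maxswapchunks value needed for a given amount of device swap can be estimated with shell arithmetic. A sketch, not part of the lab; it assumes the default swchunk of 2048 blocks of 1 KB (2 MB per chunk), so verify swchunk on the real system before relying on it:

```shell
# Each swap chunk covers swchunk * 1 KB; with the default swchunk of
# 2048 that is 2 MB per chunk (an assumption -- check your system).
total_swap_mb=1024          # primary 512 MB lvol2 + new 512 MB swap1
chunk_mb=2
echo "maxswapchunks >= $((total_swap_mb / chunk_mb))"
# prints: maxswapchunks >= 512
```

This matches the value of 512 written into the system file above.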

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

11i v1 and earlier (ONLY!):

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

And now add the new swap device:

# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:

# swapinfo -mt    (rp2430)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
dev         512       0     512    0%       0       -   1  /dev/vg00/swap1
reserve       -     141    -141
total      1024     141     883   14%       -       0   -

# swapinfo -tm    (rx2600)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      86    1962    4%       0       -   1  /dev/vg00/lvol2
dev        2048       0    2048    0%       0       -   1  /dev/vg00/swap1
reserve       -     158    -158
total      4096     244    3852    6%       -       0   -

Done!

11. Start enough mem256 processes to make the system start paging.

Answer: This depends on how much memory you have, but on an rp2430 with 640 MB, 8 processes got things paging nicely! On an rx2600, 10 should do nicely.

# vmstat 2 2
     procs        memory             page                       faults       cpu
 r  b  w     avm    free  re  at   pi   po   fr  de    sr   in

   sy    cs  us sy id
 9  0  0  180106    5064  34   0  192  340   99   0  3136  339   213   471 100  0  0
 9  0  0  180106    5056  23   0  122  217   63   0  2006  216   191   355 100  0  0

Note the system is paging constantly in the vmstat output and free memory is very low.

12. Measure the disk I/O to see what is happening with swap space. Go to question 15 when you have finished.

Answer: The I/O should be balanced across both disks!

# sar -d 5 2    (rp2430)
HP-UX r206c41 B.11.11 U 9000/800    03/18/04

14:22:12   device   %busy   avque  r+w/s  blks/s  avwait  avserv
14:22:17  c1t15d0   87.03   24.73    409   12222   33.45   13.86
          c3t15d0   60.68   23.21    406   12093   31.03    9.24
14:22:22  c1t15d0   82.60   22.01    395   12209   28.53   12.26
          c3t15d0   72.20   19.57    385   11976   25.00   10.57
Average   c1t15d0   84.82   23.39    402   12216   31.03   13.08
Average   c3t15d0   66.43   21.43    396   12034   28.10    9.89

# sar -d 5 2    (rx2600)
HP-UX r265c145 B.11.23 U ia64    04/07/04

11:28:10   device   %busy   avque  r+w/s  blks/s  avwait  avserv
11:28:15   c2t1d0    9.38    0.50     25     542    0.00    6.05
           c2t0d0    3.79    0.50     14     271    0.01    4.71
11:28:20   c2t1d0   21.40    6.75     79    2373    2.85    5.35
           c2t0d0    6.60   10.42     47    1229    3.86    3.94
Average    c2t1d0   15.38    5.25     52    1456    2.16    5.51
Average    c2t0d0    5.19    8.13     31     750    2.97    4.12

This has doubled the effective performance of swap space. The results would be even better if the swap disks were on different controllers.

13. If you have a single-disk system, create three additional swap devices, each 20 MB in size.

# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00

Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.

List the current amount of swap space in use.

Answer: Varies with configuration. Use swapinfo -tm.

14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3

Start enough mem256 processes to make the system start paging.

Answer: This depends on how much memory you have, but on a 640 MB system 8 processes got things paging nicely!

# vmstat 2 2
     procs        memory             page                       faults       cpu
 r  b  w     avm    free  re  at   pi   po   fr  de    sr   in   sy    cs  us sy id
10  0  0  175597    6489  12   8    2   31   11   0   467    0  271    58  26  4  70
10  0  0  175597    6414  20   0   27   87   22   0  1316  103  300   254 100  0   0

Note the system is paging constantly in the vmstat output and free memory is very low.

Is the new paging activity being distributed evenly across the paging devices?

Answer: No. It is confined to the priority-1 devices: lvol2 (primary swap), swap1, and swap3. The priority-2 device swap2 is not used while priority-1 space remains.

15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo-swap and remove the additional swap devices. For 11i v1 and earlier, follow this procedure:

If 10 MB is currently in use on a single swap device, and we activate an equal priority swap device, what is the distribution if an additional 10 MB is paged out?

A) The distribution would be 10 MB and 10 MB.
B) The distribution would be 15 MB and 5 MB.

Answer: B. vhand does not consider what the previous utilization was.
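The arithmetic behind answer B can be sketched in the shell; the 10 MB figures come from the question itself:

```shell
dev1=10                        # MB already paged out to the first device
dev2=0                         # second device just activated, equal priority
pageout=10                     # additional page-out activity
dev1=$((dev1 + pageout / 2))   # vhand spreads only the NEW activity...
dev2=$((dev2 + pageout / 2))   # ...evenly, ignoring prior utilization
echo "dev1=${dev1}MB dev2=${dev2}MB"
# prints: dev1=15MB dev2=5MB
```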

# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:

# cd /
# kctune swapmem_on=1
# shutdown -ry 0

8–18. LAB: Disk Performance Issues

Directions

The following lab illustrates a number of performance issues related to disks.

1. A file system is required for this lab. One was created in an earlier exercise. Mount it now.

# mount /dev/vg00/vxfs /vxfs

We also need to ensure that the controller does not have SCSI immediate reporting enabled. Enter the following command to check the current state (fill in the device file name as appropriate):

# scsictl -m ir /dev/rdsk/cXtXdX        (report current "ir" status)

If the current immediate_report = 1, enter the following:

# scsictl -m ir=0 /dev/rdsk/cXtXdX      (ir=1 to set, ir=0 to clear)

2. Copy the lab files to the file system.

# cp /home/h4262/disk/lab1/disk_long /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key.

From the first window, time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

            glance Disk Report
real:            Logl Rds:
user:            Phys Rds:
sys:

Answer:

# timex cat file* > /dev/null    (rp2430)
real:  0.73    user:  0.01    Logl Rds: 2560
sys:   0.11                   Phys Rds:  500

# timex cat file* > /dev/null    (rx2600)
real:  0.34    user:  0.00    Logl Rds: 2560
sys:   0.06                   Phys Rds: 2560

5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

# timex cat file* > /dev/null

            glance Disk Report
real:            Logl Rds:
user:            Phys Rds:
sys:

Answer:

# timex cat file* > /dev/null    (rp2430)
real:  0.06    user:  0.01    Logl Rds: 2560
sys:   0.05                   Phys Rds:    0

# timex cat file* > /dev/null    (rx2600)
real:  0.02    user:  0.00    Logl Rds: 2560
sys:   0.02                   Phys Rds:    0

NOTE: The conclusion is that I/O is much faster coming from the buffer cache, than having to go to disk to get the data.
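The cache effect can be put as a single hit-rate figure derived from the glance counters. A quick awk sketch, not part of the lab, using the cold-cache rp2430 figures from step 4 (2560 logical reads, 500 physical reads):

```shell
# Every logical read that did not require a physical read was
# satisfied from the buffer cache.
awk 'BEGIN { logl = 2560; phys = 500
             printf "buffer cache hit rate: %.0f%%\n", 100 * (1 - phys / logl) }'
# prints: buffer cache hit rate: 80%
```

On the second pass, with zero physical reads, the hit rate is 100%, which is why the elapsed time drops by an order of magnitude.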

6. The sar -d report.

Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the VxFS file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

Answer: The disk got over 80% busy. The average number of requests in the I/O queue reached around 53 on the rp2430 and 442 on the rx2600. The average wait time of a request was around 65 ms on the rp2430 and 182 ms on the rx2600. The task took around 12.5 seconds on the rp2430 and 7.5 seconds on the rx2600.
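Figures like these can be pulled out of a captured sar -d run with awk. A sketch, not part of the lab, averaging %busy per device over two sample intervals (timestamps have been stripped from the sample lines, so the device name is field 1):

```shell
# Average the %busy column (field 2) for each device name seen.
awk '$1 ~ /^c[0-9]/ { busy[$1] += $2; n[$1]++ }
     END { for (d in n) printf "%s avg busy %.1f%%\n", d, busy[d] / n[d] }' <<'EOF'
c1t15d0 87.03 24.73 409 12222 33.45 13.86
c1t15d0 82.60 22.01 395 12209 28.53 12.26
EOF
# prints: c1t15d0 avg busy 84.8%
```

The same pattern works for avque or avwait by changing the field number.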

7. The glance I/O by Disk report

Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long. Record the results below:

# ./disk_long

        glance I/O by Disk Report
Util:            Qlen:

Answer: Utilization reached 86% and queue length reached 55 on the rp2430. Utilization reached 85% and queue length reached 414 on the rx2600.

8. The glance I/O by File System report

Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long. Record results below:

# ./disk_long

        glance I/O by File System Report
Logl I/O:            Phys I/O:

Answer: Logical I/Os reached 4059 and Physical I/Os reached 806 on the rp2430. Logical I/Os reached 4702 and Physical I/Os reached 1528 on the rx2600.

9. Performance tuning — immediate reporting. Ensure the immediate reporting options are set for the disk that the file system is located on. If immediate reporting is not set, set it.

# scsictl -m ir /dev/rdsk/cXtXdX        (report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX      (ir=1 to set, ir=0 to clear)

Purge the contents of the buffer cache.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

10. The sar -d report.

Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

How do these results compare to the results in step 6?

________________________________________________________________

9–14. LAB: HFS Performance Issues

Directions

The following lab illustrates a number of performance issues related to HFS file systems.

1. A 512 MB HFS file system is required for this lab. Use the mount and bdf commands to determine if such a file system is available.

# mount -v
# bdf

If there is no such HFS file system available, create one using the commands below:

# lvcreate -n hfs vg00
# lvextend -L 512 /dev/vg00/hfs /dev/dsk/cXtYdZ    (second disk)
# newfs -F hfs /dev/vg00/rhfs
# mkdir /hfs
# mount /dev/vg00/hfs /hfs

2. Copy the lab files to the newly created HFS file system.

# cp /home/h4262/disk/lab1/disk_long /hfs
# cp /home/h4262/disk/lab1/make_files /hfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /hfs
# ./make_files

3. Purge the buffer cache of this data, by unmounting and remounting the file system.

# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs

4. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real:
user:
sys:

Answer:

# timex cat file* > /dev/null    (rp2430)

real         1.04
user         0.01
sys          0.16

# timex cat file* > /dev/null    (rx2600)

real         0.45
user         0.00
sys          0.05

The cat command took 1.04 seconds to complete on the rp2430 and 0.45 seconds on the rx2600.

5. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

Answer:

# sar -d 5 200    (rp2430)
HP-UX r206c41 B.11.11 U 9000/800    03/23/04

11:53:15   device   %busy     avque  r+w/s  blks/s   avwait  avserv
11:53:20  c1t15d0    5.20      0.50     13      66     5.09    4.54
          c3t15d0   33.60   6922.08    950   15049   629.53   14.85
11:53:25  c1t15d0    7.57      0.50     10      36     5.40    6.82
          c3t15d0   55.98   5215.11   1758   27980  2113.38   13.70
11:53:30  c1t15d0    2.01      0.50      6      44     3.92    5.01
          c3t15d0  100.00   8156.62   2983   47696  2591.43   16.45
11:53:35  c1t15d0    8.00      5.80     18     108    25.31   18.95
          c3t15d0   84.20   1237.19    558    8670  1555.06   17.68

11:53:40  c1t15d0    6.00      0.50     15      76     4.69    4.72
          c3t15d0   71.20   7379.94   2168   34537  1322.90   14.77
11:53:45  c1t15d0    0.20      0.50      1       5     0.08    8.35
          c3t15d0   25.80   2375.50    950   15206  3478.83   14.42
11:53:50  c3t15d0    9.20      0.50     16     258     5.06    5.21

The disk got up to 100% busy. The average number of requests in the request queue was about 5200. The average wait time in the request queue was about 1950 ms.

# timex ./disk_long

real        22.76
user         4.57
sys          3.45

The operation completed in 22.76 seconds.

# sar -d 5 200    (rx2600)
HP-UX r265c145 B.11.23 U ia64    04/07/04

13:20:25   device   %busy      avque  r+w/s  blks/s   avwait  avserv
13:20:30   c2t1d0    4.39       0.50     27     706     0.00    1.67
           c2t0d0   27.15       0.50     90     756     0.00    3.04
13:20:35   c2t1d0   41.00     104.29    245    4026   173.18   12.76
           c2t0d0   99.20   24004.63   3322   53129  2127.15    2.35
13:20:40   c2t1d0    1.40       0.50      3      51     0.00    4.62
           c2t0d0  100.00   20020.69   3895   62320  6436.22    2.04
13:20:45   c2t1d0    4.00       0.50     13     287     0.00    5.68
           c2t0d0   57.20    5030.77   2097   33482  9701.92    2.06
13:20:50   c2t1d0    2.40       0.50      7     164     0.00    6.94
13:20:55   c2t1d0    1.40       0.50      2      34     0.00    9.94

The disk got up to 100% busy. The average number of requests in the request queue was about 50,000. The average wait time in the request queue was about 6100 ms.

# timex ./disk_long

real        16.87
user         0.83
sys          1.96

The operation completed in 16.87 seconds.

6. Performance tuning — recreate the file system with larger fragment and file system block sizes.

Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:

# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.

# cp /hfs/disk_long /cust-hfs
# cp /hfs/make_files /cust-hfs
# cd /cust-hfs
# ./make_files
# cd /
# umount /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real:
user:
sys:

Answer:

# timex cat file* > /dev/null    (rp2430)

real         0.84
user         0.01
sys          0.10


# timex cat file* > /dev/null (rx2600)
real 0.43
user 0.00
sys 0.03

The cat command took 0.84 seconds to complete on the rp2430 and 0.43 seconds on the rx2600.

How do the results of step 8 compare to the default HFS block and fragment results from step 4?

_______________________________________________________________________

Answer:

The larger block and fragment sizes resulted in I/O operations that were almost 20% faster on the rp2430 and marginally faster on the rx2600.

9. Performance tuning — change file system mount options. The manner in which the file system is mounted can impact performance. The fsasync mount option can improve performance, but data (metadata) integrity is not as reliable in the event of a crash, and fsck could run into difficulties.

# cd /
# umount /hfs
# mount -o fsasync /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

Answer:

# sar -d 5 200 (rp2430)
HP-UX r206c41 B.11.11 U 9000/800 03/23/04


12:08:22 device %busy avque r+w/s blks/s avwait avserv
12:08:27 c1t15d0 6.20 0.50 9 38 4.18 6.19
         c3t15d0 61.20 5592.30 2120 33818 1376.80 13.94
12:08:32 c1t15d0 7.00 0.50 16 81 4.31 5.28
         c3t15d0 58.60 7186.64 1675 26765 1295.53 17.00
12:08:37 c1t15d0 8.40 3.94 24 146 20.12 13.03
         c3t15d0 92.80 4986.82 1860 29579 2678.62 16.11
12:08:42 c1t15d0 6.60 0.50 17 120 4.84 3.79
         c3t15d0 100.00 15588.44 2344 37493 2943.35 16.95
12:08:47 c3t15d0 71.20 5725.86 2292 36664 6159.69 15.69

The disk got up to 100% busy. The average number of requests in the request queue was about 7800. The average wait time in the request queue was about 2900 ms.

# timex ./disk_long
real 17.17
user 4.61
sys 3.72

The operation completed in 17.17 seconds.

# sar -d 5 200 (rx2600)
HP-UX r265c145 B.11.23 U ia64 04/07/04

13:39:39 device %busy avque r+w/s blks/s avwait avserv
13:39:44 c2t1d0 1.00 0.50 4 67 0.00 2.51
         c2t0d0 46.11 22190.48 1274 20184 1026.94 2.54
13:39:49 c2t1d0 2.00 0.50 5 77 0.00 5.94
         c2t0d0 100.00 30303.60 3684 58941 4021.91 2.15
13:39:54 c2t1d0 3.20 5.20 9 141 11.85 12.77
         c2t0d0 99.80 11176.41 3888 62008 8740.46 2.05
13:39:59 c2t1d0 0.80 0.50 2 30 0.00 4.42
         c2t0d0 5.60 716.00 287 4562 11067.58 1.51
13:40:04 c2t1d0 4.00 0.50 9 43 0.00 4.45

The disk got up to 100% busy. The average number of requests in the request queue was about 17,500. The average wait time in the request queue was about 6100 ms.

# timex ./disk_long
real 14.46
user 0.86
sys 3.04

The operation completed in 14.46 seconds.
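Long sar listings like the one above are easier to digest with a small summarizing filter. The following is a sketch, not part of the course materials: it pulls the peak %busy, avque, and avwait for one device out of a captured sar -d listing. The sample lines in the here-document are taken from the rp2430 output above, and the fields are located relative to the device name because continuation lines omit the timestamp column.

```shell
# Summarize peak values for one device from captured sar -d output.
# Continuation lines have no timestamp, so find the device field first.
summary=$(awk '/c3t15d0/ {
    for (i = 1; i <= NF; i++) if ($i == "c3t15d0") d = i
    if ($(d+1) + 0 > busy + 0)   busy   = $(d+1)   # %busy
    if ($(d+2) + 0 > avque + 0)  avque  = $(d+2)   # avque
    if ($(d+5) + 0 > avwait + 0) avwait = $(d+5)   # avwait (ms)
}
END { printf "peak busy=%s avque=%s avwait=%s", busy, avque, avwait }' <<'EOF'
12:08:27 c1t15d0 6.20 0.50 9 38 4.18 6.19
c3t15d0 61.20 5592.30 2120 33818 1376.80 13.94
c3t15d0 100.00 15588.44 2344 37493 2943.35 16.95
EOF
)
echo "$summary"
```

The same filter can be pointed at a text capture of a longer sar run; only the device name needs to change.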


How do the results of step 10 compare to the default mount options in step 5?

_____________________________________________________________________

Answer:

With fsasync turned on, the operation was about 25% faster on the rp2430 and 14% faster on the rx2600.
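The "about 25% faster" figure can be checked directly from the timex data (22.76 s with the default mount in step 5 versus 17.17 s with fsasync on the rp2430). A throwaway calculation, not an HP tool:

```shell
# Percent improvement: (old - new) / old * 100
pct=$(echo "22.76 17.17" | awk '{ printf "%.1f", ($1 - $2) / $1 * 100 }')
echo "fsasync was ${pct}% faster"
```

The same arithmetic on the rx2600 numbers (16.87 s versus 14.46 s) gives the quoted 14%.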


10–23. LAB: JFS File System Tuning

Directions

The following lab exercise compares the performance of JFS with different mount options. The mount options used with JFS can have a big impact on JFS performance.

1. Mount a JFS file system to be used for this lab under /vxfs.

# mount /dev/vg00/vxfs /vxfs

2. Because the above mount command specified no special mount options, the default mount options are used. Use the mount -v command to view the default options, including the option for transaction logging type.

What type of transaction logging does JFS use by default?

Answer:

Full logging.

3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20 MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.

# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; live data from test:

# timex ./disk_long (rp2430)
real 12.34
user 4.82
sys 3.45

# timex ./disk_long (rx2600)
real 9.49
user 0.90
sys 1.62

If you look back at the HFS results, you will see that this is faster. See question 5 from the previous lab; the test times there were 23 seconds and 17 seconds!
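That comparison is easy to quantify from the recorded runs (12.34 s for JFS versus 22.76 s for HFS, both with default mount options on the rp2430). An illustrative one-liner:

```shell
pct=$(awk 'BEGIN { printf "%.0f", (22.76 - 12.34) / 22.76 * 100 }')
echo "JFS defaults were about ${pct}% faster than HFS defaults"
```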


4. Remount the JFS file system using the delaylog option. This improves the performance of non-critical transactions. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o delaylog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; should be faster than before:

# timex ./disk_long (rp2430)
real 10.93
user 4.85
sys 3.52

# timex ./disk_long (rx2600)
real 9.23
user 0.90
sys 1.64

Based on the results, does the disk_long program perform any non-critical transactions?

Answer:

Yes; the disk_long program performs some non-critical transactions, as shown by the small improvement in execution time. Because the program writes data in 1 MB increments, just about every JFS transaction is critical, so mounting with delaylog rather than full logging does not greatly affect performance in this case. It will in other cases.

5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from the lecture), and before the transaction is written to the intent log. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o tmplog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; live test data:

# timex ./disk_long (rp2430)
real 10.08
user 4.82
sys 3.40

# timex ./disk_long (rx2600)
real 10.35
user 0.90
sys 1.60

Based on the results, why does the disk_long program show little or no improvement when mounted with tmplog?

Answer:

The disk_long program shows little performance improvement because it performs extending write calls. When an "extending write" call is issued, by default JFS writes the user data first, before writing the JFS transaction to the intent log. As a result, even JFS file systems mounted with tmplog or nolog still have to wait for the user data to be written to disk, and this waiting hurts JFS performance.

6. Remount the JFS file system using the mincache=tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long


Record middle results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; live test data. Fastest yet!

# timex ./disk_long (rp2430)
real 9.13
user 4.51
sys 2.69

# timex ./disk_long (rx2600)
real 9.51
user 0.90
sys 1.65

7. Remount the JFS file system using the mincache=direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.

# cd /
# umount /vxfs
# mount -o mincache=direct /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long

Record results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; live test data. Not very impressive!

# timex ./disk_long (rp2430)
real 7:36.75
user 5.15
sys 5.41

# timex ./disk_long (rx2600)
real 3:06.72
user 0.90
sys 2.45

Note on the step 6 results:

When the mincache=tmpcache option is specified, under 2 MB out of 400 MB is physically written to disk. When this option is not specified, all 400 MB is physically written to disk. Major performance improvements should be seen with this option, especially for applications doing lots of "extending write" calls (like the one in this lab).

Based on the results, why does the disk_long program show such poor performance when mounted with mincache=direct? When would this option be appropriate to use?

Answer:

The performance is poor because system calls have to wait while user data and JFS transactions are written out to disk. Normally, the JFS transactions are written to the buffer cache, and the system calls do not have to wait for the transactions to reach disk. This option is appropriate when the application performs its own caching, as an RDBMS (for example, Oracle) does.

8. Dismount the VxFS file system.

# umount /vxfs


11–20. LAB: Network Performance

Directions

The following two labs investigate network read and write performance. The labs use NFS and are performed against the JFS file system created in the JFS module.

Lab 1 Network Read Performance

To perform this lab, two systems are needed: an NFS server and an NFS client. Pair up with another student in the class for this lab.

1. Make sure the JFS file system on the server contains the make_files program. Execute the make_files program to create files for the client to access.

# mount /dev/vg00/vxfs /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs
# cd /vxfs
# ./make_files

2. Export the JFS file system so the client can mount it.

# exportfs -i -o root=<client_hostname> /vxfs
# exportfs

3. From the client system, mount the NFS file system.

# umount /vxfs
# mount server_hostname:/vxfs /vxfs

4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; live test data below.

# timex cat /vxfs/file* > /dev/null (rp2430)
real 1.80
user 0.01
sys 0.07

# timex cat /vxfs/file* > /dev/null (rx2600)
real 1.17
user 0.00
sys 0.02

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; live data below. Much faster once buffered.

# timex cat /vxfs/file* > /dev/null (rp2430)
real 0.05
user 0.01
sys 0.04

# timex cat /vxfs/file* > /dev/null (rx2600)
real 0.02
user 0.00
sys 0.01

Moral: Try to have a big enough buffer cache on the client system for a lot of data to be cached. Also, the biod daemons help by prefetching data.
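The effect of the client buffer cache is easiest to appreciate as throughput. Using the rp2430 numbers above (20 MB of files, 1.80 s on the first read over the network, 0.05 s once cached), a quick illustrative calculation:

```shell
rates=$(awk 'BEGIN { printf "uncached %.1f MB/s, cached %.0f MB/s", 20 / 1.80, 20 / 0.05 }')
echo "$rates"
```

The cached read runs at memory speed, not network speed, which is why sizing the client buffer cache matters.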

6. Test to see if fewer biod daemons will change the initial performance.

# cd /
# umount /vxfs
# kill $(ps -e | grep biod | cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration, but no significant change. Large sequential access appears to be independent of the number of biods. Not what theory suggests? Well, this depends!

# timex cat /vxfs/file* > /dev/null (rp2430)
real 1.80
user 0.01
sys 0.07

# timex cat /vxfs/file* > /dev/null (rx2600)
real 1.15
user 0.00
sys 0.02

7. Once finished, remove the files and unmount the file system.

# rm /vxfs/file*
# umount /vxfs

Lab 2 Network Write Performance

The following lab has the client perform many writes to an NFS file system. The following parameters will be investigated:

• Number of biod daemons

• NFS version 2 versus NFS version 3

• TCP versus UDP

During this lab, the monitoring tools shown below should be used on the client and server:

CLIENT                               SERVER
# nfsstat -c                         # nfsstat -s
# glance NFS report (n key)          # glance NFS report (n key)
# glance Global Process (g key)      # glance Global Process (g key)
  - monitor biod daemons               - monitor nfsd daemons
# glance Disk report (d key)
  - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

# mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

# kill $(ps -e | grep biod | cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. (The first command buffers the file.) Record the results:

# cat /stand/vmunix > /dev/null
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________


Answer:

Varies with configuration.

# timex cp /stand/vmunix /vxfs (rp2430)
real 33.95
user 0.00
sys 0.44

# timex cp /stand/vmunix /vxfs (rx2600)
real 20.64
user 0.00
sys 0.38

4. Now, start up the biod daemons, and retry timing the copy. Record the results:

# /usr/sbin/biod 4
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____________ User: ____________ Sys: ____________

Answer:

Varies with configuration; the test data shows marked improvement. The biods provide the "write behind" service, which reduces the wait time experienced by the cp command.

# timex cp /stand/vmunix /vxfs (rp2430)
real 29.27
user 0.00
sys 0.16

# timex cp /stand/vmunix /vxfs (rx2600)
real 16.53
user 0.00
sys 0.14

5. Change the mount options to version 3 and retime the transfer:

# cd /
# umount /vxfs
# mount -o vers=3 server_hostname:/vxfs /vxfs
# cd /
# timex cp /stand/vmunix /vxfs

Record results:


Real: _____________ User: ____________ Sys: ____________

Answer:

Interesting: it would appear that version 3 mounting is far better than version 2. The results were obtained using the same 4 biods started in step 4.

# timex cp /stand/vmunix /vxfs (rp2430)
real 2.63
user 0.00
sys 0.18

# timex cp /stand/vmunix /vxfs (rx2600)
real 4.13
user 0.00
sys 0.13
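The size of the win is worth computing: with the same 4 biods running, the rp2430 copy took 29.27 s over version 2 and 2.63 s over version 3. An illustrative calculation:

```shell
x=$(awk 'BEGIN { printf "%.1f", 29.27 / 2.63 }')
echo "NFS version 3 was about ${x}x faster than version 2"
```

Much of this gap comes from version 3's larger transfer sizes and asynchronous write semantics.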

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

# ftp server_hostname
ftp> put /stand/vmunix /vxfs/vmunix.ftp

How long did the FTP transfer take? _________

Explain the difference in performance.

Answer:

The data below shows that ftp is well optimized for data transfer. The good news is that NFS version 3 keeps up with it; remember that at 11i, NFS uses TCP/IP rather than UDP/IP.

# ftp r265c69 (rp2430)
Connected to r265c69.cup.edunet.hp.com.
220 r265c69.cup.edunet.hp.com FTP server (Version 1.1.214.4(PHNE_23950) Tue May 22 05:49:01 GMT 2001) ready.
Name (r265c69:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
27573440 bytes sent in 2.55 seconds (10554.31 Kbytes/s)
ftp>

# ftp r265c145 (rx2600)
Connected to r265c145.
220 r265c145.cup.edunet.hp.com FTP server (Revision 1.1 Version wuftpd-2.6.1 Tue Jul 15 07:42:07 GMT 2003) ready.


Name (r265c145:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
47716848 bytes sent in 4.03 seconds (11557.24 Kbytes/s)
ftp>

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

# umount /vxfs
# mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

Perform the copy test again and compare the results with the TCP version 3 mount data in step 5. Is UDP quicker than TCP?

# timex cp /stand/vmunix /vxfs

Answer:

# timex cp /stand/vmunix /vxfs (rp2430)
real 2.44
user 0.00
sys 0.15

# timex cp /stand/vmunix /vxfs (rx2600)
real 4.08
user 0.00
sys 0.13

It would appear that UDP is marginally quicker than TCP, but the difference is very small and probably not worth the risk. NFS version 3 over TCP on HP-UX 11i provides good performance and reliability.