
1

System & Network Administration

•Chapter 3 – Service

By Chang-Sheng Chen (200803011)

2

Contents of Chapter 3

3.1 The Basics

3.1.1 Customer Requirements

3.1.2 Operational Requirements

3.1.3 Open Architecture

3.1.4 Simplicity

3.1.5 Vendor Relations

3.1.6 Machine Independence

3.1.7 Environment

3.1.8 Restricted Access

3.1.9 Reliability

3.1.10 Single or Multiple Servers

3.1.11 Centralization and Standards

3.1.12 Performance

3.1.13 Monitoring

3.1.14 Service Rollout

3.2 The Icing

3.2.1 Dedicated Machine

3.2.2 Full Redundancy

3.3 Conclusion

3

The Basics

• The most important thing to consider at all stages of design and deployment is the customers’ requirements.

– Talk to the customers and find out what their needs and expectations are for the services.

• Then, build a list of other requirements that are visible only to the SA team.

– Focus on what, rather than how.

• Services should be built on server-class machines that are kept in a suitable environment.

4

The Basics (cont.)

• Access to server machines should be restricted to SAs for reasons of reliability and security.

• An SA has several decisions to make when building a service.

– Choosing vendors and products (software, hardware)

– Reliability, performance, etc.

5

The Basics (cont.)

• Most services rely on other services.

– Understanding in detail how a service works will give you insight into the services on which it relies.

• For example, almost every service relies on the name service (DNS). DNS relies on the network; therefore, anything that relies on DNS also relies on the network.

• A service should be built as simply as possible, with as few dependencies as possible, to increase reliability and make it easier to support and maintain.
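The DNS example can be made concrete with a small sketch that follows dependencies transitively. The service names and the dependency map below are hypothetical and only illustrate the idea; a real site would build the map from its own inventory.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists what it relies on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "webmail": ["smtp", "dns"],
    "smtp": ["dns"],
    "dns": ["network"],
    "network": [],
}


def all_dependencies(service: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every service that `service` relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            # Follow the dependencies of this dependency as well.
            stack.extend(graph.get(dep, []))
    return seen
```

With this map, `all_dependencies("smtp", DEPENDS_ON)` includes `"network"` even though the SMTP service never lists it directly, which is the point made above about DNS.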

6

The Basics (cont.)

• Another way to ease support and maintenance is to use standard hardware and software, standard configurations, and documentation kept in a standard location.

• A key part of implementing any new service is to make it independent of the particular machine it runs on.

7

3.1.1 Customer Requirements

• When building a new service, you should always start with the customer requirements.

• Gathering the customer requirements

– There are very few services that do not have customer requirements.

• DNS, authentication services, etc.

• A Service Level Agreement (SLA)

– An SLA enumerates the services that will be provided and the level of support they receive.

8

Service Level Agreement (cont.)

• A Service Level Agreement (SLA)

– An SLA enumerates the services that will be provided and the level of support they receive.

– It typically categorizes problems by severity and commits to response times for each category.

– The SLA usually defines an escalation process that increases the severity of a problem if it has not been resolved after a specified time and calls for managers to get involved if problems are getting out of hand.
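The severity categories and the escalation process can be sketched as a small table plus two rules. This is only an illustration: the severity levels, the hour limits, and the names `SLA_TABLE`, `Ticket`, `escalate`, and `needs_manager` are all assumptions, not part of any particular SLA.

```python
from dataclasses import dataclass

# Hypothetical severity categories with response and escalation times in hours.
SLA_TABLE = {
    1: {"respond_within_h": 1, "escalate_after_h": 4},
    2: {"respond_within_h": 4, "escalate_after_h": 24},
    3: {"respond_within_h": 24, "escalate_after_h": 72},
}


@dataclass
class Ticket:
    severity: int        # 1 is the most severe category
    hours_open: float
    resolved: bool = False


def escalate(ticket: Ticket) -> Ticket:
    """Raise the severity one step if the ticket is unresolved past its limit."""
    limit = SLA_TABLE[ticket.severity]["escalate_after_h"]
    if not ticket.resolved and ticket.hours_open > limit and ticket.severity > 1:
        return Ticket(ticket.severity - 1, ticket.hours_open, ticket.resolved)
    return ticket


def needs_manager(ticket: Ticket) -> bool:
    """Involve a manager when a top-severity ticket stays unresolved too long."""
    limit = SLA_TABLE[1]["escalate_after_h"]
    return (not ticket.resolved) and ticket.severity == 1 and ticket.hours_open > limit
```

For example, a severity-3 ticket open for 80 hours would be raised to severity 2 under these assumed limits, while a resolved ticket is never escalated.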

9

Service Level Agreement (SLA)

• The SLA process is a forum for the SAs to understand the customers’ expectations and to set them appropriately, so that the customers understand what is and is NOT possible, and why.

– It is a tool for planning what resources will be required.

– The SLA should document the customers’ requirements and set realistic goals for the SA team in terms of features, availability, performance, and support.

– It should document future needs and capacity so that all parties will understand the growth plans.

10

3.1.2 Operational Requirements

• The SA team may have other requirements for the new service that are not immediately visible to the customers.

– The administrative interface: whether it interoperates with other existing services and can be integrated with central services such as authentication or directory services.

– SAs also need to consider how the service scales.

– A related consideration is the upgrade path for the service.

– The level of reliability

– Network performance issues

– Monitoring issues (availability, performance, etc.)

– Budget issues

11

Operational Requirements (cont.)

• Questions about an upgrade process:

– Does it involve an interruption of service?

– Does it involve touching every desktop?

– Is it possible to roll out the upgrade slowly, testing it on a few willing people before inflicting it on the whole organization?

• Try to design the service so that upgrades are easy, can be performed without service interruption, don’t require touching the desktops, and can be rolled out slowly.

12

3.1.3 Open Architecture

• Whenever possible, a new service should be built around open protocols and file formats.

– Any service with an open architecture can be more easily integrated with other services that follow the same standards.

• The business case for using open protocols is simple:

– It lets you build better services because you can select the best server and the best client, rather than being forced to pick, for example, the best client and then getting stuck with a less-than-optimal server.

13

Open Architecture (cont.): the ability to decouple the client and server selections

• A better way is to select protocols based on open standards and permit each side (i.e., client and server) to select its own software.

– Customers are free to choose the software that best fits their own needs, biases, and even platforms.

– SAs can independently choose a server solution based on their needs for reliability, scalability, and manageability.

• The SAs can now choose between competing server products, rather than being locked into the (potentially difficult to manage) server software and platform required for a particular client application.

– Open protocols provide a level playing field that inspires competition between vendors, which benefits you.

14

Open Architecture (cont.)

• Open protocols and file formats are typically quite static (or change only in upward-compatible ways) and are widely supported,

– giving you the maximum product choice and the maximum chance of reliable, interoperable products.

• The other benefit of using open systems is that you don’t require a gateway to the rest of the world.

– Gateways are additional services that require capacity planning, engineering, monitoring, and everything else mentioned in this chapter.

• Case Study

– Hazards of proprietary e-mail software

• Chosen primarily for the client user interface and features (e.g., graphical user interface), with no concern for server management, reliability, and scalability

– All messages from all users in a single large file

– Protocol gateways reduce reliability

• Microsoft Exchange Server

15

3.1.4 Simplicity

• When architecting a new service, simplicity should be your foremost consideration.

– The simplest solution that satisfies all the requirements will be the most reliable, easiest to maintain, easiest to expand, and easiest to integrate with other systems.

– As the system grows, it will become more complex. Therefore, starting out as simple as possible delays the day when the system becomes too complex.

• Sometimes, one or two requirements from the customers or SAs may add considerably to the complexity of the system.

– Reevaluate the importance of these requirements.

• These requirements could be met, but at a cost to reliability, support levels, and ongoing maintenance.

16

3.1.5 Vendor Relations

• When choosing hardware and software for a service, you should be able to talk to sales engineers from your vendors to get advice on the best configuration for your application.

– Hardware vendors sometimes have product configurations that are tuned for particular applications, such as databases or web servers.

• If there is more than one server vendor in your environment, and more than one of them has an appropriate product, use the situation to your advantage.

– Get those vendors bidding against each other:

• Get more performance, reliability, or scalability for the same price

• Get a better price and be able to invest the surplus

– Even if you know which vendor you will choose, don’t let them know that you have decided until you are convinced that you have the best deal possible.

17

3.1.5 Vendor Relations (cont.)

• When choosing a vendor, particularly for a software product, it is important to understand the direction in which the vendor is taking the product.

– For key, central services, such as authentication or directory services, it is essential to stay in touch with the product direction, or you may suddenly discover that the vendor no longer supports your platform.

• If possible, stick to vendors who develop the product primarily on the platform you use, rather than porting it to that platform.

– Benefits include fewer bugs, receiving new features first, and better support.

18

3.1.6 Machine Independence

For Name-based Service (Ch.6 Name Service)

• Clients should always access a service using a generic name that is based on the function of the service.

– E.g., smtp.nctu.edu.tw, pop3.nctu.edu.tw

• The machine should never have a primary machine name that is function-based,

– because ultimately the function may need to move to another machine. For example,

• Primary name: DcMg.nctu.edu.tw

• Alias (service) name: smtp.cc.nctu.edu.tw
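The separation of service names from machine names can be sketched as an alias table, in the spirit of DNS CNAME records. The host names and the helper functions below are hypothetical and only show why clients are unaffected when a service moves.

```python
# Hypothetical alias table: function-based service name -> primary machine name.
aliases = {
    "smtp.example.edu": "hostA.example.edu",
    "pop3.example.edu": "hostA.example.edu",
}


def resolve(service_name: str) -> str:
    """Return the machine that currently provides the named service."""
    return aliases[service_name]


def move_service(service_name: str, new_host: str) -> None:
    """Move a service by repointing its alias; clients keep using the same name."""
    aliases[service_name] = new_host
```

Moving SMTP to another machine is then a single change, `move_service("smtp.example.edu", "hostB.example.edu")`, and the POP3 alias and every client configuration stay as they were.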

19

3.1.6 Machine Independence (cont.)

For IP address-based services,

• we could also use techniques such as layer 4 switching to give the machine that the service runs on multiple virtual IP addresses in addition to its primary real IP address.

– Then the virtual address and the service can be moved to another machine relatively easily.

20

3.1.7 Environment

• A fundamental piece of building a service is providing a reasonably high level of availability, which means placing all the equipment associated with that service in a data center (cf. Ch. 17).

– A data center provides protected power, plenty of cooling, controlled humidity (vital in dry or damp climates), fire suppression, and a secure location where the machine should be free from accidental damage or disconnection.

– In addition, a server often needs much faster network connections (e.g., high-speed links, more interfaces) than its clients, because it must communicate at reasonable speeds with many clients simultaneously.

• High-speed network cabling and hardware are typically expensive to deploy.

21

Environment (cont.)

• None of the components of the service should rely on anything that runs on a machine that is not located in the data center.

– The service is only as reliable as the weakest link in the chain of components that need to be working for the service to be available.

– If a component does rely on such a machine, find a way to change the situation:

• Move the machine into a data center

• Replicate that service onto a data center machine

• Remove the dependency on the less reliable machine

• Case Study

– Hazards of servers relying on non-servers

• NFS automount
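The "weakest link" point can be illustrated numerically. The sketch below assumes that component failures are independent, which is a simplification, and the availability figures are made up for illustration.

```python
from math import prod
from typing import Sequence


def chain_availability(availabilities: Sequence[float]) -> float:
    """Availability of a service whose components must all be working.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical chain: two data center machines and one desktop machine.
components = [0.999, 0.999, 0.95]
overall = chain_availability(components)
```

Under this assumption the overall figure can never exceed the availability of the least reliable component, which is why a single dependency on a machine outside the data center drags the whole service down.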

22

3.1.8 Restricted Access

• Restricting server access to the SA team from the beginning is the best way to ensure reliability and expected performance levels.

– There should be no reason for anyone to log in to a server other than an SA performing administrative work on the server.

• The fewer people who log in to a machine, the more stable it is.

– If a customer can log in to a particular server and becomes accustomed to doing so, he will probably start running other jobs on it that take CPU and I/O cycles away from the services, without realizing that he is adversely affecting the service.

• E.g., NFS server

23

3.1.9 Reliability

• If you have redundant hardware available, use it as effectively as you can.

• The single most effective way to make a service as reliable as possible is to make it as simple as possible.

– Find the simplest solution that meets all the requirements.

• When you are building a service at a central location that will be accessed from remote sites, it is particularly important to take network topology into account.

– If connectivity to the main site is down, can the service still be available to remote sites?

• Some, yes: stale name service, authentication service

• Others, no: database or file service

24

3.1.10 Single or Multiple Servers

• Independent services (or daemons) should always be on separate machines, if cost and staffing levels permit.

– However, if the service that you are building is actually composed of more than one new application or daemon, and the communication between those components is over a network, you need to consider whether to put all of the components on one machine or to split them across many machines.

• E.g., a website with a database, a mail system with many filtering mechanisms (e.g., anti-spam, anti-virus, etc.)

– The choice may be determined by security, performance, or scaling concerns.

25

Single or Multiple Servers (cont.)

• In other cases, one of the components will initially be used only for this one application, but may later be used by other applications. E.g.,

– Calendar service + LDAP server (initially)

– Mail service + LDAP server (later)

– …

• If a service, such as LDAP, may be used by other services in the future, it should be placed on dedicated machines,

– so that the calendar service can be upgraded and patched independently of the (ultimately more critical) LDAP servers.

26

Single or Multiple Servers (cont.)

• Sometimes, two applications or daemons may be completely tied together and will never be used apart from each other.

– In this situation, it makes sense to put them both on the same machine.

– E.g., mail server + DNS caching server

• Video streaming server

– Encoding, streaming server

27

3.1.11 Centralization and Standards

• An element of building a service is centralizing the tools, applications, and services that your customers need.

– Centralization (a single point of contact) means that the tools, applications, and services are primarily managed by one central group of SAs on a single central set of servers.

– Support for these services is provided by a central helpdesk.

• Centralizing services and building them in standard ways makes them easier to support and lowers training costs.

– The service should be designed and documented in some consistent way, so that the SA answering the support call knows where to find everything and thus can respond more quickly.

28

Centralization and Standards (cont.)

• Centralization does not preclude centralizing on regional or organizational boundaries, particularly if each region or organization has its own support staff.

– Some services, such as e-mail, authentication services, and networks, are part of the infrastructure and need to be centralized.

• For large sites, these services can be built with a central core that feeds information to and from distributed regional and organizational systems.

– Other services, such as file services and CPU farms, are more naturally centralized around departmental boundaries.

29

3.1.12 Performance

• From a customer’s view, two things are important in any service:

– “Does it work?” and “Is it fast?”

• When designing a service, you need to pay attention to its performance characteristics,

– even though there may be many other difficult technical challenges to overcome.

• Performance expectations increase constantly as networks, graphics, and processors get faster.

– To build a service that performs well, you need to understand how it works and perhaps look at ways of splitting it effectively across multiple machines.

30

3.1.12 Performance (cont.)

• Performance expectations increase constantly as networks, graphics, and processors get faster.

– Performance that is acceptable now may not be six months or a year from now.

• To build a service that performs well, you need to understand how it works and perhaps look at ways of splitting it effectively across multiple machines.

– You also need to consider how to scale the performance of the system as usage and expectations rise above what the initial system can do.

31

3.1.12 Performance (cont.)

• When choosing the servers that run the service, consider how the service works.

– A lot of disk I/O?

• More disk reads than writes (or vice versa)?

– Keeping large tables of data in main memory?

• Lots of fast memory and larger memory caches

– A network-based service that sends large amounts of data to clients or between servers?

• Multiple dedicated servers with high-speed interfaces, clusters of servers, etc.

32

Performance (cont.)

• Case Study

– Bad capacity planning makes a bad first impression

– Performance at remote sites (i.e., over wide-area links)

• Web site (e.g., different content for modem, T1, high-speed links, etc.)

– Solution: proxy server (HTTP accelerator)

• Handset windows vs. computer windows

33

Performance at remote sites

• Performance of the service for remote sites may also be an issue.

– In some cases, quality of service or intelligent queuing mechanisms can be sufficient to make performance acceptable.

• E.g., mail relays/forwarders, web proxies, etc.

– In others, you may need to look at ways of reducing the network traffic.

• Different content on a web system for different link speeds (e.g., text-only versions for low-speed links such as modem or T1, and graphical versions for high-speed links)

34

3.1.13 Monitoring (Ch.24)

• A service is not complete and cannot be called a service unless it is being monitored for availability, problems, and performance, and capacity planning mechanisms are in place.

– The helpdesk, or front-line support group, must be automatically alerted to problems with the service so that they can start fixing them before too many people are affected.

– Likewise, the SA group should monitor the service on an ongoing basis from a capacity planning standpoint.

• E.g., network bandwidth, server performance, transaction rates, license and physical device availability, etc.
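The two uses of monitoring, alerting on problems and feeding capacity planning, can be sketched with a single threshold check. The function name `evaluate`, the 80% planning threshold, and the message formats are assumptions chosen for illustration, not part of any monitoring product.

```python
def evaluate(metric: str, used: float, capacity: float, plan_at: float = 0.8) -> str:
    """Classify one measurement of a monitored resource.

    Returns an ALERT message when the resource is exhausted, a PLAN message
    when utilization passes the (assumed) capacity planning threshold, and
    "OK" otherwise.
    """
    utilization = used / capacity
    if utilization >= 1.0:
        return f"ALERT: {metric} exhausted"
    if utilization >= plan_at:
        return f"PLAN: {metric} at {utilization:.0%} of capacity"
    return "OK"
```

A helpdesk would act on the ALERT results right away, while the SA group could review the PLAN results periodically, for example for bandwidth, licenses, or transaction rates.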

35

Monitoring Example: Statistics for mail.nctu.edu.tw

[Figures: monitoring statistics for mail.nctu.edu.tw; images not available]

39

3.1.14 Service Rollout (Initial Launch)

• Make sure the customers’ first impressions are positive.

– The rollout and the customers’ first experiences with the service will color the way that they view the service in the future.

• One of the key pieces of making a good impression is having all of the documentation available, the helpdesk familiar with and trained on the new service, and all the support procedures in place.

– There is nothing worse than having a problem with a new application and finding out that no one seems to know anything about it when you look for help.

40

3.1.14 Service Rollout (cont.)

• The rollout also includes building and testing a mechanism to install the new software and configuration settings that are needed on each desktop.

– One-some-many technique

• One, then some, then many

• Ideally, no new desktop software or configuration should be required for the service, because that is less disruptive for your customers and reduces maintenance,

– but installing new client software on the desktops is frequently necessary.

– E.g., enabling the IEEE 802.1X authentication scheme, web browser (IE vs. Firefox)

• New Trend

– Example: SSL VPN vs. PPTP VPN
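The one-some-many technique for desktop installs can be sketched as a staged loop that stops at the first failure. The stage size of five hosts and the `install` callback are assumptions; a real rollout would also pause between stages to collect feedback.

```python
from typing import Callable, List


def rollout(hosts: List[str], install: Callable[[str], bool], some: int = 5) -> List[str]:
    """Install on one host, then a few, then the rest; stop at the first failure.

    Returns the hosts that were upgraded successfully before stopping.
    """
    stages = [hosts[:1], hosts[1:1 + some], hosts[1 + some:]]
    done: List[str] = []
    for stage in stages:
        for host in stage:
            if not install(host):
                # A failure means later hosts and stages are left untouched.
                return done
            done.append(host)
    return done
```

Because a failure in the "one" or "some" stage stops the loop, a bad package affects only a handful of desktops instead of the whole organization.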

41

3.2 The Icing

3.2.1 Dedicated Machine

3.2.2 Full Redundancy

– E.g., name service and authentication services

– Primary vs. secondary (duplicate) set of servers

• Fail-over, backup

– Tightly coupled vs. loosely coupled servers

• Load sharing, performance increasing

42

3.2.1 Dedicated Machine

• Having dedicated machines for each service

– More reliable

– Debugging is easier when there are reliability problems

– Outages (temporary service interruptions) are more limited in scope

– Upgrades and capacity planning are much easier

43

Dedicated Machine (cont.)

• Sites that grow from a small company to a larger one generally end up with one central administrative machine.

– Eventually, this machine will have to be split up and the services spread across many servers because of increased load.

– IP address dependencies are the most difficult to deal with when splitting services from one machine to many.

• Name service (e.g., DNS, NIS), security service (e.g., router or firewall rules), etc.

44

3.2.2 Full Redundancy

• Consider which services your customers will benefit most from having completely redundant, and start there.

– Name service and authentication services are typically the first services to have full redundancy.

• They are designed for secondary servers.

• They are so critical.

– Other critical services, such as e-mail, printing, and networks, tend to be considered much later because they are more complicated or more expensive to make completely redundant.

45

Full Redundancy

• Another benefit of full redundancy

– It makes the upgrade procedure easier.

• A “rolling update” can be performed.

• Case Study: Designing E-mail Services for Reliability

– Incoming mail path vs. outgoing mail path

• Mail relays vs. mail routing hosts

• Mail delivery hosts

– Firewall
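The rolling update mentioned on this slide can be sketched as upgrading redundant servers one at a time while the others keep serving. The `upgrade` and `healthy` callbacks and the server names in the usage note are hypothetical; this is a minimal sketch, not a production procedure.

```python
from typing import Callable, List


def rolling_update(
    servers: List[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade redundant servers one at a time so the rest keep serving.

    Stops as soon as an upgraded server fails its health check and returns
    the servers that were upgraded successfully.
    """
    for i, server in enumerate(servers):
        in_service = [s for s in servers if s != server]
        if not in_service:
            raise RuntimeError("no redundant server left to carry the load")
        upgrade(server)
        if not healthy(server):
            return servers[:i]
    return list(servers)
```

For example, `rolling_update(["ns1", "ns2"], upgrade, healthy)` takes each name server out of service in turn, so customers always have one working server, which is only possible because the service is fully redundant.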

46

Appendix

• Background - Internet Applications

• Networking Troubleshooting Process

• Case Study:

– E-mail system operations and design considerations

– Security events

47

Background - Internet Applications

48

Truth Depends on Interpretation (e.g., anti-spam or anti-virus mail filtering)

[Figure: a message from MTA0 is filtered with H1(msg) and with H2(msg). Labels: MTA1 (or MUA1); MTA2 (or MUA1); Accept; Discard; Mail Spool.]

• MTA = Mail Transfer Agent

• MUA = Mail User Agent

49

Operation of a Typical E-mail System

[Figure labels: Internet; Firewall; Incoming SMTP Gateway Farm; Outgoing SMTP Gateway Farm; Mail Spool server; Bouncing server; SMTP auth; Mail Filtering (BL/GL/WL, auto-learn).]

50

Incoming Flow of a Typical Mail System

[Figure labels: Internet; MTA (sendmail); LDA (procmail); anti-virus programs; mail spool; user mail storage; POP3/IMAP server; MUA on the user PC (Netscape, MS Outlook, etc.).]

51

Generic E-mail Transmission Path

[Figure labels: source; Outgoing SMTP Gateway; SMTP; firewall and filtering (shown twice); Incoming SMTP Gateway; POP3/IMAP; destination; steps numbered 1 to 6.]

52

A Hybrid Model for Anti-spam: Generic Mail Filtering

[Figure labels: Client; Generic Mail Filtering; White List; Black List; Grey List; Automatic Spam Learning; step numbers (1) to (4); outcomes Pass, Fail, Fail temporarily, Accept, Reject, Discard, Bounce, Update; Mail Spool.]
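One way to read the hybrid model is as a sequence of checks, each of which can accept, reject, or temporarily fail a message. The order of the checks, the score threshold, and the function below are assumptions made for illustration; the figure itself does not survive in enough detail to confirm the exact flow.

```python
from typing import Set


def filter_mail(
    sender: str,
    seen_before: bool,
    spam_score: float,
    whitelist: Set[str],
    blacklist: Set[str],
    spam_threshold: float = 6.0,
) -> str:
    """Return "accept", "reject", "tempfail", or "discard" for one message.

    Assumed order: white list, black list, grey list, then a learned spam score.
    """
    if sender in whitelist:
        return "accept"        # trusted senders skip the remaining checks
    if sender in blacklist:
        return "reject"
    if not seen_before:
        return "tempfail"      # greylisting: ask unknown senders to retry later
    if spam_score >= spam_threshold:
        return "discard"
    return "accept"
```

The "tempfail" outcome corresponds to the "Fail temporarily" label: a legitimate mail server retries after a temporary failure and passes the grey list on the next attempt, while many spam senders do not retry.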

53

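Sample Anti-spam Statistics for mail.TN.edu.tw

(http://ms2.tn.edu.tw/report/day/ )

[Figure: breakdown of all messages by Greylist, ClamAV (virus), and SpamAssassin results, with categories Rejected, Blocked (SpamLevel > 16), Passed with SpamLevel 6-15, and Passed. Percentages shown in the chart: 73%, 27%, 25%, 2%, 3%, 17%, 5%.]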

54

Networking Troubleshooting Process

[Figure labels: Client; Router_a and Router_b (router/switch filtering); DNS_a and DNS_b (DNS filtering); SMTP_a and SMTP_b (SMTP filtering).]

55

Port-scanning summary on DNS servers of neighbor sites

56

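Hsinchu-Miaoli Regional Network DNS Server Intrusion: Sample Scenario

• In 2000, a school registered two DNS servers for its domain in the regional network.

– However, from the very beginning, only one server was actually set up.

• That server (server-A) had a security hole and was broken into by an outsider.

– The intruder kept using server-A to try to break into sites abroad.

– The organization had not set up standard contact mailboxes such as abuse and postmaster, and nobody was handling root's mail on the machine.

– As a result, the parent domain kept receiving complaints and requests for help by e-mail, forwarded from various places abroad.

• Statistics from the regional network's router showed a large amount of abnormal DNS traffic from the school,

– destined for overseas (Eastern Europe).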

57

Multiple outgoing paths and distributed DNS

[Figure labels: Internet; ISP-1; ISP-2; DNS server farm (.com, .arpa, others); DNS server; caching-only server; www, proxy, SMTP; Layer-1; Layer-2; ordinary client.]

58

Traffic Amplifying Attacks via DNS Zone Transfer

[Figure labels: A: attacker; V: attacked site (victim); D1, D2, ..., Dn, where n is some large number; Q: zone transfer; Q(1) ... Q(n); R(1), R(2), ..., R(n).]

59

Common Terms

• Reliability (from Wikipedia)

– In general, reliability (systemic definition) is the ability of a person or a system to perform and maintain its functions in routine circumstances, as well as in hostile or unexpected circumstances.

– The IEEE defines it as ". . . the ability of a system or component to perform its required functions under stated conditions for a specified period of time."

http://en.wikipedia.org/wiki/Institute_of_Electrical_and_Electronics_Engineers

60

Common Terms

• In telecommunications and reliability theory, the term availability has the following meanings:

– 1. Simply put, availability is the proportion of time a system is in a functioning condition.

• Note 1: The conditions determining operability and committability must be specified.

• Note 2: Expressed mathematically, availability is 1 minus the unavailability.

http://en.wikipedia.org/wiki/Telecommunication

http://en.wikipedia.org/wiki/Reliability_theory

http://en.wikipedia.org/wiki/Unavailability

61

Common Terms

• In telecommunications and reliability theory, the term availability has the following meanings:

– 2. The ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval.

• Note 1: An example of availability is 100/168 if the unit is capable of being used for 100 hours in a week.

• Note 2: Typical availability objectives are specified either in decimal fractions, such as 0.9998, or sometimes in a logarithmic unit called nines, which corresponds roughly to a number of nines following the decimal point, such as "five nines" for 0.99999 reliability.

http://en.wikipedia.org/wiki/Telecommunication

http://en.wikipedia.org/wiki/Reliability_theory

http://en.wikipedia.org/wiki/Functional_unit

http://en.wikipedia.org/wiki/Nine
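The two notes can be checked with a few lines of arithmetic. The function names are illustrative, and the "nines" formula used here, the negative base-10 logarithm of the unavailability, is one common way to express the "roughly a number of nines" idea.

```python
import math


def availability(usable_hours: float, interval_hours: float) -> float:
    """Ratio of the time a unit is usable to the length of the interval."""
    return usable_hours / interval_hours


def nines(avail: float) -> float:
    """Number of nines, computed as -log10 of the unavailability (1 - availability)."""
    return -math.log10(1.0 - avail)


weekly = availability(100, 168)   # the 100-hours-in-a-week example from Note 1
five_nines = nines(0.99999)       # approximately 5
```

An availability of 0.9998 gives about 3.7 nines under this formula, which shows why the unit is described as only roughly matching the count of nines after the decimal point.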

62

Definition of availability

• Barlow and Proschan [1975] define availability of a repairable system as "the probability that the system is operating at a specified time t."

• Representation

– The simplest representation of availability is the ratio of the expected value of the uptime of a system to the sum of the expected values of the up and down time, or

$$A = \frac{E[\text{uptime}]}{E[\text{uptime}] + E[\text{downtime}]}$$
