
54 IT Pro November ❘ December 2001 1520-9202/01/$10.00 © 2001 IEEE

Building an Infrastructure for a Powerful Web Presence

Wesley Chou

When discussing a Web presence, people often mention fancy graphics and enticing layouts. But effective administration of an enterprise’s Web site requires more than creativity and a good set of Web authoring tools. It requires a Web site infrastructure that is based on technology that addresses how people use the site and that appropriately supports such usage.

Determining an ideal infrastructure configuration for a particular site is as much an art as it is an exact science. Careful and independent weighing of several considerations will assist in selecting major infrastructure components.

KEY CONSIDERATIONS

“Know your user” is a simple way to think of some of the key considerations in building an infrastructure. An effective site must meet the demands imposed by

• traffic volume,
• the anticipated peak hit rates,
• the geographic distribution of users,
• the geographic distance between Web servers and associated data storage networks,
• traffic characteristics generated by Web page data, and
• the data volume of Web pages.

Before you go about developing an infrastructure, it pays to measure these characteristics. They influence the key components of a Web site infrastructure as much as the costs for materials, administration, and maintenance.

Traffic volume, peak rates, and geographic distributions

The raw number of site users affects the infrastructure less significantly than does the actual traffic volume generated by those users. For example, many users viewing a simple, static page could generate less traffic than a few users downloading large files. The farther users are from servers, the longer the network latency. Several factors (traffic volume, peak hit rates, the geographic distribution of users, and the geographic distance between Web servers and associated data storage networks) help determine the number of Web servers and caches needed to meet the demands of traffic and Web hits.

Traffic characteristics

Most Web sites provide some combination of the following traffic types:

• one-way bursty, from user requests to read text or view still images;
• one-way streaming media, from user requests to receive continuous data streams;
• interactive, from user requests to influence server activity by executing CGI (common gateway interface) scripts or posting data; and

• transactions, a specialized form of interactive traffic in which user requests cause servers to perform some atomic operations on a central database (for example, online airline ticket purchases).

Knowing your traffic and users is key to building a solid foundation of Web servers, data storage networks, content switches, and Web caches.

Inside: Resources; Fibre Channel Protocol

Different types of traffic have different quality-of-service (QoS) requirements. One-way bursty traffic can tolerate response time jittering, whereas streaming-media traffic cannot. Interactive traffic can tolerate some jitter, but not as much as bursty traffic can. Although a Web site cannot guarantee its users end-to-end QoS, the choice of data storage network must enable the most efficient transfer of data between storage devices and Web servers. Another distinction among the traffic types is that one-way bursty and streaming-media traffic are more cacheable than the interactive traffic types.

Traffic types can also influence Web server organization. A good practice is to have a separate server farm handle the data associated with each traffic type. For example, streaming videos could be on one server farm, and white papers and references on another. When Web servers are organized in this way, a content switch can efficiently balance the load of Web server requests. Another consideration that can affect Web farm organization is the type of data generated by user requests.

Data volume of Web pages

Depending on the volume, you can store Web site data on a single storage device or across multiple devices. The number of devices required to store the data affects the type of data storage network to select. Also associated with the storage of data is the frequency of Web site data updates. A Web site that updates its page data continuously has different needs than a site that remains unchanged for days on end.

KEY INFRASTRUCTURE COMPONENTS

The considerations just discussed influence how you choose and use infrastructure components. The four key components of a Web site infrastructure are Web servers, storage networks, content switches, and Web caches.

Web servers

A Web site may use a single Web server, a cluster of servers called a Web server farm, or multiple server farms. Each Web server simultaneously services a certain number of users. Rather than storing the actual Web page data locally, the Web server responds to Web users only after accessing a master copy of the Web page from a central storage device or accessing a slave copy from a local cache server.

The size and number of server farms depend on three related factors:

• Web site traffic volume and hit rate;
• Web page data cacheability; and
• geographic distribution of Web site users.

Traffic volume and hit rate dictate the total capacity that Web servers must provide. Obviously, the more traffic there is, the more Web servers are necessary. However, the actual number of required Web servers depends on the ability of Web caches to reduce the load on Web servers. Data cacheability refers to how frequently the same data is requested. On a news Web site, for example, various users repeatedly request the same data. In contrast, a search engine portal typically generates different data for each user. So a Web cache would benefit the news Web site more than it would the portal. Even if the data is cacheable, however, the extent of Web cache usage depends on the geographic distribution of users. If the users span wide geographic areas, the Web site can rely on caches to offload some of the requests directed at the real Web servers.

Resources

Storage Technologies

➤ SCSI Trade Association, http://www.scsita.org: This organization tracks the latest SCSI standards.
➤ Technical Committee 10, http://www.t10.org: T10 defines the American National Standards Institute’s version of SCSI.
➤ Storage Networking Industry Association, http://www.snia.org
➤ Fibre Channel Industry Association, http://www.fibrechannel.org
➤ Technical Committee 11, http://www.t11.org/index.htm: T11 defines Fibre Channel standards for ANSI.
➤ iSCSI working draft, http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-09.txt

Content Switch Vendors

➤ Cisco Systems, http://www.cisco.com
➤ F5 Networks, http://www.f5networks.com
➤ Nortel Networks, http://www.nortel.com/index.html
➤ Foundry Networks, http://www.foundrynet.com

Cache Engine Vendors

➤ Cisco Systems, http://www.cisco.com
➤ CacheFlow, http://www.cacheflow.com
➤ Network Appliance, http://www.netapp.com
➤ Inktomi, http://www.inktomi.com

Content and traffic type can also influence the number of server farms that a site should support. For example, a site may have some Web page content associated with selling CDs and other content associated with selling books. As with logically separating traffic types among server farms, organizing server farms based on content type facilitates efficient content switching. A typical approach is to partition data so that each farm services a single, specific type of content or traffic.
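The sizing argument in this section (traffic volume and hit rate set the required capacity, and caches reduce the share of it that reaches the Web servers) can be sketched as a back-of-the-envelope calculation. The function name, its parameters, and the idea of a single per-server capacity figure are assumptions made for illustration; the article gives no formula, and real sizing would rest on measured traffic.

```python
import math


def servers_needed(peak_hits_per_s: float,
                   cache_hit_ratio: float,
                   hits_per_server_per_s: float) -> int:
    """Rough count of Web servers needed for the peak load that caches do not absorb.

    All three inputs are hypothetical planning figures, not values from the article.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    if hits_per_server_per_s <= 0:
        raise ValueError("hits_per_server_per_s must be positive")
    # Only requests that miss the caches reach the real Web servers.
    origin_load = peak_hits_per_s * (1.0 - cache_hit_ratio)
    # Round up, and keep at least one server even if caches absorb everything.
    return max(1, math.ceil(origin_load / hits_per_server_per_s))
```

Under these assumptions, a highly cacheable site such as the news example needs fewer servers for the same peak rate than a portal whose pages are generated per user and therefore have a cache hit ratio near zero.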

Storage networks

After determining server farm sizes and organizations, the next infrastructure component to choose is the data and Web page storage network. This component defines how the infrastructure stores data associated with a site and distributes that data from storage devices to the site’s Web servers.

The infrastructure must store the master copy of the data for Web pages on reliable storage devices, such as redundant arrays of inexpensive disks (RAIDs). Three popular schemes that let Web servers access such devices are Small Computer System Interface (SCSI)-based Network Attached Storage (NAS), Fibre Channel Storage Area Network (SAN), and iSCSI-based networks. (For more information, see the “Resources” sidebar.)

SCSI-based NAS. One established way to share data is to directly attach a networked file server to a storage device through the SCSI (pronounced “scuzzy”) interface. SCSI originated years ago as a parallel bus interface through which servers could attach to I/O devices. It has evolved into a well-established suite of protocols involving command, transport, and interface definitions. There have been many enhancements to the protocol. The most recent, Ultra320 SCSI, supports bursts of 320 Mbytes/s over a maximum cabling of 12 meters. One of SCSI’s main advantages is its widespread support and ubiquity.

However, SCSI requires that only a single file server be attached directly to the storage device. Over a traditional LAN (local area network) or WAN (wide area network), servers in the server farm act as clients to this file server. A common practice, for example, is to use the Network File System (NFS) application over TCP/IP (transmission-control protocol/Internet protocol) to virtually mount the storage device. This type of client/server configuration for storage access is known as NAS. This approach is best suited for handling low-volume bursty traffic and least suited for high-volume streaming traffic. Figure 1 shows an example of a NAS topology.

Virtually all server machine vendors, from Sun to Dell to IBM, sell high-performance machines suited for NAS. With Microsoft Windows and the many flavors of Unix all supporting NFS, you can purchase a standard server configuration and have file sharing up and running right out of the box.

Figure 1. Example NAS topology (disk array with SCSI interface, file server, LAN or WAN, and Web servers acting as clients to the file server).

Figure 2. Fibre Channel topologies: (a) a point-to-point topology involves exactly two directly linked Fibre Channel devices; (b) an arbitrated loop connects devices to one another; and (c) a fabric connects devices to a Fibre Channel switch that arbitrates traffic flow among devices.

Advantages. By making use of traditional protocols, NAS provides a quick, easy way to distribute Web content to multiple servers. Moreover, TCP/IP’s ubiquitous nature makes connecting heterogeneous hardware relatively simple. Also, implementing NAS is relatively inexpensive. And, because any LAN/WAN protocol will suffice for communication, the NAS approach doesn’t limit the distance between Web and file servers.

Disadvantages. NAS and TCP are unsuitable for handling large amounts of streaming media, especially if the underlying network runs at low speeds or is congested and so offers less predictable performance. Web servers must reach SCSI storage devices through a file server and therefore incur additional latency. In addition, each SCSI bus can connect no more than eight or 16 devices, thus limiting the amount of data and Web pages it can support.

Fibre Channel SAN. Defined as an ANSI standard protocol in 1994, Fibre Channel provides access to storage devices via more of a network-centered approach to data storage than a device-centered approach such as SCSI. It allows shared access of storage devices, defines three distinct service classes, allows block transfers of up to 128 Mbytes, and supports three topologies. Figure 2 describes each of the three Fibre Channel topologies.

Regardless of the specific Fibre Channel topology, servers have direct access to storage devices. The network that connects servers with the Fibre Channel is known as a SAN. Figure 3 shows an example of a SAN topology. Fibre Channel SANs are most suited for handling heavy amounts of streaming media and for sites supporting frequent interactive traffic.

The “Fibre Channel Protocol” sidebar provides more details regarding Fibre Channel.

Advantages. All Web servers can communicate directly with the storage device and therefore incur lower latency delays than those for a NAS. Also, Fibre Channel supports large block transfers and offers multiple levels of QoS among Web servers and storage devices. Moreover, it can support many more devices (up to 128) than the NAS approach.

Disadvantages. The marginal hardware and implementation costs of the Fibre Channel adapters and switches (if necessary) are substantially higher than the costs of implementing a NAS. Administrative costs are typically higher than for NAS as well, because system administrators are often more familiar with IP networks than with Fibre Channel. Finally, Fibre Channel’s greatest allowable cabling distance is 10 km.

Figure 3. Example SAN topology (disk array with Fibre Channel interface, application server, channel network, Web servers, and Fibre Channel ring). Data travels via Fibre Channel directly between servers and the disk array. Control information travels over a control network (typically IP based) that lets Web servers communicate with an application server, which controls access to the disk array. This control network may be a separate physical network or may run IP over the Fibre Channel’s physical media.

Fibre Channel Protocol

The Fibre Channel protocol falls into four layers. FC-0 defines the parameters for the physical media. Despite its name, Fibre Channel provides a standard for operation over copper as well as optical media. FC-1 defines the transmission protocol, including serial encoding and decoding rules, error correction, and special-character definitions. Fibre Channel does error checking by encoding eight bits into 10 bits.

FC-2 serves as the transport layer for Fibre Channel. It provides framing of data, sequencing, CRC (cyclic redundancy checking), and flow control. FC-2 also defines three service classes.

Class 1 (dedicated connection) offers guaranteed in-order delivery. An option of this class, known as intermix, can guarantee a certain bandwidth; in this case, the Fibre Channel protocol lets devices send class 2 and class 3 traffic only if class 1 is idle. Class 2 (connectionless) offers guaranteed delivery but does not guarantee in-order delivery. In class 3 (no delivery guarantee), receivers do not acknowledge the receipt of data. This class is useful for broadcasting.

FC-3 defines Fibre Channel’s special services, for example, multicast support. FC-4 defines a mapping interface that upper-layer applications use to operate over Fibre Channel. Fibre Channel has no inherent command set, but the SCSI command set and Internet protocol are both defined to operate over Fibre Channel.
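The “Fibre Channel Protocol” sidebar notes that FC-1 encodes eight bits into 10 bits. One consequence is that only eight of every 10 bits on the wire carry data. The short sketch below works through that arithmetic; the function names are hypothetical, and the 1.0625-Gbaud figure in the usage note is an example input for a first-generation link that the article itself does not state.

```python
def payload_bit_rate(line_rate_baud: float) -> float:
    """Data bits per second left after 8-bit-to-10-bit encoding.

    Every 8 data bits occupy 10 transmitted bits, so 80 percent of the
    line rate is available to the layers above FC-1.
    """
    return line_rate_baud * 8 / 10


def payload_byte_rate(line_rate_baud: float) -> float:
    """Same figure expressed in bytes per second."""
    return payload_bit_rate(line_rate_baud) / 8
```

For example, `payload_byte_rate(1_062_500_000)` gives 106,250,000 bytes per second before framing and other FC-2 overhead are subtracted.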

iSCSI-based network. A recent Internet standard, iSCSI sacrifices some of the performance of Fibre Channel in exchange for the ease of a NAS solution. (Proponents of iSCSI argue that with advances in legacy protocol performance, iSCSI will soon match Fibre Channel speeds.) This protocol packages upper-layer SCSI commands and wraps them directly into IP packets. The IP packets travel via a traditional LAN or WAN, as they do with NAS. This approach lets server application software access storage devices as if they were directly connected over a SCSI interface. At server sites, special iSCSI drivers encapsulate outgoing SCSI commands into IP packets and decapsulate incoming SCSI commands from IP. The system accomplishes encapsulation and decapsulation through software alone or through software with the assistance of a special adapter card. At the data storage site, an iSCSI switch performs similar IP/SCSI translation and communicates directly with SCSI storage devices. Figure 4 shows an example of an iSCSI topology.

Advantages. Converting from a SCSI-based NAS installation to an iSCSI configuration is relatively inexpensive. Because iSCSI-based networks are based on IP, they are generally easier to maintain than a Fibre Channel SAN. Also, the iSCSI configuration can support considerably more SCSI devices (and hence more Web pages and data) than a NAS configuration. Finally, because iSCSI bypasses NFS, its performance is better than SCSI-based NAS. However, because these iSCSI-based networks still use traditional LANs and WANs to transmit traffic, the performance improvement is only minor.

Disadvantages. iSCSI is still in its relatively early stages. Its ability to handle varied traffic, although somewhat improved, is still similar to that of SCSI-based NAS.

Content switches

The content switch performs two functions. First, it directs traffic to a specific server farm by examining each Web request’s content. Second, it balances the load of traffic across that server farm. Again, the type of data presented on the Web site is a major consideration. If all server farms access the same data sets, the content switch only performs load balancing. However, if a site plans to support multiple types of traffic or content, allocating one server farm for each type of traffic or content may be desirable.

The content switch must be able to distinguish between HTTP (hypertext transfer protocol) GET requests for different types of data so that it will know which server farm to direct traffic to. For example, the content switch must distinguish between a GET request to a simple text-only page and one to a rich-text page (a page that contains streaming traffic). It will do so by examining the URL embedded in the HTTP packet. An enterprise that plans to supply various types of data and consequently support numerous server farms should consider the rate at which the content switch can parse regular expressions. Content switches can also select server farms by examining other fields in the HTTP header. For example, the content switch can forward a request to a particular server farm based on the requesting client’s browser version.
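The URL-based farm selection described above can be sketched as an ordered list of regular expressions matched against the path of each GET request. The rule patterns, farm names, and function name below are all hypothetical; a real content switch is configured through its vendor’s own rule language, which the article does not describe.

```python
import re

# Hypothetical rules: the first pattern that matches the request path wins.
RULES = [
    (re.compile(r"\.(rm|asf|mov|mpg)$", re.IGNORECASE), "streaming-farm"),
    (re.compile(r"^/cgi-bin/"), "interactive-farm"),
]
DEFAULT_FARM = "static-farm"  # hypothetical farm for plain text and images


def choose_farm(request_line: str) -> str:
    """Pick a server farm from the URL in an HTTP request line such as
    'GET /video/intro.mov HTTP/1.0'."""
    parts = request_line.split()
    if len(parts) < 2 or parts[0] != "GET":
        return DEFAULT_FARM
    # Ignore any query string so that rules see only the path.
    path = parts[1].split("?", 1)[0]
    for pattern, farm in RULES:
        if pattern.search(path):
            return farm
    return DEFAULT_FARM
```

Each request may be tested against every rule in turn, which is one way to see why the rate at which a switch can parse regular expressions matters as the number of farms and rules grows.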

Figure 4. Example iSCSI topology (disk arrays with SCSI interfaces, iSCSI switch, IP network, and Web servers with iSCSI drivers).

Figure 5. Example content switch (Web clients, Internet, content switch, and server farms 1, 2, and 3).

Once the content switch chooses the server farm, it also must select the Web server within that farm. The simplest load-balancing schemes are the round-robin, weighted round-robin, and least-connections schemes. A more complicated method involves sticky connections, which let requests from a specific user consistently switch to the same server. This scheme is useful in transaction processing, in which a user can make multiple purchases from a Web site and thus perform several HTTP GET requests during a single session. At the first request from the user’s machine, the content switch will store either the secure socket layer (SSL) ID or an HTTP cookie associated with the connection. During successive requests, the content switch uses this information to send the HTTP stream to the same Web server. If an enterprise plans on supporting an e-commerce site that relies on SSL, a switch that supports stickiness is essential.
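The server-selection schemes named above can be sketched in a few lines. The class, method, and server names are hypothetical, and the weighted round robin shown here simply repeats each server according to its weight, which is one simple variant among several; the article does not specify how any particular switch implements these schemes.

```python
import itertools
from typing import Dict


class ServerFarm:
    """Toy model of server selection within one farm (illustrative only)."""

    def __init__(self, weights: Dict[str, int]) -> None:
        # Weighted round robin: list each server once per unit of weight.
        # With all weights equal to 1 this is plain round robin.
        order = [name for name, weight in weights.items() for _ in range(weight)]
        self._cycle = itertools.cycle(order)
        # Open-connection counts, to be maintained by the caller.
        self.active: Dict[str, int] = {name: 0 for name in weights}
        # Session key (for example an SSL ID or cookie value) -> chosen server.
        self._sticky: Dict[str, str] = {}

    def round_robin(self) -> str:
        return next(self._cycle)

    def least_connections(self) -> str:
        # The server with the fewest open connections; ties go to the first listed.
        return min(self.active, key=self.active.get)

    def sticky(self, session_key: str) -> str:
        # The first request picks a server; later requests reuse that choice.
        if session_key not in self._sticky:
            self._sticky[session_key] = self.round_robin()
        return self._sticky[session_key]
```

The sticky table is the part that grows with the number of concurrent sessions, which suggests why sites with many transactions need a switch that can hold many sticky connections.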

A content switch is a critical component of a Web site supporting large traffic volume. The variables to consider in choosing a content switch are cost versus performance and feature set. Sites with homogeneous data have less need to pay for fast URL parsing, for example, whereas sites with large numbers of transactions need support to handle many sticky connections. Figure 5 shows the location of a content switch with respect to server farms.

Web caches

These specialized Web servers store frequently accessed data for server farms. Web caches used for enhancing access to site data fall into two categories: transparent mode and reverse proxy.

Transparent-mode caches should be geographically close to client locations, and primarily serve to reduce network latency. Network administrators configure these caches to store data associated with the site. For these caches to be effective, a content switch must intercept HTTP requests destined for the associated site and redirect those requests to the caches. If the cache receives a request for data that hasn’t been stored locally, the cache retrieves the data directly from the site, stores it, and responds to the client. A content switch that supports caches can simultaneously redirect requests to cache clusters and balance the load across those clusters. Figure 6 shows an example of a transparent-mode cache.
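The miss handling described above (retrieve from the site, store, respond) can be sketched as follows. The class name, the `fetch_from_site` callback, and the hit and miss counters are assumptions for illustration; the sketch also omits expiry and size limits, which any real cache engine would need.

```python
from typing import Callable, Dict


class WebCache:
    """Minimal cache that fetches from the origin site only on a miss."""

    def __init__(self, fetch_from_site: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_site      # hypothetical origin-fetch callback
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:
            # Already stored locally: answer without contacting the site.
            self.hits += 1
            return self._store[url]
        # Not stored: retrieve from the site, keep a copy, then respond.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body
```

Because nothing is ever evicted or refreshed here, the sketch fits only static, read-only data, which is also the kind of data the article says benefits most from caching.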

A reverse proxy cache sits closer to the server farms themselves. Similar to a transparent-mode cache, it requires a cache-aware content switch to redirect HTTP traffic to the caches as opposed to the actual server farm. In this case, however, the reverse proxy cache primarily serves to reduce the processing load on servers and other components of the data storage network (for example, file servers). This type of cache is useful, for example, in caching frequently accessed information such as a title graphic or introductory page.

Reverse proxy caches are primarily implemented by the enterprises themselves. Transparent-mode caches are usually implemented by Internet service providers (ISPs). Figure 7 shows an example of a reverse proxy cache.

Figure 6. Example transparent-mode cache (Web clients, cache-aware switch, transparent-mode cache, Internet, content switch, and enterprise Web site). A transparent-mode cache works in conjunction with a cache-aware switch. The switch has knowledge of the cache, and is configured to monitor for data requests directed to certain sites. The switch redirects those requests to the transparent-mode cache.

Figure 7. Example reverse proxy cache (Web clients, Internet, cache-aware content switch, reverse proxy cache, and enterprise Web site). In this configuration, the switch in front of the site must be aware of the cache. The switch redirects client requests for certain data to the cache, thus reducing the processing load of the Web servers.

Some third-party vendors provide a service that emulates the behavior of transparent-mode caches yet does not require special switch configurations. In this case, Webmasters (with vendor assistance) must modify the site data to respond to client requests with HTTP redirect messages that instruct the user’s browser to actually retrieve data from the geographically closer cache. Such an option is attractive to smaller enterprises, which lack the resources to influence an ISP to alter its network configuration. A larger enterprise can influence an ISP to provide Web caches and place them close to the site users.

The entire question of caching depends on the enterprise data’s cacheability. Again, enterprises with primarily static, read-only data benefit the most from caching. Such enterprises should look at their client base to ascertain the percentage of clients that would benefit from a cache implementation. In contrast, any enterprise that constantly updates its Web page data or anticipates extensive interactive traffic could find limited utility in incurring the extra costs of supporting a cache.

In today’s world of real-time information dissemination, a strong Web presence is essential. For an enterprise to obtain such a presence, it must consider the type of data to support, the size and demographics of its Web client base, and the frequency with which it must update the data on its site. If an enterprise considers all these factors, it can build the proper Web site infrastructure. Customers will then be able to quickly and efficiently perform e-commerce transactions and access product and service information on the enterprise’s site. This in turn will lead to increased customer satisfaction and ultimately increased revenue.

Wesley Chou is a software engineer at Cisco Systems. Contact him at [email protected].
