8/6/2019 RHEL6_Scalability_WP
http://slidepdf.com/reader/full/rhel6scalabilitywp 1/14
www.redhat.com
Whitepaper
RED HAT ENTERPRISE LINUX 6 SCALABILITY

EXECUTIVE SUMMARY

Scalability is one of the major areas of focus in Red Hat® Enterprise Linux® 6. The importance of scalability is driven by the convergence of several factors, which have combined to create a strong customer demand for a high-performance, production-ready Linux infrastructure.

Red Hat Enterprise Linux is now enterprise-proven as the environment for the most demanding enterprise applications. Red Hat Enterprise Linux has demonstrated the power, reliability, and robustness demanded by production environments. A good example of this is the increasing use of Red Hat Enterprise Linux to run SAP, one of the most demanding enterprise application suites.
Customer demand for large Linux systems is driven by the migration from existing large UNIX systems running on proprietary RISC architectures to Linux running on standard hardware. This is driving the demand for large, highly capable Linux systems to replace aging large, highly capable UNIX systems. Another factor is growing demand for IT resources at virtually all companies. Most of this demand is for large numbers of small to mid-range systems, either blade servers or rack-mounted servers.
Primarily based on the x86-64 architecture, these two-socket or four-socket systems now include 8-64 processors or cores, support a terabyte of memory, and include high-performance I/O. Today, a single processor contains 4-16 cores, and this will continue to grow over the next several years. What was an extremely large system a few years ago is a mid-range system today, and will be an entry-level system in a few more years. Further, the large numbers of these systems create challenges for networking, storage, and management.

Thus, Red Hat Enterprise Linux 6 must support the hardware capabilities and customer demands of today while being positioned to support the greatly enhanced hardware capabilities, and customer demands, of the coming decade.
TABLE OF CONTENTS

Scalability: Theory and Practice
Scalability Drivers
64-bit Support
Processor Count
Ticket Spinlocks
Tickless Kernel
Split LRU VM
Control Groups
File Systems
Shared Storage File Systems
RAS
Scalability: Theory and Practice

There are two components to scalability: the architectural limits of the operating system, and how well the system runs production workloads. While architectural limits get the most publicity, the real question is how well the system runs production workloads.

To see the difference between theory and practice, look at the Red Hat Enterprise Linux supported system limits page at www.redhat.com/rhel/compare/. Here you can see both the theoretical limits and the certified system limits. The theoretical limits are the architectural design limits of each version of Red Hat Enterprise Linux–the values that can't be exceeded. The certified system limits are the values that have been proven on actual production hardware running real applications.

There is a famous saying: "In theory, there is no difference between theory and practice. In practice there is." Red Hat applies both theory and practice to ensure that certified systems actually meet customer expectations.
As an example, consider what happened when systems supporting 128GB of memory became available and Red Hat increased the certified maximum memory in Red Hat Enterprise Linux 5 from 64GB to 128GB. Since both are well within the Red Hat Enterprise Linux 5 architectural limit of 1TB, this should not require more than plugging in the new, larger memory.

However, there are many subtle aspects of memory management. Going from 64GB to 128GB doubles the number of memory pages that must be managed. Unexpected performance issues can arise in different application workloads, requiring changes to the virtual memory subsystems. Memory management, buffer management, system utilities, and applications may all need to be tuned or modified to provide the expected performance with the larger memory. Without this tuning, applications may actually slow down with the larger memory.
With many years of experience supporting production workloads, Red Hat understands that raising the architectural limits is the starting point, not the end point, in supporting large systems. This causes Red Hat to be somewhat conservative when new hardware comes out–it isn't enough for it to just work; it has to work right and meet customer expectations. The only way to achieve this is with real hardware, real applications, real workloads, and solid engineering.

This means that Red Hat Enterprise Linux 6 will be introduced with some certified system limits set below the architectural limits, and that the certified system limits will grow as larger systems become available and Red Hat has the opportunity, together with its OEM partners, to do the engineering and tuning work required to make them work well.

Let's look at some of the changes in Red Hat Enterprise Linux 6 to support the new hardware capabilities and customer demands, focusing on the x86-64 architecture.
Scalability Drivers

The big drivers of scalability are 64-bit support, processor count, memory size, I/O, and resource management. Red Hat Enterprise Linux 6 addresses each of these areas.
64-bit Support

Today's processors are "64-bit", which should give them the ability to address 18 exabytes of memory (1.845 x 10^19 bytes). However, for a variety of reasons, neither hardware nor software actually supports this much memory.
Red Hat Enterprise Linux 5 supports up to 1TB (theoretical) and 256GB (certified). Some systems have been certified with more memory than the certification limit on an exception basis–this requires additional testing and validation and is handled on a case-by-case basis.

Red Hat Enterprise Linux 6 supports up to 64TB of memory (theoretical). The initial certified limits will be smaller than this; like Red Hat Enterprise Linux 3, 4, and 5 before it, the Red Hat Enterprise Linux 6 certified memory limits will grow over time. Ongoing development and tuning will be performed to ensure effective use of large (and expensive) memory.
Examples of this work are huge pages and transparent huge pages (both implemented in Red Hat Enterprise Linux 6). To explain these, consider how memory is managed in terms of blocks of memory, known as pages. These memory pages are traditionally 4096 bytes. This means that 1MB of memory is made up of 256 pages. Likewise, 1GB of memory is made up of 262,144 pages, and 1TB of memory is over 268 million pages. There is a memory management unit built into the hardware that contains a list of these pages, with each page referenced through a page table entry.
Hardware and memory management algorithms that work well with thousands of pages (megabytes of memory) have difficulty performing well with millions or billions of pages. This is especially critical since the hardware memory management unit in modern processors only supports hundreds or thousands of page table entries–when more memory pages are used, the system falls back to slower software-based memory management.

Effectively managing large amounts of memory requires either increasing the number of page table entries in the hardware memory management unit, which is expensive and causes other performance issues, or increasing the page size. The most popular larger sizes are 2MB and 1GB, which are commonly referred to as huge pages. The 2MB pages scale well to multiple gigabytes of memory, and the 1GB pages scale well to terabytes of memory.
While these larger pages help with large amounts of memory, they require changes to system management and some applications. Huge pages must be assigned when the system is booted, are difficult to manage, and often require significant changes to applications to be used effectively.

Transparent huge pages (THP) implement huge pages but automate most aspects of creating, managing, and using them. Thus, THP hides much of the complexity of using huge pages from the system administrator and application developers. Since THP is aimed at performance, the THP developers have done considerable work to test and tune across a wide range of systems, system configurations, applications, and workloads. While special system tuning still produces the best results, transparent huge pages provide excellent performance improvements even with stock settings. Again, note the difference between theory and practice.
An additional complexity is that many new systems are built with NUMA (Non-Uniform Memory Access). While NUMA greatly simplifies designing and building the hardware for large systems, it makes life more challenging for operating system engineers and application programmers.
The big change is that memory on a NUMA system may now be local or remote, and it can take several times longer to access remote memory than local memory. This has many performance implications that impact operating system design, applications, and system management.

System designers have concluded that NUMA is the only feasible approach to building large systems–or even medium-sized systems–so we can expect to see many more NUMA systems during the lifetime of Red Hat Enterprise Linux 6.
Much work has gone into Linux–especially in Red Hat Enterprise Linux 6–to optimize for NUMA and to provide tools to manage users and applications on NUMA systems. This includes system changes such as CPU affinity, which tries to prevent an application from unnecessarily moving between NUMA nodes. This significantly improves performance. Another tool is CPU pinning, which allows a program or a system administrator to bind a running application to a specific CPU or set of CPUs. These tools, and others, make the difference between a large system that runs well and one that runs poorly.
Processor Count

Red Hat Enterprise Linux 5 supports up to 255 processors (theoretical) and 64 processors (certified). Red Hat Enterprise Linux 6 supports up to 4,096 processors (theoretical). Note that, from an operating system perspective, a core or a hyperthread counts as a processor.

The operating system requires a set of information on each processor in the system. Through Red Hat Enterprise Linux 5 this is done in a simple way, by allocating a fixed-size array in memory containing the information for all processors. Information on an individual processor is obtained by indexing into that array. This is fast, easy, and straightforward. For relatively small numbers of processors it works very well. For larger numbers of processors it has significant overhead.
The fixed-size array in Red Hat Enterprise Linux 5 holds 255 processors and is a single shared resource. This can become a bottleneck if large numbers of processes on large numbers of processors need to access it at the same time. It is also inflexible: adding a single new item to the processor information becomes very difficult, and may be impossible without breaking the existing interfaces.
Red Hat Enterprise Linux 6 addresses this by moving to a dynamic list structure for processor information. The list is allocated dynamically–if there are only eight processors in the system, only eight entries are created in the list. If there are 2,048 processors, 2,048 entries are created.
The list structure allows a finer granularity of locking–if, for example, information needs to be updated at the same time for processors 6, 72, 183, 657, 931, and 1,546, this can be done with greater parallelism. Situations like this obviously occur much more frequently on large systems than small systems.

Further, a number of changes and extensions can now be made to the processor information without breaking application compatibility.
This is a major change that has required a lot of work to implement. In addition to implementing the new list structure, every component that touches the processor information had to be updated. Extensive testing was required. And, of course, performance tuning was needed.

The net result of all these changes is that Red Hat Enterprise Linux 6 can support the new systems planned over the next several years. And we have the foundation for supporting truly large systems, if processor manufacturers decide to start building 64,000-processor systems.
Ticket Spinlocks

As mentioned, NUMA system architecture greatly simplifies hardware design, but places new demands on software. A key part of any system design is ensuring that one process doesn't change memory being used by another process. Data corruption and system crashes are the inevitable result of uncontrolled changing of data. This is done by allowing a process to lock a piece of memory, perform an operation, and then unlock the memory (or free the lock). A common method for doing this is a spin lock, where a process will keep checking to see if a lock is available and take the lock as soon as it becomes available. If there are multiple processes competing for the same lock, the first one to request the lock after it has been freed gets it. When all processes have the same access to memory, this approach is fair and works quite well.
Unfortunately, on a NUMA system, not all processes have equal access to the locks. Processes on the same NUMA node as the lock have an unfair advantage in obtaining it, while processes on remote NUMA nodes experience lock starvation and degraded performance.

Red Hat Enterprise Linux 6 addresses this issue through a mechanism called ticket spinlocks, which adds a reservation queue to the lock. This means that processes that need to take the lock essentially "get in line" and are allowed to take the lock in the order that they requested it. Timing problems and unfair advantages in requesting the lock are eliminated. While a ticket spinlock has slightly more overhead than an ordinary spinlock, it is much more scalable and provides better performance on NUMA systems.
Tickless Kernel

Another major advance in Red Hat Enterprise Linux 6 is the tickless kernel. Previous versions used a timer-based kernel, which had a clock running that produced a system interrupt, or timer tick, several hundred or several thousand times a second (depending on what the timer is set to), even when the system has nothing to do. Each time the timer produces an interrupt, the system polls–it looks around to see if there is any work to do. If so, it does it. If not, it goes back to sleep.

On a lightly loaded system, this impacts power consumption by preventing the processor from effectively using sleep states. The system uses the least power when it is in a sleep state. There are several sleep states, with the deeper sleep states requiring even less power. However, the sleep states require time and power for the system to enter and leave them.

The most efficient way for a system to operate is to do work as quickly as possible and then go into as deep a sleep state as possible and sleep for as long as possible. But it is very difficult to get a good night's sleep when the system timer is constantly waking you up.

The answer is to remove the interrupt timer from the idle loop and go to a completely interrupt-driven environment. This allows the system to go into deep sleep states when it has nothing to do, and respond quickly when there is something to do. This removal of timer ticks from the idle loop produces what is called the tickless kernel.
Split LRU VM

One of the secrets to performance is to keep things that may be used close to the processor. All actual work is done in the CPU registers, so the goal is to make sure that the next instruction or piece of data needed by the CPU can be loaded into the registers as quickly as possible. Since there is a trade-off between speed and size, there is a hierarchy of progressively larger–but slower–storage. This hierarchy typically goes: first-level cache (L1 cache), second-level cache (L2 cache), third-level cache (L3 cache, included on many new processors), main memory on the local NUMA node, main memory on a remote NUMA node, local storage (disk), and then remote storage. Data available in the first-level cache can be accessed millions of times faster than data on remote storage–but remote storage may be millions of times larger than the first-level cache.
Much of the memory on a running system is used for system-managed caches–copies of information kept where they can be accessed in microseconds rather than milliseconds. Since the system usually doesn't know what piece of information will be needed next, a variety of algorithms are used to predict what will be needed. This includes keeping in memory information that has previously been used–statistically, if a piece of information has been used once, it is likely to be used again. Data that is written to disk is held in memory until it has physically been written to disk, and the copy is left in memory as long as the memory isn't needed for another purpose–many applications will write data to disk and then promptly read it back to make further changes. Other algorithms will read ahead from storage; if some information has been read from a file, there is a high probability that the next data needed will be the next data in the file.
Since there is only a finite amount of memory available, the system quickly gets to the point where existing data must be discarded before new data can be placed in memory. This presents no problems, since this data is a cache (copy) of the original data. The only impact of discarding it is that it will take additional time to access it if it is needed again.

Thus, a key part of system performance is how well the system predicts what data is likely to be used next, brings in new data (sometimes before it is needed), and gets rid of old cache data to make room for new data.
One of the most powerful algorithms for determining what data to discard is to get rid of the data that hasn't been used in the longest time–the least recently used (LRU). This approach keeps track of when each piece of data in the various caches is used. This is done by tagging pages of memory, and updating the tag each time data in that page is used (read or written). The system can then scan through the cache pages of memory, discard the pages that haven't been used in a long time (evict the pages), and replace them with newer data–a process called page replacement. Or, if an application requests more memory, the LRU algorithm is an excellent way to decide which pages can be discarded to free up memory for the application. LRU is one of the core algorithms in Linux virtual memory management, and is a vital element of system performance.
As powerful as the LRU algorithm is, there is room for performance improvement. The old page replacement mechanism had two major issues. First, it would sometimes evict the wrong pages, causing these pages to have to be read in again. This would often occur when the pages that should be evicted were hidden behind other pages in the LRU list, causing the system to evict the pages it could find, rather than the pages it should evict.
The second issue is that the system would repeatedly scan over pages that should not be evicted. For example, on a system with 80GB of anonymous pages, 10GB of page cache, and no swap, the old algorithms would scan the 80GB of anonymous pages over and over again to get at the page cache. This results in catastrophic CPU utilization and lock contention on systems with more than 128GB of memory.
Starting from the premise that not all data is equal, a Red Hat engineer implemented a set of patches that handle different types of pages differently and find pages that can be evicted with minimal scanning. These patches were, of course, pushed upstream and accepted into the Linux kernel before being included in Red Hat Enterprise Linux 6. The result is the Split LRU VM (split least-recently-used virtual memory manager).
The Split LRU VM uses several lists of memory pages instead of a single, monolithic memory manager. These include separate page lists for filesystem-backed data (the master data exists in a file in the storage subsystem and can be read again whenever needed), swap-backed data (the VM can page out memory to disk and read it back in when needed), and non-reclaimable pages (pages that cannot be discarded by the VM).
There are also significant improvements to locking, making the system more scalable for large numbers of processors and large amounts of memory.

The end result is that the Split LRU VM in Red Hat Enterprise Linux 6 delivers a significant improvement in system performance, especially for large systems.
Control Groups

We have already noted how small systems are becoming large systems, with a two-socket server or blade now including 32 CPUs–and growing. Many simple approaches to managing system resources that worked fine with one processor–or even four processors–do not work well with 32 or more processors.

Red Hat Enterprise Linux provides many options for system tuning that work quite well. Large systems, scaling to hundreds of processors, can be tuned to deliver superb performance. But tuning these systems requires considerable expertise and a well-defined workload. When large systems were expensive and few in number, it was acceptable to give them special treatment. Now that these systems are mainstream, more effective tools are needed.
Further complicating the situation is the trend to use these more powerful systems for consolidation, placing the workloads that may have been running on four to eight older servers onto a single new server.

Let's explore the situation for a moment. A system with a single CPU can be effectively utilized with a single process. A system with four CPUs requires at least four processes to take advantage of it and avoid wasting system resources. A system with 32 CPUs requires a minimum of 32 processes (one per CPU), and is likely to need several hundred processes to keep the overall system reasonably busy.

Many modern applications are designed for parallel processing, and use multiple threads or processes to improve performance. However, few applications can make effective use of more than eight to ten threads or processes. Thus, multiple applications typically need to be installed on a 32-CPU system to keep it busy.
Altogether, we are at a place where small, inexpensive mainstream systems have all the capabilities of the large systems of a few years ago; multiple applications are being consolidated onto a single server; it isn't cost-effective to spend the same amount of expertise and tuning that was previously dedicated to supporting the large systems; and application workloads have become much more variable.

Further, some resources–such as disk I/O and network communications–are shared resources that are not growing as fast as CPU count. The result is that an application, or a process within an application, can consume excessive resources and degrade the performance of the whole system.

Add virtualization into this mix, and you have an urgent need for better ways to control your systems.
Control groups, or cgroups, are a method for combining sets of tasks and allocating and managing the amount of resources that they are able to consume. For example, you can take a database application and give it 80 percent of four CPUs, 60GB of memory, and 40 percent of disk I/O into the SAN. A web application running on the same system could be given two CPUs, 2GB of memory, and 50 percent of available network bandwidth.

The result is that both applications deliver good performance and do not excessively consume system resources. Further, the system is significantly self-tuning, in that changes in workload are not likely to significantly degrade performance.
Cgroups do this in three phases. First, a cgroup is created and a task or set of tasks is assigned to it. These tasks run within the cgroup. Further, any tasks that they spawn are also in the cgroup. This means that the entire application can be managed as a unit.

Second, a set of resources is allocated to the cgroup. These resources include cpusets, memory, I/O resources, and network resources.
Cpusets allow assigning a number of CPUs, setting affinity for specific CPUs or nodes (a node is generally defined as a set of CPUs or cores in a socket), and the amount of CPU time that can be consumed. Cpusets are vital for making sure that a cgroup provides good performance, that it does not consume excessive resources at the cost of other tasks, and that it is not starved for the CPU resources it needs.
I/O bandwidth and network bandwidth are managed by other resource controllers. Again, the resource controllers allow you to determine how much bandwidth the tasks in a cgroup can consume, and ensure that the tasks in a cgroup neither consume excessive resources nor are starved for resources.

The result is that an application developer or system administrator can define and allocate, at a high level, the system resources that various applications need and will consume. The system then automatically manages and balances the various applications, delivering good, predictable performance and optimizing the performance of the overall system.
File Systems

When Red Hat Enterprise Linux 5 first shipped, 500GB drives had just been introduced and most drives were in the 100-200GB range. Most servers at that time would have storage in the 1-2TB range. Today 2TB drives are common, 3TB drives are becoming available, and 4TB drives are expected to be widely available in 2011.

We can expect systems to commonly have tens of terabytes of storage–or more. Further, SAN-based storage can be expected to approach 100TB.
As with other areas, filesystems have both theoretical and practical aspects of scaling, which we will explore in some detail.
The EXT Filesystem Family

Ext3, or the third extended filesystem, is the long-standing default filesystem in Red Hat Enterprise Linux. It was introduced in Red Hat Enterprise Linux 2.1 and has been the default filesystem for all subsequent releases through Red Hat Enterprise Linux 5. It is well-tuned for general-purpose workloads. Ext3 has long been the most common filesystem for enterprise distributions, and many applications have been developed on Ext3.

Ext3 supports a maximum filesystem size of 16TB, but practical limits may be lower. Even on a 1TB S-ATA drive, the run time of the Ext3 filesystem repair utility (fsck), which is used to verify and repair the filesystem after a crash, is extremely long. For many users that require high availability, this can further reduce the maximum feasible size of an Ext3 filesystem to 2-4TB of storage.
Ext4 is the fourth generation of the EXT filesystem family, and is the default in Red Hat Enterprise Linux 6. It supports a maximum filesystem size of one exabyte, and a maximum single-file size of 16TB. Ext4 adds several new features:

• Extent-based metadata
• Delayed allocation
• Journal check-summing
• Support for large storage
Extent-based allocation is a more compact and efficient way to track utilized space in a filesystem. This improves filesystem performance and reduces the space consumed by metadata.

Delayed allocation allows the filesystem to put off selecting the permanent location for newly written user data until the data is flushed to disk. This enables higher performance, since it allows the filesystem to make this decision with much better information.
Additionally, filesystem repair time (fsck) in Ext4 is much faster than in Ext2 and Ext3. Some filesystem repairs have demonstrated a six-fold speedup.

While the Ext4 filesystem itself is theoretically capable of supporting huge amounts of storage, the maximum supported limit in Red Hat Enterprise Linux 6 is 16TB. Work is needed across a range of system tools, such as fsck and performance tuning tools, to take advantage of the additional capacity of Ext4.

For filesystems larger than 16TB, we recommend using a scalable, high-capacity filesystem such as XFS.
The XFS Filesystem Family

XFS is a robust and mature 64-bit journaling filesystem that supports very large files and filesystems on a single host. As we mentioned above, journaling ensures filesystem integrity after system crashes–for example, due to power outages–by keeping a record of filesystem operations that can be replayed when the system is restarted and the filesystem remounted. XFS was originally developed in the early 1990s by SGI and has a long history of running on extremely large servers and storage arrays. XFS supports a wealth of features, including, but not limited to:

• Delayed allocation
• Dynamically allocated inodes
• B-tree indexing for scalability of free space management
• Ability to support large numbers of concurrent operations
• Extensive run-time metadata consistency checking
• Sophisticated metadata read-ahead algorithms
• Tightly integrated backup and restore utilities
• Online defragmentation
• Online filesystem growing
• Comprehensive diagnostics capabilities
• Scalable and fast repair utilities
• Optimizations for streaming video workloads
While XFS scales to exabytes, Red Hat's maximum supported XFS filesystem image is 100TB. Given its long history in environments that require high performance and scalability, it is not surprising that XFS is routinely measured as one of the highest-performing filesystems on large systems with enterprise workloads–for instance, a system with a relatively high number of CPUs, multiple HBAs, and connections to external disk arrays. XFS also performs well on smaller systems that have a multi-threaded, parallel I/O workload. XFS has relatively poor performance for single-threaded, metadata-intensive workloads–for example, a workload that creates or deletes large numbers of small files in a single thread.

XFS is available with the Scalable File System Add-On for Red Hat Enterprise Linux 6.
Shared Storage Filesystems

Shared storage filesystems, sometimes referred to as cluster filesystems, give each server in the cluster direct access to a shared block storage device over a local Storage Area Network (SAN). Shared storage filesystems work on a set of servers that are all members of a cluster. Unlike network filesystems such as NFS, no single server provides access to data or metadata to other members: each member of the cluster has direct access to the same storage device (the "shared storage"), and all cluster member nodes access the same set of files.
Cache coherency is paramount in a clustered filesystem to ensure data consistency and integrity. There must be a single version of all files in a cluster that is visible to all nodes within the cluster. In order to prevent members of the cluster from updating the same storage block at the same time, which causes data corruption, shared storage filesystems use a cluster-wide locking
mechanism to arbitrate access to the storage. For example, before creating a new file or writing
to a file that is opened on multiple servers, the filesystem component on the server must obtain
the correct lock.
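GFS2 coordinates this through a cluster-wide distributed lock manager (DLM) spanning all nodes. As a single-host analogy only, the sketch below illustrates the same "obtain the lock before writing" discipline using POSIX advisory locks; `flock()` arbitrates between processes on one machine, not across a cluster:

```python
import fcntl
import tempfile

# Single-host analogy of the "obtain the lock before writing" pattern.
# A real cluster filesystem such as GFS2 uses a distributed lock manager
# spanning all nodes; flock() below only arbitrates between processes on
# one machine, but the discipline is the same.

def locked_append(path, data):
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # block until we hold the exclusive lock
        try:
            f.write(data)                  # safe: no other locker can write now
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # release so other writers may proceed

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    path = tmp.name

locked_append(path, "row-1\n")
locked_append(path, "row-2\n")
print(open(path).read())  # both writes present, never interleaved
```

Without the lock, two concurrent writers could update the same block at the same time; with it, updates are strictly serialized.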
The most common use of cluster filesystems is to provide a highly available distributed service--
for example, an Apache web server. Any member of the cluster will see a fully coherent view of
the data stored in the global filesystem, and all data updates will be managed correctly by the
distributed locking mechanisms.
Cluster filesystems perform well with workloads where each node writes primarily to non-shared
files or where shared files are almost entirely read-only. An example of the first case would be
a scientific data capture application, where each node is reading a separate stream of data and
writing this to a file that everyone can read. An example of the second case would be a web
service where multiple nodes are reading a shared database.
Red Hat Enterprise Linux 6 provides the GFS2 clustered filesystem, available with the Resilient
Storage Add-On, which is tightly integrated with Red Hat Enterprise Linux High Availability
clustering, available with the High Availability Add-On. This provides excellent support for
high-performance, high-availability, scalable, mission-critical applications.
Large Boot Drives
As previously mentioned, 2TB drives are widely available now, 3TB drives are becoming
available, 4TB drives will be here soon, and drive vendors continue to invest in new technology.
Further, widespread use of RAID means that it is common to combine 8-12 physical drives into a
single large and reliable logical drive.

Through Red Hat Enterprise Linux 5, the largest boot drive that can be supported is slightly over
2.2TB. This is due to the limitations of the BIOS and its master boot record (MBR) based disk
partitioning. Larger partitions can be used with GPT (GUID Partition Table), a modern replacement
for MBR, which allows partitions up to 9.4 zettabytes (9.4 × 10²¹ bytes). Red Hat Enterprise
Linux 5 supports the use of GPT on data disks, but BIOS-based booting requires the use of the
MBR, thus limiting Red Hat Enterprise Linux 5 to 2.2TB boot partitions.
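The 2.2TB ceiling falls out of simple arithmetic: MBR records partition start and length as 32-bit sector counts, and legacy drives use 512-byte logical sectors. A quick check of both limits:

```python
# Why MBR tops out near 2.2TB: partition start and length are stored as
# 32-bit logical block addresses, and legacy drives use 512-byte sectors.
SECTOR_SIZE = 512            # bytes per logical sector on legacy drives
MAX_SECTORS = 2 ** 32        # largest count a 32-bit MBR field can hold

max_bytes = SECTOR_SIZE * MAX_SECTORS
print(max_bytes)                         # 2199023255552
print(round(max_bytes / 1e12, 2), "TB")  # ~2.2 TB (decimal terabytes)

# GPT uses 64-bit LBAs, so the same arithmetic gives 2^64 sectors:
gpt_max = SECTOR_SIZE * 2 ** 64
print(round(gpt_max / 1e21, 1), "ZB")    # ~9.4 zettabytes
```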
Red Hat Enterprise Linux 6 supports both BIOS and UEFI (Unified Extensible Firmware
Interface). UEFI is a replacement for BIOS, which is designed to support new and emerging
hardware. The BIOS was created for the original IBM PC. While it has evolved considerably, the
BIOS is becoming problematic for modern hardware. Red Hat Enterprise Linux 6 on systems with
UEFI allows use of GPT and larger than 2.2TB partitions for both the boot partition and data
partitions.
RAS
In addition to the performance aspects of scalability, it is also important to look at RAS--
reliability, availability, and serviceability. RAS ensures that a system functions correctly,
provides services when needed, and can be maintained or repaired easily with little impact on
operation.

The goal of reliability is to ensure that the system is functioning correctly and does not deliver
erroneous results. Several approaches are used to enhance reliability. The first of these is to
avoid errors by building reliable hardware and software. A good starting point is specifying
high-quality components, especially for power supplies, fans, capacitors, memory, and connectors
(CPUs generally do not have different quality grades). Operational factors are also important,
such as running a processor below its maximum speed and ensuring good cooling and
well-conditioned power. A highly reliable operating system such as Red Hat Enterprise Linux is
vital.
There are limitations on how reliable any single piece of hardware can be made, so the next step
is to anticipate failures when designing hardware--to detect errors, recover from errors, and
continue. An example of this is the ECC memory used in servers. ECC memory has the ability to
detect multiple bit errors and to correct single bit errors and continue, allowing the system to
continue to run with many types of memory errors. Today, ECC-style approaches are used inside
the CPU, across I/O channels, and even across networks.
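The principle behind ECC can be shown with a toy Hamming(7,4) code: parity bits computed over overlapping subsets of the data bits let the decoder not only detect a single flipped bit but locate and repair it. (Real server ECC uses wider SECDED codes, typically 72 stored bits per 64 data bits; this is an illustrative sketch, not the actual DIMM encoding.)

```python
# Toy Hamming(7,4) code: the same principle ECC memory applies per word
# lets us locate and flip a single corrupted bit.

def encode(d):                      # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):                     # c: 7-bit codeword, possibly one bit flipped
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3 # 0 = clean, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1        # flip the bad bit back
    return [c[2], c[4], c[5], c[6]] # recover the 4 data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                      # simulate a single-bit memory fault
assert correct(stored) == word      # the error is located and corrected
```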
Another example is RAID storage. Disk drives are one of the most failure-prone parts of
computers. Engineers proposed that the best way to deal with disk failures was to use redundant
disks, so that the failure of a disk would not cause any loss of data. Furthermore, with proper
design, the system could continue running while the bad disk was replaced.
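The redundancy idea is easy to demonstrate in miniature. In a RAID-5-style layout, a parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors:

```python
# RAID-5-style redundancy in miniature: the parity block is the XOR of the
# data blocks, so any single lost block can be rebuilt from the others.
from functools import reduce

def xor_blocks(blocks):
    # XOR equal-length byte blocks together, byte by byte.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

disks = [b"AAAA", b"BBBB", b"CCCC"]   # data "disks" (equal-size blocks)
parity = xor_blocks(disks)            # stored on a fourth disk

# Disk 1 fails; rebuild its contents from the survivors plus parity.
rebuilt = xor_blocks([disks[0], disks[2], parity])
assert rebuilt == disks[1]            # no data was lost
```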
Perhaps the ultimate example is fault-tolerant computing, where every element of the computer
from CPU and memory to I/O and storage is duplicated. Not only is every component duplicated,
but all operations are synchronized and performed in lock step. Any hardware that fails is
automatically turned off, and the duplicate hardware continues operation.
While these examples can be implemented in hardware, the true power of RAS is delivered
when hardware, operating systems, and applications cooperate to deliver a robust system.
Considerable work has gone into Red Hat Enterprise Linux 6 by both Red Hat and hardware
vendors to enhance RAS support.
A simple example of this is stateless operations, such as web servers. It is common to use
multiple web servers, for both performance and availability. If a web server fails, requests are
sent to other web servers. If a web server fails while servicing a request, a simple page refresh
in the web browser will re-run the transaction.
If it isn't possible to correct the error and continue, it is critical to detect the error and
prevent it from propagating. If this is not done, faulty results and corrupted data can occur.
Detecting a hardware error is called a machine check, and dealing with hardware errors is the
responsibility of the machine check architecture (MCA). MCA provides the framework for
reporting, logging, and handling hardware errors.
For many types of errors, the system can retry the operation that failed. If the retry succeeds,
the system can continue operation. Of course, the error should be logged, to give you the
opportunity to identify trends and fix problems before they occur.
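The retry-then-log pattern can be sketched in a few lines. The `flaky_read` below is a hypothetical stand-in for re-reading a marginal memory location or I/O channel, not a real kernel interface:

```python
# Sketch of "retry, then log" error handling: transient faults are retried,
# and every occurrence is recorded so trends can be spotted early.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("mca-sketch")

def retry(operation, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except IOError as err:
            log.warning("attempt %d failed: %s", attempt, err)  # keep a trend record
    raise RuntimeError("operation failed after %d attempts" % attempts)

# Hypothetical flaky operation that succeeds on the third try.
state = {"calls": 0}
def flaky_read():
    state["calls"] += 1
    if state["calls"] < 3:
        raise IOError("transient parity error")
    return 0xBEEF

assert retry(flaky_read) == 0xBEEF   # transient fault survived
assert state["calls"] == 3           # two failures were logged along the way
```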
The simplest way to deal with uncorrectable hardware errors is to immediately stop the system
when an error occurs--to crash the system. While this is a brute-force approach, it is better
than allowing errors to continue. This allows work to fail over to another system, to reboot the
system, or to repair the system and then reboot it. The MCA logs and crash dumps provide
valuable tools for determining why the system crashed, whether it was a hardware or software
problem, and input on how to fix the system.
More advanced versions of MCA, implemented in new hardware and Red Hat Enterprise Linux 6,
implement a fine-grained approach. Instead of crashing the entire system, they can disable
specific hardware components--CPUs, sections of memory, NICs, storage controllers, or other
components. These components will not be used until they are returned to service. This allows
the system to continue operating even with hardware failures.
As an example, a failure might occur with a NIC in a PCIe hot-swap slot. Further, assume that
there are two NICs in the system which are bonded together so that they share network traffic
and appear to the system as a single logical NIC. (Bonded NICs can double the network
bandwidth available from a single NIC.) The system (hardware plus Red Hat Enterprise Linux 6)
would mark the NIC as having an error and disable it. The error would be reported to the system
administrator. The NIC could be removed and replaced with a new one, while the system was
running. Finally, the new NIC could be started and added back to the system. During this entire
process, the system would continue to function normally, with the only visible sign being a
reduction in network performance. This case requires support from system hardware, NIC
hardware, NIC driver, and the operating system. It also, of course, requires support by the
system administrator and repair team.
A similar situation applies to multi-path storage. It is common to have two or more paths into
a SAN, for reliability and performance. The multi-path I/O subsystem is designed to accommodate
failures in the fibre channel links, switches, and storage controllers. As described above for
NICs, a hot-swap fibre channel adapter experiencing a hardware error can be identified, stopped,
replaced, and re-started with no impact on system operation.
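The fail-over/restore behavior described for bonded NICs and multi-path storage can be sketched abstractly. This toy model (the `MultiPath` class and path names are illustrative, not the device-mapper multipath API) shows traffic moving to a surviving path and a repaired path being hot-added back, all without interrupting the caller:

```python
# Abstract sketch of path failover, as the multipath I/O layer (or a bonded
# NIC pair) behaves: I/O moves to a surviving path when one fails, and a
# repaired path can be restored, without interrupting the service using it.

class MultiPath:
    def __init__(self, paths):
        self.healthy = dict.fromkeys(paths, True)

    def send(self, data):
        for path, ok in self.healthy.items():  # first healthy path wins
            if ok:
                return (path, data)
        raise IOError("all paths failed")

    def fail(self, path):                      # error detected: disable the path
        self.healthy[path] = False

    def restore(self, path):                   # path repaired/replaced: hot-add it back
        self.healthy[path] = True

mp = MultiPath(["fc0", "fc1"])
assert mp.send("io")[0] == "fc0"   # normal operation uses the first path
mp.fail("fc0")                     # adapter error: path disabled, service continues
assert mp.send("io")[0] == "fc1"
mp.restore("fc0")                  # replacement adapter added back while running
assert mp.send("io")[0] == "fc0"
```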
Other types of problems have been more difficult to deal with. Failures in CPUs and memory
have traditionally brought the entire system down. However, this isn't always necessary.
Especially on larger systems, there is a good possibility that a CPU or memory failure will only
impact one (or a few) applications, not the operating system itself. In this case it is feasible
to stop ("crash") only the applications affected by the failure, mark the failing hardware so
that it won't be used, and continue running the system.

In many cases, the affected applications can be immediately restarted with minimal downtime.
The final part of the RAS story is for the system to provide more details on exactly what and
where the hardware problem is. Today's systems may have over 100 memory modules, dozens
of I/O controllers, and dozens of disk drives. Traditional repair approaches, such as swapping
out memory until the problem goes away, simply aren't sufficient.
This is why the new systems don't just report "memory error" or "disk error." Instead they will
report "uncorrectable memory errors in DIMM slots 73, 74, and 87" or "fatal error in device e2,
PCI slot 7." This information tells you exactly which devices have failed and need to be
replaced and where they are located.
The combination of new technologies in processors, I/O subsystems, I/O devices, drivers, and
Red Hat Enterprise Linux 6 delivers systems that are more reliable, have higher availability,
continue to operate in the presence of errors which would crash earlier systems, and can be
repaired more quickly--all critical components of an enterprise system.
SALES AND INQUIRIES

LATIN AMERICA
+54 11 4329 7300
www.latam.redhat.com

NORTH AMERICA
1-888-REDHAT1
www.redhat.com

EUROPE, MIDDLE EAST AND AFRICA
00800 7334 2835
www.europe.redhat.com

ASIA PACIFIC
+65 6490 4200
www.apac.redhat.com
ABOUT RED HAT

Red Hat was founded in 1993 and is headquartered in Raleigh, NC. Today, with more than 60
offices around the world, Red Hat is the largest publicly traded technology company fully
committed to open source. That commitment has paid off over time, for us and our customers,
proving the value of open source software and establishing a viable business model built around
the open source way.
Copyright © 2010 Red Hat, Inc. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, and RHCE are trademarks of Red Hat, Inc., registered in the U.S. and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
www.redhat.com #4276437_1010