8/6/2019 RHEL6_Scalability_WP
http://slidepdf.com/reader/full/rhel6scalabilitywp 1/14
www.redhat.com
Whitepaper
RED HAT ENTERPRISE LINUX 6 SCALABILITY

EXECUTIVE SUMMARY

Scalability is one of the major areas of focus in Red Hat® Enterprise Linux® 6. The importance of scalability is driven by the convergence of several factors, which have combined to create a strong customer demand for a high-performance, production-ready Linux infrastructure.

Red Hat Enterprise Linux is now enterprise-proven as the environment for the most demanding enterprise applications. Red Hat Enterprise Linux has demonstrated the power, reliability, and robustness demanded by production environments. A good example of this is the increasing use of Red Hat Enterprise Linux to run SAP, one of the most demanding enterprise application suites.
Customer demand for large Linux systems is driven by the migration from existing large UNIX systems running on proprietary RISC architectures to Linux running on standard hardware. This is driving the demand for large, highly capable Linux systems to replace aging large, highly capable UNIX systems. Another factor is growing demand for IT resources at virtually all companies. Most of this demand is for large numbers of small to mid-range systems, either blade servers or rack-mounted servers.
Primarily based on the x86-64 architecture, these two-socket or four-socket systems now include 8-64 processors or cores, support a terabyte of memory, and include high-performance I/O. Today, a single processor contains 4-16 cores, and this will continue to grow over the next several years. What was an extremely large system a few years ago is a mid-range system today, and will be an entry-level system in a few more years. Further, the large numbers of these systems create challenges for networking, storage, and management.

Thus, Red Hat Enterprise Linux 6 must support the hardware capabilities and customer demands of today while being positioned to support the greatly enhanced hardware capabilities, and customer demands, of the coming decade.
TABLE OF CONTENTS

Scalability: Theory and Practice
Scalability Drivers
64-bit Support
Processor Count
Ticket Spinlocks
Tickless Kernel
Split LRU VM
Control Groups
File Systems
Shared Storage File Systems
RAS
Scalability: Theory and Practice

There are two components to scalability: the architectural limits of the operating system, and how well the system runs production workloads. While architectural limits get the most publicity, the real question is how well the system runs production workloads.

To see the difference between theory and practice, look at the Red Hat Enterprise Linux supported system limits page at www.redhat.com/rhel/compare/. Here you can see both the theoretical limits and the certified system limits. The theoretical limits are the architectural design limits of each version of Red Hat Enterprise Linux–the values that can't be exceeded. The certified system limits are the values that have been proven on actual production hardware running real applications.

There is a famous saying: "In theory, there is no difference between theory and practice. In practice there is." Red Hat applies both theory and practice to ensure that certified systems actually meet customer expectations.
As an example, consider what happened when systems supporting 128GB of memory became available and Red Hat increased the certified maximum memory in Red Hat Enterprise Linux 5 from 64GB to 128GB. Since both are well within the Red Hat Enterprise Linux 5 architectural limit of 1TB, this should not require more than plugging in the new, larger memory.

However, there are many subtle aspects of memory management. Going from 64GB to 128GB doubles the number of memory pages that must be managed. Unexpected performance issues can arise in different application workloads, requiring changes to the virtual memory subsystems. Memory management, buffer management, system utilities, and applications may all need to be tuned or modified to provide the expected performance with the larger memory. Without this tuning, applications may actually slow down with the larger memory.
With many years of experience supporting production workloads, Red Hat understands that raising the architectural limits is the starting point, not the end point, in supporting large systems. This causes Red Hat to be somewhat conservative when new hardware comes out–it isn't enough for it to just work; it has to work right and meet customer expectations. The only way to achieve this is with real hardware, real applications, real workloads, and solid engineering.

This means that Red Hat Enterprise Linux 6 will be introduced with some certified system limits set below the architectural limits, and that the certified system limits will grow as larger systems become available and Red Hat has the opportunity, together with its OEM partners, to do the engineering and tuning work required to make them work well.

Let's look at some of the changes in Red Hat Enterprise Linux 6 to support the new hardware capabilities and customer demands, focusing on the x86-64 architecture.
Scalability Drivers

The big drivers of scalability are 64-bit support, processor count, memory size, I/O, and resource management. Red Hat Enterprise Linux 6 addresses each of these areas.
64-bit Support

Today's processors are "64-bit", which should give them the ability to address 18 exabytes of memory (1.845 x 10^19 bytes). However, for a variety of reasons, neither hardware nor software actually supports this much memory.
Red Hat Enterprise Linux 5 supports up to 1TB (theoretical) and 256GB (certified). Some systems have been certified with more memory than the certification limit on an exception basis–this requires additional testing and validation and is handled on a case-by-case basis.

Red Hat Enterprise Linux 6 supports up to 64TB of memory (theoretical). The initial certified limits will be smaller than this; like Red Hat Enterprise Linux 3, 4, and 5 before it, the Red Hat Enterprise Linux 6 certified memory limits will grow over time. Ongoing development and tuning will be performed to ensure effective use of large (and expensive) memory.
Examples of this work are huge pages and transparent huge pages (both implemented in Red Hat Enterprise Linux 6). To explain these, consider how memory is managed in terms of blocks of memory, known as pages. These memory pages are traditionally 4096 bytes. This means that 1MB of memory is made up of 256 pages. Likewise, 1GB of memory is made up of 262,144 pages, and 1TB of memory is over 268 million pages. There is a memory management unit built into the hardware that contains a list of these pages, with each page referenced through a page table entry.
Hardware and memory management algorithms that work well with thousands of pages (megabytes of memory) have difficulty performing well with millions or billions of pages. This is especially critical since the hardware memory management unit in modern processors only supports hundreds or thousands of page table entries–when more memory pages are used, the system falls back to slower software-based memory management.

Effectively managing large amounts of memory requires either increasing the number of page table entries in the hardware memory management unit, which is expensive and causes other performance issues, or increasing the page size. The most popular larger sizes are 2MB and 1GB, which are commonly referred to as huge pages. The 2MB pages scale well to multiple gigabytes of memory, and the 1GB pages scale well to terabytes of memory.
While these larger pages help with large amounts of memory, they require changes to system management and some applications. Huge pages must be assigned when the system is booted, are difficult to manage, and often require significant changes to applications to be used effectively.

Transparent huge pages (THP) implement huge pages but automate most aspects of creating, managing, and using them. Thus, THP hides much of the complexity of using huge pages from the system administrator and application developers. Since THP is aimed at performance, the THP developers have done considerable work to test and tune across a wide range of systems, system configurations, applications, and workloads. While special system tuning still produces the best results, transparent huge pages provide excellent performance improvements even with stock settings. Again, note the difference between theory and practice.
An additional complexity is that many new systems are built with NUMA (Non-Uniform Memory Access). While NUMA greatly simplifies designing and building the hardware for large systems, it makes life more challenging for operating system engineers and application programmers.
The big change is that memory on a NUMA system may now be local or remote, and it can take several times longer to access remote memory than local memory. This has many performance implications that impact operating system design, applications, and system management.

System designers have concluded that NUMA is the only feasible approach to building large systems–or even medium-sized systems–so we can expect to see many more NUMA systems during the lifetime of Red Hat Enterprise Linux 6.
Much work has gone into Linux–especially in Red Hat Enterprise Linux 6–to optimize for NUMA and to provide tools to manage users and applications on NUMA systems. This includes system changes such as CPU affinity, which tries to prevent an application from unnecessarily moving between NUMA nodes. This significantly improves performance. Another tool is CPU pinning, which allows a program or a system administrator to bind a running application to a specific CPU or set of CPUs. These tools, and others, make the difference between a large system that runs well and one that runs poorly.
Processor Count

Red Hat Enterprise Linux 5 supports up to 255 processors (theoretical) and 64 processors (certified). Red Hat Enterprise Linux 6 supports up to 4,096 processors (theoretical). Note that, from an operating system perspective, a core or a hyperthread counts as a processor.

The operating system requires a set of information on each processor in the system. Through Red Hat Enterprise Linux 5 this is done in a simple way, by allocating a fixed-size array in memory containing the information for all processors. Information on an individual processor is obtained by indexing into that array. This is fast, easy, and straightforward. For relatively small numbers of processors it works very well. For larger numbers of processors it has significant overhead.
The fixed-size array in Red Hat Enterprise Linux 5 holds 255 processors and is a single shared resource. This can become a bottleneck if large numbers of processes on large numbers of processors need to access it at the same time. It is also inflexible: adding a single new item to the processor information becomes very difficult, and may be impossible without breaking the existing interfaces.
Red Hat Enterprise Linux 6 addresses this by moving to a dynamic list structure for processor information. The list is allocated dynamically–if there are only eight processors in the system, only eight entries are created in the list. If there are 2,048 processors, 2,048 entries are created.
The list structure allows a finer granularity of locking–if, for example, information needs to be updated at the same time for processors 6, 72, 183, 657, 931, and 1,546, this can be done with greater parallelism. Situations like this obviously occur much more frequently on large systems than small systems.

Further, a number of changes and extensions can now be made to the processor information without breaking application compatibility.
This is a major change that has required a lot of work to implement. In addition to implementing the new list structure, every component that touches the processor information had to be updated. Extensive testing was required. And, of course, performance tuning was needed.

The net result of all these changes is that Red Hat Enterprise Linux 6 can support the new systems planned over the next several years. And we have the foundation for supporting truly large systems, if processor manufacturers decide to start building 64,000-processor systems.
Ticket Spinlocks

As mentioned, NUMA system architecture greatly simplifies hardware design, but places new demands on software. A key part of any system design is ensuring that one process doesn't change memory being used by another process. Data corruption and system crashes are the inevitable result of uncontrolled changing of data. This is done by allowing a process to lock a piece of memory, perform an operation, and then unlock the memory (or free the lock). A common method for doing this is a spin lock, where a process will keep checking to see if a lock is available and take the lock as soon as it becomes available. If there are multiple processes competing for the same lock, the first one to request the lock after it has been freed gets it. When all processes have the same access to memory, this approach is fair and works quite well.
Unfortunately, on a NUMA system, not all processes have equal access to the locks. Processes on the same NUMA node as the lock have an unfair advantage in obtaining it, while processes on remote NUMA nodes experience lock starvation and degraded performance.

Red Hat Enterprise Linux 6 addresses this issue through a mechanism called ticket spinlocks, which adds a reservation queue to the lock. This means that processes that need to take the lock essentially "get in line" and are allowed to take the lock in the order that they requested it. Timing problems and unfair advantages in requesting the lock are eliminated. While a ticket spinlock has slightly more overhead than an ordinary spinlock, it is much more scalable and provides better performance on NUMA systems.
Tickless Kernel

Another major advance in Red Hat Enterprise Linux 6 is the tickless kernel. Previous versions used a timer-based kernel, which had a clock running that produced a system interrupt, or timer tick, several hundred or several thousand times a second (depending on what the timer is set to), even when the system has nothing to do. Each time the timer produces an interrupt, the system polls–it looks around to see if there is any work to do. If so, it does it. If not, it goes back to sleep.

On a lightly loaded system, this impacts power consumption by preventing the processor from effectively using sleep states. The system uses the least power when it is in a sleep state. There are several sleep states, with the deeper sleep states requiring even less power. However, the sleep states require time and power for the system to enter and leave them.

The most efficient way for a system to operate is to do work as quickly as possible and then go into as deep a sleep state as possible and sleep for as long as possible. But it is very difficult to get a good night's sleep when the system timer is constantly waking you up.

The answer is to remove the interrupt timer from the idle loop and go to a completely interrupt-driven environment. This allows the system to go into deep sleep states when it has nothing to do, and respond quickly when there is something to do. This removal of timer ticks from the idle loop produces what is called the tickless kernel.
Split LRU VM

One of the secrets to performance is to keep things that may be used close to the processor. All actual work is done in the CPU registers, so the goal is to make sure that the next instruction or piece of data needed by the CPU can be loaded into the registers as quickly as possible. Since there is a trade-off between speed and size, there is a hierarchy of progressively larger–but slower–storage. This hierarchy typically goes: first-level cache (L1 cache), second-level cache (L2 cache), third-level cache (L3 cache, included on many new processors), main memory on the local NUMA node, main memory on a remote NUMA node, local storage (disk), and then remote storage. Data available in the first-level cache can be accessed millions of times faster than data on remote storage–but remote storage may be millions of times larger than the first-level cache.
Much of the memory on a running system is used for system-managed caches–copies of information kept where they can be accessed in microseconds rather than milliseconds. Since the system usually doesn't know what piece of information will be needed next, a variety of algorithms are used to predict what will be needed. This includes keeping in memory information that has previously been used–statistically, if a piece of information has been used once, it is likely to be used again. Data that is written to disk is held in memory until it has physically been written to disk, and the copy is left in memory as long as the memory isn't needed for another purpose–many applications will write data to disk and then promptly read it back to make further changes. Other algorithms will read ahead from storage; if some information has been read from a file, there is a high probability that the next data needed will be the next data in the file.
Since there is only a finite amount of memory available, the system quickly gets to the point where existing data must be discarded before new data can be placed in memory. This presents no problems, since this data is a cache (copy) of the original data. The only impact of discarding it is that it will take additional time to access it if it is needed again.

Thus, a key part of system performance is how well the system predicts what data is likely to be used next, brings in new data (sometimes before it is needed), and gets rid of old cache data to make room for new data.
One of the most powerful algorithms for determining what data to discard is to get rid of the data that hasn't been used in the longest time–the least recently used (LRU). This approach keeps track of when each piece of data in the various caches is used. This is done by tagging pages of memory, and updating the tag each time data in that page is used (read or written). The system can then scan through the cache pages of memory, discard the pages that haven't been used in a long time (evict the pages), and replace them with newer data–a process called page replacement. Or, if an application requests more memory, the LRU algorithm is an excellent way to decide which pages can be discarded to free up memory for the application. LRU is one of the core algorithms in Linux virtual memory management, and is a vital element of system performance.
As powerful as the LRU algorithm is, there is room for performance improvement. The old page replacement mechanism had two major issues. First, it would sometimes evict the wrong pages, causing these pages to have to be read in again. This would often occur when the pages that should be evicted were hidden behind other pages in the LRU list, causing the system to evict the pages it could find, rather than the pages it should evict.
The second issue is that the system would repeatedly scan over pages that should not be evicted. For example, on a system with 80GB of anonymous pages, 10GB of page cache, and no swap, the old algorithms would scan the 80GB of anonymous pages over and over again to get at the page cache. This results in catastrophic CPU utilization and lock contention on systems with more than 128GB of memory.
Starting from the premise that not all data is equal, a Red Hat engineer implemented a set of patches that handle different types of pages differently and find pages that can be evicted with minimal scanning. These patches were, of course, pushed upstream and accepted into the Linux kernel before being included in Red Hat Enterprise Linux 6. The result is the Split LRU VM (split least-recently-used virtual memory manager).
The Split LRU VM uses several lists of memory pages instead of a single, monolithic memory manager. These include separate page lists for filesystem-backed data (the master data exists in a file in the storage subsystem and can be read again whenever needed), swap-backed data (the VM can page out memory to disk and read it back in when needed), and non-reclaimable pages (pages that cannot be discarded by the VM).
There are also significant improvements to locking, making the system more scalable for large numbers of processors and large amounts of memory.

The end result is that the Split LRU VM in Red Hat Enterprise Linux 6 delivers a significant improvement in system performance, especially for large systems.
Control Groups

We have already noted how small systems are becoming large systems, with a two-socket server or blade now including 32 CPUs–and growing. Many simple approaches to managing system resources that worked fine with one processor–or even four processors–do not work well with 32 or more processors.

Red Hat Enterprise Linux provides many options for system tuning that work quite well. Large systems, scaling to hundreds of processors, can be tuned to deliver superb performance. But tuning these systems requires considerable expertise and a well-defined workload. When large systems were expensive and few in number, it was acceptable to give them special treatment. Now that these systems are mainstream, more effective tools are needed.
Further complicating the situation is the trend to use these more powerful systems for consolidation, placing the workloads that may have been running on four to eight older servers onto a single new server.

Let's explore the situation for a moment. A system with a single CPU can be effectively utilized with a single process. A system with four CPUs requires at least four processes to take advantage of it and avoid wasting system resources. A system with 32 CPUs requires a minimum of 32 processes (one per CPU), and is likely to need several hundred processes to keep the overall system reasonably busy.

Many modern applications are designed for parallel processing, and use multiple threads or processes to improve performance. However, few applications can make effective use of more than eight to ten threads or processes. Thus, multiple applications typically need to be installed on a 32-CPU system to keep it busy.
Altogether, we are at a place where small, inexpensive mainstream systems have all the capabilities of the large systems of a few years ago; multiple applications are being consolidated onto a single server; it isn't cost-effective to spend the same amount of expertise and tuning that was previously dedicated to supporting the large systems; and application workloads have become much more variable.

Further, some resources–such as disk I/O and network communications–are shared resources that are not growing as fast as CPU count. The result is that an application, or a process within an application, can consume excessive resources and degrade the performance of the whole system.

Add virtualization into this mix, and you have an urgent need for better ways to control your systems.
Control groups, or cgroups, are a method for combining sets of tasks and allocating and managing the amount of resources that they are able to consume. For example, you can take a database application and give it 80 percent of four CPUs, 60GB of memory, and 40 percent of disk I/O into the SAN. A web application running on the same system could be given two CPUs, 2GB of memory, and 50 percent of available network bandwidth.

The result is that both applications deliver good performance and do not excessively consume system resources. Further, the system is significantly self-tuning, in that changes in workload are not likely to significantly degrade performance.
Cgroups do this in three phases. First, a cgroup is created and a task or set of tasks is assigned to it. These tasks run within the cgroup. Further, any tasks that they spawn are also in the cgroup. This means that the entire application can be managed as a unit.

Second, a set of resources is allocated to the cgroup. These resources include cpusets, memory, I/O resources, and network resources.
Cpusets allow assigning a number of CPUs, setting affinity for specific CPUs or nodes (a node is generally defined as a set of CPUs or cores in a socket), and the amount of CPU time that can be consumed. Cpusets are vital for making sure that a cgroup provides good performance, that it does not consume excessive resources at the cost of other tasks, and that it is not starved for the CPU resources it needs.
I/O bandwidth and network bandwidth are managed by other resource controllers. Again, the resource controllers allow you to determine how much bandwidth the tasks in a cgroup can consume, and ensure that the tasks in a cgroup neither consume excessive resources nor are starved for resources.

The result is that an application developer or system administrator can define and allocate, at a high level, the system resources that various applications need and will consume. The system then automatically manages and balances the various applications, delivering good, predictable performance and optimizing the performance of the overall system.
File Systems

When Red Hat Enterprise Linux 5 first shipped, 500GB drives had just been introduced and most drives were in the 100-200GB range. Most servers at that time would have storage in the 1-2TB range. Today 2TB drives are common, 3TB drives are becoming available, and 4TB drives are expected to be widely available in 2011.

We can expect systems to commonly have tens of terabytes of storage–or more. Further, SAN-based storage can be expected to approach 100TB.
As with other areas, filesystems have both theoretical and practical aspects of scaling, which we will explore in some detail.
The EXT Filesystem Family

Ext3, or the third extended filesystem, is the long-standing default filesystem in Red Hat Enterprise Linux. It was introduced in Red Hat Enterprise Linux 2.1 and has been the default filesystem for all subsequent releases through Red Hat Enterprise Linux 5. It is well-tuned for general-purpose workloads. Ext3 has long been the most common filesystem for enterprise distributions, and many applications have been developed on Ext3.

Ext3 supports a maximum filesystem size of 16TB, but practical limits may be lower. Even on a 1TB S-ATA drive, the run time of the Ext3 filesystem repair utility (fsck), which is used to verify and repair the filesystem after a crash, is extremely long. For many users that require high availability, this can further reduce the maximum feasible size of an Ext3 filesystem to 2-4TB of storage.
Ext4 is the fourth generation of the EXT filesystem family, and is the default in Red Hat Enterprise Linux 6. It supports a maximum filesystem size of one exabyte, and a maximum single-file size of 16TB. Ext4 adds several new features:

• Extent-based metadata
• Delayed allocation
• Journal check-summing
• Support for large storage
Extent-based allocation is a more compact and efficient way to track utilized space in a filesystem. This improves filesystem performance and reduces the space consumed by metadata.

Delayed allocation allows the filesystem to put off selecting the permanent location for newly written user data until the data is flushed to disk. This enables higher performance, since it allows the filesystem to make this decision with much better information.
Additionally, filesystem repair time (fsck) in Ext4 is much faster than in Ext2 and Ext3. Some filesystem repairs have demonstrated a six-fold speedup.

While the Ext4 filesystem itself is theoretically capable of supporting huge amounts of storage, the maximum supported limit in Red Hat Enterprise Linux 6 is 16TB. Work is needed across a range of system tools, such as fsck and performance tuning tools, to take advantage of the additional capacity of Ext4.

For filesystems larger than 16TB, we recommend using a scalable, high-capacity filesystem such as XFS.
The XFS Filesystem Family

XFS is a robust and mature 64-bit journaling filesystem that supports very large files and filesystems on a single host. As we mentioned above, journaling ensures filesystem integrity after system crashes–for example, due to power outages–by keeping a record of filesystem operations that can be replayed when the system is restarted and the filesystem remounted. XFS was originally developed in the early 1990s by SGI and has a long history of running on extremely large servers and storage arrays. XFS supports a wealth of features, including, but not limited to:

• Delayed allocation
• Dynamically allocated inodes
• B-tree indexing for scalability of free space management
• Ability to support large numbers of concurrent operations
• Extensive run-time metadata consistency checking
• Sophisticated metadata read-ahead algorithms
• Tightly integrated backup and restore utilities
• Online defragmentation
• Online filesystem growing
• Comprehensive diagnostics capabilities
• Scalable and fast repair utilities
• Optimizations for streaming video workloads
While XFS scales to exabytes, Red Hat's maximum supported XFS filesystem image is 100TB. Given its long history in environments that require high performance and scalability, it is not surprising that XFS is routinely measured as one of the highest-performing filesystems on large systems with enterprise workloads–for instance, a system with a relatively high number of CPUs, multiple HBAs, and connections to external disk arrays. XFS also performs well on smaller systems that have a multi-threaded, parallel I/O workload. XFS has relatively poor performance for single-threaded, metadata-intensive workloads–for example, a workload that creates or deletes large numbers of small files in a single thread.

XFS is available with the Scalable File System Add-On for Red Hat Enterprise Linux 6.
Shared Storage Filesystems

Shared storage filesystems, sometimes referred to as cluster filesystems, give each server in the cluster direct access to a shared block storage device over a local Storage Area Network (SAN). Shared storage filesystems work on a set of servers that are all members of a cluster. Unlike network filesystems such as NFS, no single server provides access to data or metadata to other members: each member of the cluster has direct access to the same storage device (the "shared storage"), and all cluster member nodes access the same set of files.
Cache coherency is paramount in a clustered filesystem to ensure data consistency and integrity. There must be a single version of all files in a cluster that is visible to all nodes within the cluster. In order to prevent members of the cluster from updating the same storage block at the same time, which causes data corruption, shared storage filesystems use a cluster-wide locking
mechanism to arbitrate access to the storage. For example, before creating a new file or writing
to a file that is opened on multiple servers, the filesystem component on the server must obtain
the correct lock.
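GFS2 coordinates this through a cluster-wide distributed lock manager (DLM) spanning all nodes. As a single-host analogy only, the sketch below illustrates the same "obtain the lock before writing" discipline using POSIX advisory locks; `flock()` arbitrates between processes on one machine, not across a cluster:

```python
import fcntl
import tempfile

# Single-host analogy of the "obtain the lock before writing" pattern.
# A real cluster filesystem such as GFS2 uses a distributed lock manager
# spanning all nodes; flock() below only arbitrates between processes on
# one machine, but the discipline is the same.

def locked_append(path, data):
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # block until we hold the exclusive lock
        try:
            f.write(data)                  # safe: no other locker can write now
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # release so other writers may proceed

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    path = tmp.name

locked_append(path, "row-1\n")
locked_append(path, "row-2\n")
print(open(path).read())  # both writes present, never interleaved
```

Without the lock, two concurrent writers could update the same block at the same time; with it, updates are strictly serialized.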
The most common use of cluster filesystems is to provide a highly available distributed service--
for example, an Apache web server. Any member of the cluster will see a fully coherent view of
the data stored in the global filesystem, and all data updates will be managed correctly by the
distributed locking mechanisms.
Cluster filesystems perform well with workloads where each node writes primarily to non-shared
files or where shared files are almost entirely read-only. An example of the first case would be
a scientific data capture application, where each node is reading a separate stream of data and
writing this to a file that everyone can read. An example of the second case would be a web
service where multiple nodes are reading a shared database.
Red Hat Enterprise Linux 6 provides the GFS2 clustered filesystem, available with the Resilient
Storage Add-On, which is tightly integrated with Red Hat Enterprise Linux High Availability
clustering, available with the High Availability Add-On. This provides excellent support for
high-performance, high-availability, scalable, mission-critical applications.
Large Boot Drives
As previously mentioned, 2TB drives are widely available now, 3TB drives are becoming
available, 4TB drives will be here soon, and drive vendors continue to invest in new technology.
Further, widespread use of RAID means that it is common to combine 8-12 physical drives into a
single large and reliable logical drive.

Through Red Hat Enterprise Linux 5, the largest boot drive that can be supported is slightly over
2.2TB. This is due to the limitations of the BIOS and its master boot record (MBR) based disk
partitioning. Larger partitions can be used with GPT (GUID Partition Table), a modern replacement
for MBR, which allows partitions up to 9.4 zettabytes (9.4 × 10²¹ bytes). Red Hat Enterprise
Linux 5 supports the use of GPT on data disks, but BIOS-based booting requires the use of the
MBR, thus limiting Red Hat Enterprise Linux 5 to 2.2TB boot partitions.
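The 2.2TB ceiling falls out of simple arithmetic: MBR records partition start and length as 32-bit sector counts, and legacy drives use 512-byte logical sectors. A quick check of both limits:

```python
# Why MBR tops out near 2.2TB: partition start and length are stored as
# 32-bit logical block addresses, and legacy drives use 512-byte sectors.
SECTOR_SIZE = 512            # bytes per logical sector on legacy drives
MAX_SECTORS = 2 ** 32        # largest count a 32-bit MBR field can hold

max_bytes = SECTOR_SIZE * MAX_SECTORS
print(max_bytes)                         # 2199023255552
print(round(max_bytes / 1e12, 2), "TB")  # ~2.2 TB (decimal terabytes)

# GPT uses 64-bit LBAs, so the same arithmetic gives 2^64 sectors:
gpt_max = SECTOR_SIZE * 2 ** 64
print(round(gpt_max / 1e21, 1), "ZB")    # ~9.4 zettabytes
```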
Red Hat Enterprise Linux 6 supports both BIOS and UEFI (Unified Extensible Firmware
Interface). UEFI is a replacement for BIOS, which is designed to support new and emerging
hardware. The BIOS was created for the original IBM PC. While it has evolved considerably, the
BIOS is becoming problematic for modern hardware. Red Hat Enterprise Linux 6 on systems with
UEFI allows use of GPT and larger than 2.2TB partitions for both the boot partition and data
partitions.
RAS
In addition to the performance aspects of scalability, it is also important to look at RAS--
reliability, availability, and serviceability. RAS ensures that a system functions correctly,
provides services when needed, and can be maintained or repaired easily with little impact on
operation.

The goal of reliability is to ensure that the system is functioning correctly and does not deliver
erroneous results. Several approaches are used to enhance reliability. The first of these is to
avoid errors by building reliable hardware and software. A good starting point is specifying
high-quality components, especially for power supplies, fans, capacitors, memory, and connectors
(CPUs generally do not have different quality grades). Operational factors are also important,
such as running a processor below its maximum speed and ensuring good cooling and
well-conditioned power. A highly reliable operating system such as Red Hat Enterprise Linux is
vital.
There are limitations on how reliable any single piece of hardware can be made, so the next step
is to anticipate failures when designing hardware--to detect errors, recover from errors, and
continue. An example of this is the ECC memory used in servers. ECC memory has the ability to
detect multiple bit errors and to correct single bit errors and continue, allowing the system to
continue to run with many types of memory errors. Today, ECC-style approaches are used inside
the CPU, across I/O channels, and even across networks.
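The principle behind ECC can be shown with a toy Hamming(7,4) code: parity bits computed over overlapping subsets of the data bits let the decoder not only detect a single flipped bit but locate and repair it. (Real server ECC uses wider SECDED codes, typically 72 stored bits per 64 data bits; this is an illustrative sketch, not the actual DIMM encoding.)

```python
# Toy Hamming(7,4) code: the same principle ECC memory applies per word
# lets us locate and flip a single corrupted bit.

def encode(d):                      # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):                     # c: 7-bit codeword, possibly one bit flipped
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3 # 0 = clean, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1        # flip the bad bit back
    return [c[2], c[4], c[5], c[6]] # recover the 4 data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                      # simulate a single-bit memory fault
assert correct(stored) == word      # the error is located and corrected
```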
Another example is RAID storage. Disk drives are one of the most failure-prone parts of
computers. Engineers proposed that the best way to deal with disk failures was to use redundant
disks, so that the failure of a disk would not cause any loss of data. Furthermore, with proper
design, the system could continue running while the bad disk was replaced.
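The redundancy idea is easy to demonstrate in miniature. In a RAID-5-style layout, a parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors:

```python
# RAID-5-style redundancy in miniature: the parity block is the XOR of the
# data blocks, so any single lost block can be rebuilt from the others.
from functools import reduce

def xor_blocks(blocks):
    # XOR equal-length byte blocks together, byte by byte.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

disks = [b"AAAA", b"BBBB", b"CCCC"]   # data "disks" (equal-size blocks)
parity = xor_blocks(disks)            # stored on a fourth disk

# Disk 1 fails; rebuild its contents from the survivors plus parity.
rebuilt = xor_blocks([disks[0], disks[2], parity])
assert rebuilt == disks[1]            # no data was lost
```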
Perhaps the ultimate example is fault-tolerant computing, where every element of the computer
from CPU and memory to I/O and storage is duplicated. Not only is every component duplicated,
but all operations are synchronized and performed in lock step. Any hardware that fails is
automatically turned off, and the duplicate hardware continues operation.
While these examples can be implemented in hardware, the true power of RAS is delivered
when hardware, operating systems, and applications cooperate to deliver a robust system.
Considerable work has gone into Red Hat Enterprise Linux 6 by both Red Hat and hardware
vendors to enhance RAS support.
A simple example of this is stateless operations, such as web servers. It is common to use
multiple web servers, for both performance and availability. If a web server fails, requests are
sent to other web servers. If a web server fails while servicing a request, a simple page refresh
in the web browser will re-run the transaction.
If it isn't possible to correct the error and continue, it is critical to detect the error and
prevent it from propagating. If this is not done, faulty results and corrupted data can occur.
Detecting a hardware error is called a machine check, and dealing with hardware errors is the
responsibility of the machine check architecture (MCA). MCA provides the framework for
reporting, logging, and handling hardware errors.
For many types of errors, the system can retry the operation that failed. If the retry succeeds,
the system can continue operation. Of course, the error should be logged, to give you the
opportunity to identify trends and fix problems before they occur.
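The retry-then-log pattern can be sketched in a few lines. The `flaky_read` below is a hypothetical stand-in for re-reading a marginal memory location or I/O channel, not a real kernel interface:

```python
# Sketch of "retry, then log" error handling: transient faults are retried,
# and every occurrence is recorded so trends can be spotted early.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("mca-sketch")

def retry(operation, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except IOError as err:
            log.warning("attempt %d failed: %s", attempt, err)  # keep a trend record
    raise RuntimeError("operation failed after %d attempts" % attempts)

# Hypothetical flaky operation that succeeds on the third try.
state = {"calls": 0}
def flaky_read():
    state["calls"] += 1
    if state["calls"] < 3:
        raise IOError("transient parity error")
    return 0xBEEF

assert retry(flaky_read) == 0xBEEF   # transient fault survived
assert state["calls"] == 3           # two failures were logged along the way
```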
The simplest way to deal with uncorrectable hardware errors is to immediately stop the system
when an error occurs--to crash the system. While this is a brute-force approach, it is better
than allowing errors to continue. This allows work to fail over to another system, to reboot the
system, or to repair the system and then reboot it. The MCA logs and crash dumps provide
valuable tools for determining why the system crashed, whether it was a hardware or software
problem, and input on how to fix the system.
More advanced versions of MCA, implemented in new hardware and Red Hat Enterprise Linux 6,
implement a fine-grained approach. Instead of crashing the entire system, they can disable
specific hardware components--CPUs, sections of memory, NICs, storage controllers, or other
components. These components will not be used until they are returned to service. This allows
the system to continue operating even with hardware failures.
As an example, a failure might occur with a NIC in a PCIe hot-swap slot. Further, assume that
there are two NICs in the system which are bonded together so that they share network traffic
and appear to the system as a single logical NIC. (Bonded NICs can double the network
bandwidth available from a single NIC.) The system (hardware plus Red Hat Enterprise Linux 6)
would mark the NIC as having an error and disable it. The error would be reported to the system
administrator. The NIC could be removed and replaced with a new one, while the system was
running. Finally, the new NIC could be started and added back to the system. During this entire
process, the system would continue to function normally, with the only visible sign being a
reduction in network performance. This case requires support from system hardware, NIC
hardware, NIC driver, and the operating system. It also, of course, requires support by the
system administrator and repair team.
A similar situation applies to multi-path storage. It is common to have two or more paths into
a SAN, for reliability and performance. The multi-path I/O subsystem is designed to accommodate
failures in the fibre channel links, switches, and storage controllers. As described above for
NICs, a hot-swap fibre channel adapter experiencing a hardware error can be identified, stopped,
replaced, and re-started with no impact on system operation.
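The fail-over/restore behavior described for bonded NICs and multi-path storage can be sketched abstractly. This toy model (the `MultiPath` class and path names are illustrative, not the device-mapper multipath API) shows traffic moving to a surviving path and a repaired path being hot-added back, all without interrupting the caller:

```python
# Abstract sketch of path failover, as the multipath I/O layer (or a bonded
# NIC pair) behaves: I/O moves to a surviving path when one fails, and a
# repaired path can be restored, without interrupting the service using it.

class MultiPath:
    def __init__(self, paths):
        self.healthy = dict.fromkeys(paths, True)

    def send(self, data):
        for path, ok in self.healthy.items():  # first healthy path wins
            if ok:
                return (path, data)
        raise IOError("all paths failed")

    def fail(self, path):                      # error detected: disable the path
        self.healthy[path] = False

    def restore(self, path):                   # path repaired/replaced: hot-add it back
        self.healthy[path] = True

mp = MultiPath(["fc0", "fc1"])
assert mp.send("io")[0] == "fc0"   # normal operation uses the first path
mp.fail("fc0")                     # adapter error: path disabled, service continues
assert mp.send("io")[0] == "fc1"
mp.restore("fc0")                  # replacement adapter added back while running
assert mp.send("io")[0] == "fc0"
```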
Other types of problems have been more difficult to deal with. Failures in CPUs and memory
have traditionally brought the entire system down. However, this isn't always necessary.
Especially on larger systems, there is a good possibility that a CPU or memory failure will only
impact one (or a few) applications, not the operating system itself. In this case it is feasible
to stop ("crash") only the applications affected by the failure, mark the failing hardware so
that it won't be used, and continue running the system.

In many cases, the affected applications can be immediately restarted with minimal downtime.
The final part of the RAS story is for the system to provide more details on exactly what and
where the hardware problem is. Today's systems may have over 100 memory modules, dozens
of I/O controllers, and dozens of disk drives. Traditional repair approaches, such as swapping
out memory until the problem goes away, simply aren't sufficient.
This is why the new systems don't just report "memory error" or "disk error." Instead they will
report "uncorrectable memory errors in DIMM slots 73, 74, and 87" or "fatal error in device e2,
PCI slot 7." This information tells you exactly which devices have failed and need to be
replaced and where they are located.
The combination of new technologies in processors, I/O subsystems, I/O devices, drivers, and
Red Hat Enterprise Linux 6 delivers systems that are more reliable, have higher availability,
continue to operate in the presence of errors which would crash earlier systems, and can be
repaired more quickly--all critical components of an enterprise system.
SALES AND INQUIRIES

LATIN AMERICA
+54 11 4329 7300
www.latam.redhat.com

NORTH AMERICA
1-888-REDHAT1
www.redhat.com

EUROPE, MIDDLE EAST AND AFRICA
00800 7334 2835
www.europe.redhat.com

ASIA PACIFIC
+65 6490 4200
www.apac.redhat.com
ABOUT RED HAT

Red Hat was founded in 1993 and is headquartered in Raleigh, NC. Today, with more than 60
offices around the world, Red Hat is the largest publicly traded technology company fully
committed to open source. That commitment has paid off over time, for us and our customers,
proving the value of open source software and establishing a viable business model built around
the open source way.
Copyright © 2010 Red Hat, Inc. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, and RHCE are trademarks of Red Hat, Inc., registered in the U.S. and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
www.redhat.com #4276437_1010