Professional Linux® Kernel Architecture
Wolfgang Mauerer
Wiley Publishing, Inc.



Professional Linux Kernel Architecture

Introduction
Chapter 1: Introduction and Overview
Chapter 2: Process Management and Scheduling
Chapter 3: Memory Management
Chapter 4: Virtual Process Memory
Chapter 5: Locking and Interprocess Communication
Chapter 6: Device Drivers
Chapter 7: Modules
Chapter 8: The Virtual Filesystem
Chapter 9: The Extended Filesystem Family
Chapter 10: Filesystems without Persistent Storage
Chapter 11: Extended Attributes and Access Control Lists
Chapter 12: Networks
Chapter 13: System Calls
Chapter 14: Kernel Activities
Chapter 15: Time Management
Chapter 16: Page and Buffer Cache
Chapter 17: Data Synchronization
Chapter 18: Page Reclaim and Swapping
Chapter 19: Auditing
Appendix A: Architecture Specifics
Appendix B: Working with the Source Code
Appendix C: Notes on C
Appendix D: System Startup
Appendix E: The ELF Binary Format
Appendix F: The Kernel Development Process
Bibliography
Index

Published by Wiley Publishing, Inc., 10475 Crosspoint Boulevard, Indianapolis, IN 46256, www.wiley.com

Copyright © 2008 by Wolfgang Mauerer
Published by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-34343-2
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data:
Mauerer, Wolfgang, 1978-
Professional Linux kernel architecture / Wolfgang Mauerer.
p. cm.
Includes index.
ISBN 978-0-470-34343-2 (pbk.)
1. Linux. 2. Computer architecture. 3. Application software. I. Title.
QA76.9.A73M38 2008
005.432--dc22
2008028067

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Wrox Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

About the Author

Wolfgang Mauerer is a quantum physicist whose professional interests are centered around quantum cryptography, quantum electrodynamics, and compilers for (you guessed it) quantum architectures. With the confirmed capacity of being the worst experimentalist in the known universe, he sticks to the theoretical side of his profession, which is especially reassuring considering his constant fear of accidentally destroying the universe. Outside his research work, he is fascinated by operating systems, and for more than a decade, starting with an article series about the kernel in 1997, he has found great pleasure in documenting and explaining Linux kernel internals.
He is also the author of a book about typesetting with LaTeX and has written numerous articles that have been translated into seven languages in total.

When he's not submerged in vast Hilbert spaces or large quantities of source code, he tries to take the opposite direction, namely, upward, be this with model planes, a paraglider, or on foot with an ice axe in his hands: Mountains especially have the power to outrival even the Linux kernel. Consequently, he considers planning and accomplishing a first-ascent expedition to the vast arctic glaciers of east Greenland to be the really unique achievement in his life.

Being interested in everything that is fundamental, he is also the author of the first compiler for Plankalkül, the world's earliest high-level language, devised in 1942-1946 by Konrad Zuse, the father of the computer. As an avid reader, he is proud that despite the two-digit number of computers present in his living room, the volume required for books still occupies a larger share.

Credits

Executive Editor: Carol Long
Senior Development Editor: Tom Dinse
Production Editor: Debra Banninger
Copy Editors: Cate Caffrey, Kathryn Duggan
Editorial Manager: Mary Beth Wakefield
Production Manager: Tim Tate
Vice President and Executive Group Publisher: Richard Swadley
Vice President and Executive Publisher: Joseph B. Wikert
Project Coordinator, Cover: Lynsey Stanford
Proofreader: Publication Services, Inc.
Indexer: Jack Lewis

Acknowledgments

First and foremost, I have to thank the thousands of programmers who have created the Linux kernel over the years, most of them commercially based, but some also just for their own private or academic joy. Without them, there would be no kernel, and I would have had nothing to write about. Please accept my apologies that I cannot list all several hundred names here, but in true UNIX style, you can easily generate the list by:

    for file in $ALL_FILES_COVERED_IN_THIS_BOOK; do
        git log --pretty="format:%an" $file;
    done | sort -u -k 2,2

It goes without saying that I admire your work very much; you are all the true heroes in this story!

What you are reading right now is the result of an evolution over more than seven years: After two years of writing, the first edition was published in German by Carl Hanser Verlag in 2003. It then described kernel 2.6.0. The text was used as a basis for the low-level design documentation for the EAL4+ security evaluation of Red Hat Enterprise Linux 5, requiring an update to kernel 2.6.18 (if the EAL acronym does not mean anything to you, then Wikipedia is once more your friend). Hewlett-Packard sponsored the translation into English and has, thankfully, granted the rights to publish the result. Updates to kernel 2.6.24 were then performed specifically for this book.

Several people were involved in this evolution, and my appreciation goes to all of them: Leslie Mackay-Poulton, with support from David Jacobs, did a tremendous job at translating a huge pile of text into English. I'm also indebted to Sal La Pietra of atsec information security for pulling the strings to get the translation project rolling, and especially to Stephan Muller for close cooperation during the evaluation. My cordial thanks also go to all other HP and Red Hat people involved in this evaluation, and also to Claudio Kopper and Hans Lohr for our very enjoyable cooperation during this project.
Many thanks also go to the people at Wiley, both visible and invisible to me, who helped to shape the book into its current form.

The German edition was well received by readers and reviewers, but nevertheless comments about inaccuracies and suggestions for improvements were provided. I'm glad for all of them, and would also like to mention the instructors who answered the publisher's survey for the original edition. Some of their suggestions were very valuable for improving the current publication. The same goes for the referees for this edition, especially Dr. Xiaodong Zhang, who provided numerous suggestions for Appendix F.4.

Furthermore, I express my gratitude to Dr. Christine Silberhorn for granting me the opportunity to suspend my regular research work at the Max Planck Research Group for four weeks to work on this project. I hope you enjoyed the peace during this time when nobody was trying to install Linux on your MacBook!

As with every book, I owe my deepest gratitude to my family for supporting me in every aspect of life; I more than appreciate this indispensable aid. Finally, I have to thank Hariet Fabritius for infinite patience with an author whose work cycle not only perfectly matched the most alarming forms of sleep dyssomnias, but who was always right on the brink of confusing his native tongue with C, and whom she consequently had to rescue from numerous situations where he seemingly had lost his mind (see below...). Now that I have more free time again, I'm not only looking forward to our well-deserved holiday, but can finally embark upon the project of giving your laptop all the joys of a proper operating system! (Writing these acknowledgments, I all of a sudden realize why people always hasten to lock away their laptops when they see me approaching....)

Contents

Introduction

Chapter 1: Introduction and Overview
    Tasks of the Kernel; Implementation Strategies; Elements of the Kernel; Processes, Task Switching, and Scheduling; Unix Processes; Address Spaces and Privilege Levels; Page Tables; Allocation of Physical Memory; Timing; System Calls; Device Drivers, Block and Character Devices; Networks; Filesystems; Modules and Hotplugging; Caching; List Handling; Object Management and Reference Counting; Data Types; ...and Beyond the Infinite; Why the Kernel Is Special; Some Notes on Presentation; Summary

Chapter 2: Process Management and Scheduling
    Process Priorities; Process Life Cycle; Preemptive Multitasking; Process Representation; Process Types; Namespaces; Process Identification Numbers; Task Relationships; Process Management System Calls; Process Duplication; Kernel Threads; Starting New Programs; Exiting Processes; Implementation of the Scheduler; Overview; Data Structures; Dealing with Priorities; Core Scheduler; The Completely Fair Scheduling Class; Data Structures; CFS Operations; Queue Manipulation; Selecting the Next Task; Handling the Periodic Tick; Wake-up Preemption; Handling New Tasks; The Real-Time Scheduling Class; Properties; Data Structures; Scheduler Operations; Scheduler Enhancements; SMP Scheduling; Scheduling Domains and Control Groups; Kernel Preemption and Low Latency Efforts; Summary

Chapter 3: Memory Management
    Overview; Organization in the (N)UMA Model; Overview; Data Structures; Page Tables; Data Structures; Creating and Manipulating Entries; Initialization of Memory Management; Data Structure Setup; Architecture-Specific Setup; Memory Management during the Boot Process; Management of Physical Memory; Structure of the Buddy System; Avoiding Fragmentation; Initializing the Zone and Node Data Structures; Allocator API; Reserving Pages; Freeing Pages; Allocation of Discontiguous Pages in the Kernel; Kernel Mappings; The Slab Allocator; Alternative Allocators; Memory Management in the Kernel; The Principle of Slab Allocation; Implementation; General Caches; Processor Cache and TLB Control; Summary

Chapter 4: Virtual Process Memory
    Introduction; Virtual Process Address Space; Layout of the Process Address Space; Creating the Layout; Principle of Memory Mappings; Data Structures; Trees and Lists; Representation of Regions; The Priority Search Tree; Operations on Regions; Associating Virtual Addresses with a Region; Merging Regions; Inserting Regions; Creating Regions; Address Spaces; Memory Mappings; Creating Mappings; Removing Mappings; Nonlinear Mappings; Reverse Mapping; Data Structures; Creating a Reverse Mapping; Using Reverse Mapping; Managing the Heap; Handling of Page Faults; Correction of Userspace Page Faults; Demand Allocation/Paging; Anonymous Pages; Copy on Write; Getting Nonlinear Mappings; Kernel Page Faults; Copying Data between Kernel and Userspace; Summary

Chapter 5: Locking and Interprocess Communication
    Control Mechanisms; Race Conditions; Critical Sections; Kernel Locking Mechanisms; Atomic Operations on Integers; Spinlocks; Semaphores; The Read-Copy-Update Mechanism; Memory and Optimization Barriers; Reader/Writer Locks; The Big Kernel Lock; Mutexes; Approximate Per-CPU Counters; Lock Contention and Fine-Grained Locking; System V Interprocess Communication; System V Mechanisms; Semaphores; Message Queues; Shared Memory; Other IPC Mechanisms; Signals; Pipes and Sockets; Summary

Chapter 6: Device Drivers
    I/O Architecture; Expansion Hardware; Access to Devices; Device Files; Character, Block, and Other Devices; Device Addressing Using Ioctls; Representation of Major and Minor Numbers; Registration; Association with the Filesystem; Device File Elements in Inodes; Standard File Operations; Standard Operations for Character Devices; Standard Operations for Block Devices; Character Device Operations; Representing Character Devices; Opening Device Files; Reading and Writing; Block Device Operations; Representation of Block Devices; Data Structures; Adding Disks and Partitions to the System; Opening Block Device Files; Request Structure; BIOs; Submitting Requests; I/O Scheduling; Implementation of Ioctls; Resource Reservation; Resource Management; I/O Memory; I/O Ports; Bus Systems; The Generic Driver Model; The PCI Bus; USB; Summary

Chapter 7: Modules
    Overview; Using Modules; Adding and Removing; Dependencies; Querying Module Information; Automatic Loading; Inserting and Deleting Modules; Module Representation; Dependencies and References; Binary Structure of Modules; Inserting Modules; Removing Modules; Automation and Hotplugging; Automatic Loading with kmod; Hotplugging; Version Control; Checksum Methods; Version Control Functions; Summary

Chapter 8: The Virtual Filesystem
    Filesystem Types; The Common File Model; Inodes; Links; Programming Interface; Files as a Universal Interface; Structure of the VFS; Structural Overview; Inodes; Process-Specific Information; File Operations; Directory Entry Cache; Working with VFS Objects; Filesystem Operations; File Operations; Standard Functions; Generic Read Routine; The fault Mechanism; Permission-Checking; Summary

Chapter 9: The Extended Filesystem Family
    Introduction; Second Extended Filesystem; Physical Structure; Data Structures; Creating a Filesystem; Filesystem Actions; Third Extended Filesystem; Concepts; Data Structures; Summary

Chapter 10: Filesystems without Persistent Storage
    The proc Filesystem; Contents of /proc; Data Structures; Initialization; Mounting the Filesystem; Managing /proc Entries; Reading and Writing Information; Task-Related Information; System Control Mechanism; Simple Filesystems; Sequential Files; Writing Filesystems with Libfs; The Debug Filesystem; Pseudo Filesystems; Sysfs; Overview; Data Structures; Mounting the Filesystem; File and Directory Operations; Populating Sysfs; Summary

Chapter 11: Extended Attributes and Access Control Lists
    Extended Attributes; Interface to the Virtual Filesystem; Implementation in Ext3; Implementation in Ext2; Access Control Lists; Generic Implementation; Implementation in Ext3; Implementation in Ext2; Summary

Chapter 12: Networks
    Linked Computers; ISO/OSI and TCP/IP Reference Model; Communication via Sockets; Creating a Socket; Using Sockets; Datagram Sockets; The Layer Model of Network Implementation; Networking Namespaces; Socket Buffers; Data Management Using Socket Buffers; Management Data of Socket Buffers; Network Access Layer; Representation of Network Devices; Receiving Packets; Sending Packets; Network Layer; IPv4; Receiving Packets; Local Delivery to the Transport Layer; Packet Forwarding; Sending Packets; Netfilter; IPv6; Transport Layer; UDP; TCP; Application Layer; Socket Data Structures; Sockets and Files; The socketcall System Call; Creating Sockets; Receiving Data; Sending Data; Networking from within the Kernel; Communication Functions; The Netlink Mechanism; Summary

Chapter 13: System Calls
    Basics of System Programming; Tracing System Calls; Supported Standards; Restarting System Calls; Available System Calls; Implementation of System Calls; Structure of System Calls; Access to Userspace; System Call Tracing; Summary

Chapter 14: Kernel Activities
    Interrupts; Interrupt Types; Hardware IRQs; Processing Interrupts; Data Structures; Interrupt Flow Handling; Initializing and Reserving IRQs; Servicing IRQs; Software Interrupts; Starting SoftIRQ Processing; The SoftIRQ Daemon; Tasklets; Generating Tasklets; Registering Tasklets; Executing Tasklets; Wait Queues and Completions; Wait Queues; Completions; Work Queues; Summary

Chapter 15: Time Management
    Overview; Types of Timers; Configuration Options; Implementation of Low-Resolution Timers; Timer Activation and Process Accounting; Working with Jiffies; Data Structures; Dynamic Timers; Generic Time Subsystem; Overview; Configuration Options; Time Representation; Objects for Time Management; High-Resolution Timers; Data Structures; Setting Timers; Implementation; Periodic Tick Emulation; Switching to High-Resolution Timers; Dynamic Ticks; Data Structures; Dynamic Ticks for Low-Resolution Systems; Dynamic Ticks for High-Resolution Systems; Stopping and Starting Periodic Ticks; Broadcast Mode; Implementing Timer-Related System Calls; Time Bases; The alarm and setitimer System Calls; Getting the Current Time; Managing Process Times; Summary

Chapter 16: Page and Buffer Cache
    Structure of the Page Cache; Managing and Finding Cached Pages; Writing Back Modified Data; Structure of the Buffer Cache; Address Spaces; Data Structures; Page Trees; Operations on Address Spaces; Implementation of the Page Cache; Allocating Pages; Finding Pages; Waiting on Pages; Operations with Whole Pages; Page Cache Readahead; Implementation of the Buffer Cache; Data Structures; Operations; Interaction of Page and Buffer Cache; Independent Buffers; Summary

Chapter 17: Data Synchronization
    Overview; The pdflush Mechanism; Starting a New Thread; Thread Initialization; Performing Actual Work; Periodic Flushing; Associated Data Structures; Page Status; Writeback Control; Adjustable Parameters; Central Control; Superblock Synchronization; Inode Synchronization; Walking the Superblocks; Examining Superblock Inodes; Writing Back Single Inodes; Congestion; Data Structures; Thresholds; Setting and Clearing the Congested State; Waiting on Congested Queues; Forced Writeback; Laptop Mode; System Calls for Synchronization Control; Full Synchronization; Synchronization of Inodes; Synchronization of Individual Files; Synchronization of Memory Mappings; Summary

Chapter 18: Page Reclaim and Swapping
    Overview; Swappable Pages; Page Thrashing; Page-Swapping Algorithms; Page Reclaim and Swapping in the Linux Kernel; Organization of the Swap Area; Checking Memory Utilization; Selecting Pages to Be Swapped Out; Handling Page Faults; Shrinking Kernel Caches; Managing Swap Areas; Data Structures; Creating a Swap Area; Activating a Swap Area; The Swap Cache; Identifying Swapped-Out Pages; Structure of the Cache; Adding New Pages; Searching for a Page; Writing Data Back; Page Reclaim; Overview; Data Structures; Determining Page Activity; Shrinking Zones; Isolating LRU Pages and Lumpy Reclaim; Shrinking the List of Active Pages; Reclaiming Inactive Pages; The Swap Token; Handling Swap-Page Faults; Swapping Pages In; Reading the Data; Swap Readahead; Initiating Memory Reclaim; Periodic Reclaim with kswapd; Swap-out in the Event of Acute Memory Shortage; Shrinking Other Caches; Data Structures; Registering and Removing Shrinkers; Shrinking Caches; Summary

Chapter 19: Auditing
    Overview; Audit Rules; Implementation; Data Structures; Initialization; Processing Requests; Logging Events; System Call Auditing; Summary

Appendix A: Architecture Specifics
    Overview; Data Types; Alignment; Memory Pages; System Calls; String Processing; Thread Representation; IA-32; IA-64; ARM; Sparc64; Alpha; Mips; PowerPC; AMD64; Bit Operations and Endianness; Manipulation of Bit Chains; Conversion between Byte Orders; Page Tables; Miscellaneous; Checksum Calculation; Context Switch; Finding the Current Process; Summary

Appendix B: Working with the Source Code
    Organization of the Kernel Sources; Configuration with Kconfig; A Sample Configuration File; Language Elements of Kconfig; Processing Configuration Information; Compiling the Kernel with Kbuild; Using the Kbuild System; Structure of the Makefiles; Useful Tools; LXR; Patch and Diff; Git; Debugging and Analyzing the Kernel; GDB and DDD; Local Kernel; KGDB; User-Mode Linux; Summary

Appendix C: Notes on C
    How the GNU C Compiler Works; From Source Code to Machine Program; Assembly and Linking; Procedure Calls; Optimization; Inline Functions; Attributes; Inline Assembler; __builtin Functions; Pointer Arithmetic; Standard Data Structures and Techniques of the Kernel; Reference Counters; Pointer Type Conversions; Alignment Issues; Bit Arithmetic; Pre-Processor Tricks; Miscellaneous; Doubly Linked Lists; Hash Lists; Red-Black Trees; Radix Trees; Summary

Appendix D: System Startup
    Architecture-Specific Setup on IA-32 Systems; High-Level Initialization; Subsystem Initialization; Summary

Appendix E: The ELF Binary Format
    Layout and Structure; ELF Header; Program Header Table; Sections; Symbol Table; String Tables; Data Structures in the Kernel; Data Types; Headers; String Tables; Symbol Tables; Relocation Entries; Dynamic Linking; Summary

Appendix F: The Kernel Development Process
    Introduction; Kernel Trees and the Structure of Development; The Command Chain; The Development Cycle; Online Resources; The Structure of Patches; Technical Issues; Submission and Review; Linux and Academia; Some Examples; Adopting Research; Summary

References

Index

Introduction

Unix is simple and coherent, but it takes a genius (or at any rate a programmer) to understand and appreciate the simplicity. (Dennis Ritchie)

Note from the authors: Yes, we have lost our minds. Be forewarned: You will lose yours too. (Benny Goodheart & James Cox)

Unix is distinguished by a simple, coherent, and elegant design: truly remarkable features that have enabled the system to influence the world for more than a quarter of a century. And especially thanks to the growing presence of Linux, the idea is still picking up momentum, with no end of the growth in sight.

Unix and Linux carry a certain fascination, and the two quotations above hopefully capture the spirit of this attraction. Consider Dennis Ritchie's quote: Is the coinventor of Unix at Bell Labs completely right in saying that only a genius can appreciate the simplicity of Unix? Luckily not, because he puts himself into perspective immediately by adding that programmers also qualify to value the essence of Unix.

Understanding the meagerly documented, demanding, and complex sources of Unix as well as of Linux is not always an easy task. But once one has started to experience the rich insights that can be gained from the kernel sources, it is hard to escape the fascination of Linux. It seems fair to warn you that it's easy to get addicted to the joy of the operating system kernel once starting to dive into it. This was already noted by Benny Goodheart and James Cox, whose preface to their book The Magic Garden Explained (second quotation above) explained the internals of Unix System V.
And Linux is definitely also capable of helping you to lose your mind!

This book acts as a guide and companion that takes you through the kernel sources and sharpens your awareness of the beauty, elegance, and, last but not least, esthetics of their concepts. There are, however, some prerequisites to foster an understanding of the kernel. C should not just be a letter; neither should it be a foreign language. Operating systems are supposed to be more than just a Start button, and a small amount of algorithmics can also do no harm. Finally, it is preferable if computer architecture is not just about how to build the most fancy case. From an academic point of view, this comes closest to the lectures "Systems Programming," "Algorithmics," and "Fundamentals of Operating Systems." The previous edition of this book has been used to teach the fundamentals of Linux to advanced undergraduate students in several universities, and I hope that the current edition will serve the same purpose.

Discussing all aforementioned topics in detail is outside the scope of this book, and when you consider the mass of paper you are holding in your hands right now (or maybe you are not holding it, for this very reason), you'll surely agree that this would not be a good idea. When a topic not directly related to the kernel, but required to understand what the kernel does, is encountered in this book, I will briefly introduce you to it. To gain a more thorough understanding, however, consult the books on computing fundamentals that I recommend. Naturally, there is a large selection of texts, but some books that I found particularly insightful and illuminating include The C Programming Language, by Brian W. Kernighan and Dennis M. Ritchie [KR88]; Modern Operating Systems, by Andrew S. Tanenbaum [Tan07], on the basics of operating systems in general; Operating Systems: Design and Implementation, by Andrew S. Tanenbaum and Albert S. Woodhull [TW06], on Unix (Minix) in particular; Advanced Programming in the Unix Environment, by W. Richard Stevens and Stephen A. Rago [SR05], on userspace programming; and the two volumes Computer Architecture and Computer Organization and Design, on the foundations of computer architecture, by John L. Hennessy and David A. Patterson [HP06, PH07]. All have established themselves as classics in the literature.

Additionally, Appendix C contains some information about extensions of the GNU C compiler that are used by the kernel, but do not necessarily find widespread use in general programming.

When the first edition of this book was written, a schedule for kernel releases was more or less nonexistent. This has changed drastically during the development of kernel 2.6, and as I discuss in Appendix F, kernel developers have become pretty good at issuing new releases at periodic, predictable intervals. I have focused on kernel 2.6.24, but have also included some references to 2.6.25 and 2.6.26, which were released after this book was written but before all technical publishing steps had been completed. Since a number of comprehensive changes to the whole kernel have been merged into 2.6.24, picking this release as the target seems a good choice. While a detail here or there will have changed in more recent kernel versions as compared to the code discussed in this book, the big picture will remain the same for quite some time.

In the discussion of the various components and subsystems of the kernel, I have tried to avoid overloading the text with unimportant details.
Likewise, I have tried not to lose track of the connection with source code. It is a very fortunate situation that, thanks to Linux, we are able to inspect the source of a real, working, production operating system, and it would be sad to neglect this essential aspect of the kernel. To keep the book's volume below the space of a whole bookshelf, I have selected only the most crucial parts of the sources. Appendix F introduces some techniques that ease reading of and working with the real source, an indispensable step toward understanding the structure and implementation of the Linux kernel.

One particularly interesting observation about Linux (and Unix in general) is that it is well suited to evoke emotions. Flame wars on the Internet and heated technical debates about operating systems may be one thing, but for which other operating system does there exist a handbook (The Unix-Haters Handbook, edited by Simson Garfinkel et al. [GWS94]) on how best to hate it? When I wrote the preface to the first edition, I noted that it is not a bad sign for the future that a certain international software company responds to Linux with a mixture of abstruse accusations and polemics. Five years later, the situation has improved, and the aforementioned vendor has more or less officially accepted the fact that Linux has become a serious competitor in the operating system world. And things are certainly going to improve even more during the next five years....

Naturally (and not astonishingly), I admit that I am definitely fascinated by Linux (and, sometimes, am also sure that I have lost my mind because of this), and if this book helps to carry this excitement to the reader, the long hours (and especially nights) spent writing it were worth every minute!

Suggestions for improvements and constructive critique can be passed to [email protected], or via www.wrox.com. Naturally, I'm also happy if you tell me that you liked the book!

What This Book Covers

This book discusses the concepts, structure, and implementation of the Linux kernel. In particular, the individual chapters cover the following topics:

- Chapter 1 provides an overview of the Linux kernel and describes the big picture that is investigated more closely in the following chapters.
- Chapter 2 talks about the basics of multitasking, scheduling, and process management, and investigates how these fundamental techniques and abstractions are implemented.
- Chapter 3 discusses how physical memory is managed. Both the interaction with hardware and the in-kernel distribution of RAM via the buddy system and the slab allocator are covered.
- Chapter 4 proceeds to describe how userland processes experience virtual memory, and the comprehensive data structures and actions required from the kernel to implement this view.
- Chapter 5 introduces the mechanisms required to ensure proper operation of the kernel on multiprocessor systems. Additionally, it covers the related question of how processes can communicate with each other.
- Chapter 6 walks you through the means for writing device drivers that are required to add support for new hardware to the kernel.
- Chapter 7 explains how modules allow for dynamically adding new functionality to the kernel.
- Chapter 8 discusses the virtual filesystem, a generic layer of the kernel that allows for supporting a wide range of different filesystems, both physical and virtual.
- Chapter 9 describes the extended filesystem family, that is, the Ext2 and Ext3 filesystems that are the standard workhorses of many Linux installations.
- Chapter 10 goes on to discuss procfs and sysfs, two filesystems that are not designed to store information, but to present meta-information about the kernel to userland. Additionally, a number of means to ease writing filesystems are presented.
- Chapter 11 shows how extended attributes and access control lists that can help to improve system security are implemented.
- Chapter 12 discusses the networking implementation of the kernel, with a specific focus on IPv4, TCP, UDP, and netfilter.
- Chapter 13 introduces how system calls, which are the standard way to request a kernel action from userland, are implemented.
- Chapter 14 analyzes how kernel activities are triggered with interrupts, and presents means of deferring work to a later point in time.
- Chapter 15 shows how the kernel handles all time-related requirements, both with low and high resolution.
- Chapter 16 talks about speeding up kernel operations with the help of the page and buffer caches.
- Chapter 17 discusses how cached data in memory are synchronized with their sources on persistent storage devices.
- Chapter 18 introduces how page reclaim and swapping work.
- Chapter 19 gives an introduction to the audit implementation, which allows for observing in detail what the kernel is doing.
- Appendix A discusses peculiarities of various architectures supported by the kernel.
- Appendix B walks through various tools and means of working efficiently with the kernel sources.
- Appendix C provides some technical notes about the programming language C, and also discusses how the GNU C compiler is structured.
- Appendix D describes how the kernel is booted.
- Appendix E gives an introduction to the ELF binary format.
- Appendix F discusses numerous social aspects of kernel development and the Linux kernel community.

Chapter 1: Introduction and Overview

Operating systems are not only regarded as a fascinating part of information technology, but are also the subject of controversial discussion among a wide public.[1] Linux has played a major role in this development. Whereas just 10 years ago a strict distinction was made between relatively simple academic systems available in source code and commercial variants with varying performance capabilities whose sources were a well-guarded secret, nowadays anybody can download the sources of Linux (or of any other free systems) from the Internet in order to study them.

Linux is now installed on millions of systems and is used by home users and professionals alike for a wide range of tasks. From miniature embedded systems in wristwatches to massively parallel mainframes, there are countless ways of exploiting Linux productively. And this makes the sources so interesting. A sound, well-established concept (Unix) melded with powerful innovations and a strong penchant for dealing with problems that do not arise in academic teaching systems: this is what makes Linux so fascinating.

This book describes the central functions of the kernel, explains its underlying structures, and examines its implementation. Because complex subjects are discussed, I assume that the reader already has some experience in operating systems and systems programming in C (it goes without saying that I assume some familiarity with using Linux systems).
I touch briefly on several general concepts relevant to common operating system problems, but my prime focus is on the implementation of the Linux kernel. Readers unfamiliar with a particular topic will find explanations on relevant basics in one of the many general texts on operating systems; for example, in Tanenbaum's outstanding introductions ([TW06] and [Tan07]). A solid foundation of C programming is required. Because the kernel makes use of many advanced techniques of C and, above all, of many special features of the GNU C compiler, Appendix C discusses the finer points of C with which even good programmers may not be familiar. A basic knowledge of computer structures will be useful as Linux necessarily interacts very directly with system hardware, particularly with the CPU. There are also a large number of introductory works dealing with this subject; some are listed in the reference section. When I deal with CPUs in greater depth (in most cases I take the IA-32 or AMD64 architecture as an example because Linux is used predominantly on these system architectures), I explain the relevant hardware details. When I discuss mechanisms that are not ubiquitous in daily life, I will explain the general concept behind them, but expect that readers will also consult the quoted manual pages for more advice on how a particular feature is used from userspace.

[1] It is not the intention of this book to participate in ideological discussions such as whether Linux can be regarded as a full operating system, although it is, in fact, just a kernel that cannot function productively without relying on other components. When I speak of Linux as an operating system without explicitly mentioning the acronyms of similar projects (primarily the GNU project, which, despite strong initial resistance regarding the kernel, reacts extremely sensitively when Linux is used instead of GNU/Linux), this should not be taken to mean that I do not appreciate the importance of the work done by this project. Our reasons are simple and pragmatic: Where do we draw the line when citing those involved without generating such lengthy constructs as GNU/IBM/RedHat/HP/KDE/Linux? If this footnote makes little sense, refer to www.gnu.org/gnu/linux-and-gnu.html, where you will find a summary of the positions of the GNU project. After all ideological questions have been settled, I promise to refrain from using half-page footnotes in the rest of this book.

The present chapter is designed to provide an overview of the various areas of the kernel and to illustrate their fundamental relationships before moving on to lengthier descriptions of the subsystems in the following chapters.

Since the kernel evolves quickly, one question that naturally comes to mind is which version is covered in this book. I have chosen kernel 2.6.24, which was released at the end of January 2008. The dynamic nature of kernel development implies that a new kernel version will be available by the time you read this, and naturally, some details will have changed; this is unavoidable. If it were not the case, Linux would be a dead and boring system, and chances are that you would not want to read the book. While some of the details will have changed, concepts will not have varied essentially.
This is particularly true because 2.6.24 has seen some very fundamental changes as compared to earlier versions. Developers do not rip out such things overnight, naturally.

1.1 Tasks of the Kernel

On a purely technical level, the kernel is an intermediary layer between the hardware and the software. Its purpose is to pass application requests to the hardware and to act as a low-level driver to address the devices and components of the system. Nevertheless, there are other interesting ways of viewing the kernel.

- The kernel can be regarded as an enhanced machine that, in the view of the application, abstracts the computer on a high level. For example, when the kernel addresses a hard disk, it must decide which path to use to copy data from disk to memory, where the data reside, which commands must be sent to the disk via which path, and so on. Applications, on the other hand, need only issue the command that data are to be transferred. How this is done is irrelevant to the application; the details are abstracted by the kernel. Application programs have no contact with the hardware itself,[2] only with the kernel, which, for them, represents the lowest level in the hierarchy they know and is therefore an enhanced machine.
- Viewing the kernel as a resource manager is justified when several programs are run concurrently on a system. In this case, the kernel is an instance that shares available resources (CPU time, disk space, network connections, and so on) between the various system processes while at the same time ensuring system integrity.
- Another view of the kernel is as a library providing a range of system-oriented commands. As is generally known, system calls are used to send requests to the computer; with the help of the C standard library, these appear to the application programs as normal functions that are invoked in the same way as any other function.

[2] The CPU is an exception since it is obviously unavoidable that programs access it. Nevertheless, the full range of possible instructions is not available for applications.

1.2 Implementation Strategies

Currently, there are two main paradigms on which the implementation of operating systems is based:

1. Microkernels: In these, only the most elementary functions are implemented directly in a central kernel, the microkernel. All other functions are delegated to autonomous processes that communicate with the central kernel via clearly defined communication interfaces, for example, various filesystems, memory management, and so on. (Of course, the most elementary level of memory management that controls communication with the system itself is in the microkernel. However, handling on the system call level is implemented in external servers.) Theoretically, this is a very elegant approach because the individual parts are clearly segregated from each other, and this forces programmers to use clean programming techniques. Other benefits of this approach are dynamic extensibility and the ability to swap important components at run time. However, owing to the additional CPU time needed to support complex communication between the components, microkernels have not really established themselves in practice, although they have been the subject of active and varied research for some time now.

2. Monolithic Kernels: They are the alternative, traditional concept. Here, the entire code of the kernel, including all its subsystems such as memory management, filesystems, or device drivers, is packed into a single file.
Each function has access to all other parts of the kernel; this can result in elaborately nested source code if programming is not done with great care.

Because, at the moment, the performance of monolithic kernels is still greater than that of microkernels, Linux was and still is implemented according to this paradigm. However, one major innovation has been introduced. Modules with kernel code that can be inserted or removed while the system is up-and-running support the dynamic addition of a whole range of functions to the kernel, thus compensating for some of the disadvantages of monolithic kernels. This is assisted by elaborate means of communication between the kernel and userland that allows for implementing hotplugging and dynamic loading of modules.
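
Chapter 7 covers modules in depth; as a quick taste of the mechanism, the following is a minimal sketch of a loadable module. It is an illustrative example rather than code from this book: the file name, function names, and log messages are made up, and the module does nothing but print a message when it is inserted and removed.

    /* hello.c - minimal loadable kernel module (illustrative sketch) */
    #include <linux/init.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal example module");

    /* Run when the module is inserted with insmod or modprobe. */
    static int __init hello_init(void)
    {
            printk(KERN_INFO "hello: module loaded\n");
            return 0;   /* 0 signals successful initialization */
    }

    /* Run when the module is removed with rmmod. */
    static void __exit hello_exit(void)
    {
            printk(KERN_INFO "hello: module unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

Built against the kernel's build system (a kbuild makefile containing the line obj-m += hello.o), the resulting hello.ko can be inserted with insmod and removed with rmmod, and the messages appear in the kernel log.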

1.3 Elements of the Kernel

This section provides a brief overview of the various elements of the kernel and outlines the areas we will examine in more detail in the following chapters. Despite its monolithic approach, Linux is surprisingly well structured. Nevertheless, it is inevitable that its individual elements interact with each other; they share data structures, and (for performance reasons) cooperate with each other via more functions than would be necessary in a strictly segregated system. In the following chapters, I am obliged to make frequent reference to the other elements of the kernel and therefore to other chapters, although I have tried to keep the number of forward references to a minimum. For this reason, I introduce the individual elements briefly here so that you can form an impression of their role and their place in the overall concept. Figure 1-1 provides a rough initial overview about the layers that comprise a complete Linux system, and also about some important subsystems of the kernel as such. Notice, however, that the individual subsystems will interact in a variety of additional ways in practice that are not shown in the figure.

[Figure 1-1: High-level overview of the structure of the Linux kernel and the layers in a complete Linux system. Userspace applications sit on top of the C library, which enters the core kernel via system calls; the core kernel (process management, memory management, filesystems/VFS, networking, device drivers) rests on architecture-specific code that addresses the hardware.]

1.3.1 Processes, Task Switching, and Scheduling

Applications, servers, and other programs running under Unix are traditionally referred to as processes. Each process is assigned address space in the virtual memory of the CPU. The address spaces of the individual processes are totally independent so that the processes are unaware of each other; as far as each process is concerned, it has the impression of being the only process in the system. If processes want to communicate to exchange data, for example, then special kernel mechanisms must be used.

Because Linux is a multitasking system, it supports what appears to be concurrent execution of several processes. Since only as many processes as there are CPUs in the system can really run at the same time, the kernel switches (unnoticed by users) between the processes at short intervals to give them the impression of simultaneous processing. Here, there are two problem areas:

1. The kernel, with the help of the CPU, is responsible for the technical details of task switching. Each individual process must be given the illusion that the CPU is always available. This is achieved by saving all state-dependent elements of the process before CPU resources are withdrawn and the process is placed in an idle state. When the process is reactivated, the exact saved state is restored. Switching between processes is known as task switching.

2. The kernel must also decide how CPU time is shared between the existing processes. Important processes are given a larger share of CPU time, less important processes a smaller share. The decision as to which process runs for how long is known as scheduling.

1.3.2 UNIX Processes

Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process that is responsible for further system initialization actions and display of the login prompt or (in more widespread use today) display of a graphical login interface. init is therefore the root from which all processes originate, more or less directly, as shown graphically by the pstree program. init is the top of a tree structure whose branches spread further and further down.

    wolfgang@meitner> pstree
    init-+-acpid
         |-bonobo-activati
         |-cron
         |-cupsd
         |-2*[dbus-daemon]
         |-dbus-launch
         |-dcopserver
         |-dhcpcd
         |-esd
         |-eth1
         |-events/0
         |-gam_server
         |-gconfd-2
         |-gdm---gdm-+-X
         |           `-startkde-+-kwrapper
         |                      `-ssh-agent
         |-gnome-vfs-daemo
         |-gpg-agent
         |-hald-addon-acpi
         |-kaccess
         |-kded
         |-kdeinit-+-amarokapp---2*[amarokapp]
         |         |-evolution-alarm
         |         |-kinternet
         |         |-kio_file
         |         |-klauncher
         |         |-konqueror
         |         |-konsole---bash-+-pstree
         |         |                `-xemacs
         |         |-kwin
         |         |-nautilus
         |         `-netapplet
         |-kdesktop
         |-kgpg
         |-khelper
         |-kicker
         |-klogd
         |-kmix
         |-knotify
         |-kpowersave
         |-kscd
         |-ksmserver
         |-ksoftirqd/0
         |-kswapd0
         |-kthread-+-aio/0
         |         |-ata/0
         |         |-kacpid
         |         |-kblockd/0
         |         |-kgameportd
         |         |-khubd
         |         |-kseriod
         |         |-2*[pdflush]
         |         `-reiserfs/0
         ...

How this tree structure spreads is closely connected with how new processes are generated. For this purpose, Unix uses two mechanisms called fork and exec.

1. fork: Generates an exact copy of the current process that differs from the parent process only in its PID (process identification). After the system call has been executed, there are two processes in the system, both performing the same actions. The memory contents of the initial process are duplicated, at least in the view of the program. Linux uses a well-known technique known as copy on write that allows it to make the operation much more efficient by deferring the copy operations until either parent or child writes to a page; read-only accesses can be satisfied from the same page for both. A possible scenario for using fork is, for example, when a user opens a second browser window. If the corresponding option is selected, the browser executes a fork to duplicate its code and then starts the appropriate actions to build a new window in the child process.

2. exec: Loads a new program into an existing context and then executes it. The memory pages reserved by the old program are flushed, and their contents are replaced with new data. The new program then starts executing. A short usage sketch of both mechanisms follows below.
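
The following userspace sketch shows the typical combination of the two calls; it is an illustrative example, not code from the kernel sources or from this book. The parent duplicates itself with fork, the child replaces its copy of the program with ls via the exec family (execlp here), and the parent waits for the child to terminate.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            pid_t pid = fork();        /* duplicate the current process */

            if (pid < 0) {
                    perror("fork");
                    return EXIT_FAILURE;
            }

            if (pid == 0) {
                    /* Child: replace the duplicated program with a new one. */
                    execlp("ls", "ls", "-l", (char *)NULL);
                    perror("execlp");  /* only reached if exec fails */
                    _exit(EXIT_FAILURE);
            }

            /* Parent: wait until the child has terminated. */
            waitpid(pid, NULL, 0);
            return 0;
    }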

Threads

Processes are not the only form of program execution supported by the kernel. In addition to heavy-weight processes, another name for classical Unix processes, there are also threads, sometimes referred to as light-weight processes. They have also been around for some time, and essentially, a process may consist of several threads that all share the same data and resources but take different paths through the program code. The thread concept is fully integrated into many modern languages, Java, for instance. In simple terms, a process can be seen as an executing program, whereas a thread is a program function or routine running in parallel to the main program. This is useful, for example, when Web browsers need to load several images in parallel. Usually, the browser would have to execute several fork and exec calls to generate parallel instances; these would then be responsible for loading the images and making data received available to the main program using some kind of communication mechanisms. Threads make this situation easier to handle. The browser defines a routine to load images, and the routine is started as a thread with multiple strands (each with different arguments). Because the threads and the main program share the same address space, data received automatically reside in the main program. There is therefore no need for any communication effort whatsoever, except to prevent the threads from stepping on each other's feet by accessing identical memory locations, for instance. Figure 1-2 illustrates the difference between a program with and without threads.

[Figure 1-2: Processes with and without threads. Without threads, each control flow owns a separate address space; with threads, several control flows share a single address space.]

Linux provides the clone method to generate threads. This works in a similar way to fork but enables a precise check to be made of which resources are shared with the parent process and which are generated independently for the thread. This fine-grained distribution of resources extends the classical thread concept and allows for a more or less continuous transition between threads and processes.
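
Userspace programs normally do not call clone directly; they use the POSIX thread library, which is implemented on top of clone. The following sketch is an illustrative example (not taken from this book) in which two threads increment a counter that lives in the shared address space; the mutex is what keeps them from stepping on each other's feet.

    #include <pthread.h>
    #include <stdio.h>

    /* Both threads see the same variables because they share one address space. */
    static int shared_counter;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
            const char *name = arg;

            pthread_mutex_lock(&lock);      /* serialize access to shared data */
            shared_counter++;
            printf("%s sees counter = %d\n", name, shared_counter);
            pthread_mutex_unlock(&lock);

            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2;

            pthread_create(&t1, NULL, worker, "thread 1");
            pthread_create(&t2, NULL, worker, "thread 2");

            pthread_join(t1, NULL);         /* wait for both threads to finish */
            pthread_join(t2, NULL);
            return 0;
    }

Compiled with gcc -pthread and run under strace -f, each pthread_create can be seen to issue a clone system call with flags such as CLONE_VM and CLONE_FILES that request sharing of the address space and other resources with the creating process.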
1.3.3 Address Spaces and Privilege Levels

Before we start to discuss virtual address spaces, there are some notational conventions to fix. Throughout this book I use the abbreviations KiB, MiB, and GiB as units of size. The conventional units KB, MB, and GB are not really suitable in information technology because they represent decimal powers 10^3, 10^6, and 10^9, although the binary system is the ubiquitous basis in computing. Accordingly, KiB stands for 2^10, MiB for 2^20, and GiB for 2^30 bytes.

Because memory areas are addressed by means of pointers, the word length of the CPU determines the maximum size of the address space that can be managed. On 32-bit systems such as IA-32, PPC, and m68k, this is 2^32 = 4 GiB, whereas on more modern 64-bit processors such as Alpha, Sparc64, IA-64, and AMD64, 2^64 bytes can be managed.

The maximal size of the address space is not related to how much physical RAM is actually available, and therefore it is known as the virtual address space. One more reason for this terminology is that every process in the system has the impression that it lives alone in this address space, and other processes are not present from its point of view. Applications do not need to care about other applications and can work as if they were the only process running on the computer.

Linux divides the virtual address space into two parts known as kernel space and userspace, as illustrated in Figure 1-3.

Figure 1-3: Division of virtual address space.

Every user process in the system has its own virtual address range that extends from 0 to TASK_SIZE. The area above (from TASK_SIZE to 2^32 or 2^64) is reserved exclusively for the kernel and may not be accessed by user processes. TASK_SIZE is an architecture-specific constant that divides the address space in a given ratio; on IA-32 systems, for instance, the address space is divided at 3 GiB, so that the virtual address space for each process is 3 GiB, and 1 GiB is available to the kernel because the total size of the virtual address space is 4 GiB. Although the actual figures differ according to architecture, the general concepts do not. I therefore use these sample values in our further discussions.

This division does not depend on how much RAM is available. As a result of address space virtualization, each user process thinks it has 3 GiB of memory. The userspaces of the individual system processes are totally separate from each other. The kernel space at the top end of the virtual address space is always the same, regardless of the process currently executing.

Notice that the picture can be more complicated on 64-bit machines because these tend to use fewer than 64 bits to actually manage their huge principal virtual address space. Instead of 64 bits, they employ a smaller number, for instance, 42 or 47 bits. Because of this, the effectively addressable portion of the address space is smaller than the principal size. However, it is still larger than the amount of RAM that will ever be present in the machine, and is therefore completely sufficient. As an advantage, the CPU can save some effort because fewer bits are required to manage the effective address space than would be required to address the complete virtual address space. The virtual address space will contain holes that are not addressable in principle in such cases, so the simple situation depicted in Figure 1-3 is not fully valid. We will come back to this topic in more detail in Chapter 4.
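To make the split tangible, the following purely illustrative userspace sketch (not from the book) classifies addresses against the classic IA-32 split described above. The 3 GiB value for TASK_SIZE is hard-coded only for illustration; the real constant is architecture-specific and not visible to userspace in this form:

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: assumes the IA-32 3 GiB/1 GiB split used in the text. */
#define TASK_SIZE ((uintptr_t)0xC0000000)   /* 3 GiB */

static const char *region(uintptr_t addr)
{
    return addr < TASK_SIZE ? "userspace" : "kernel space";
}

int main(void)
{
    int local;

    printf("stack variable at %p lies in %s\n",
           (void *)&local, region((uintptr_t)&local));
    printf("address 0xC1000000 would lie in %s\n", region(0xC1000000));
    return 0;
}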
Privilege Levels

The kernel divides the virtual address space into two parts so that it is able to protect the individual system processes from each other. All modern CPUs offer several privilege levels in which processes can reside. There are various prohibitions in each level including, for example, execution of certain assembly language instructions or access to specific parts of the virtual address space. The IA-32 architecture uses a system of four privilege levels that can be visualized as rings. The inner rings are able to access more functions, the outer rings less, as shown in Figure 1-4.

Figure 1-4: Ring system of privilege levels.

Whereas the Intel variant distinguishes four different levels, Linux uses only two different modes: kernel mode and user mode. The key difference between the two is that access to the memory area above TASK_SIZE (that is, kernel space) is forbidden in user mode. User processes are not able to manipulate or read the data in kernel space. Neither can they execute code stored there. This is the sole domain of the kernel. This mechanism prevents processes from interfering with each other by unintentionally influencing each other's data.

The switch from user to kernel mode is made by means of special transitions known as system calls; these are executed differently depending on the system. If a normal process wants to carry out any kind of action affecting the entire system (e.g., manipulating I/O devices), it can do this only by issuing a request to the kernel with the help of a system call. The kernel first checks whether the process is permitted to perform the desired action and then performs the action on its behalf. A return is then made to user mode.

Besides executing code on behalf of a user program, the kernel can also be activated by asynchronous hardware interrupts, and is then said to run in interrupt context. The main difference to running in process context is that the userspace portion of the virtual address space must not be accessed. Because interrupts occur at random times, a random userland process is active when an interrupt occurs, and since that process will most likely have nothing to do with the cause of the interrupt, the kernel has no business with the contents of the current userspace. When operating in interrupt context, the kernel must be more cautious than normal; for instance, it must not go to sleep. This requires extra care when writing interrupt handlers and is discussed in detail in Chapter 2. An overview of the different execution contexts is given in Figure 1-5.

Besides normal processes, there can also be kernel threads running on the system. Kernel threads are also not associated with any particular userspace process, so they also have no business dealing with the user portion of the address space. In many other respects, kernel threads behave much more like regular userland applications, though: In contrast to a kernel operating in interrupt context, they may go to sleep, and they are also tracked by the scheduler like every regular process in the system. The kernel uses them for various purposes that range from data synchronization of RAM and block devices to helping the scheduler distribute processes among CPUs, and we will frequently encounter them in the course of this book.

Notice that kernel threads can be easily identified in the output of ps because their names are placed inside brackets:

wolfgang@meitner> ps fax
  PID TTY      STAT   TIME COMMAND
    2 ?        S<     0:00 [kthreadd]
    3 ?        S<     0:00  \_ [migration/0]
    4 ?        S<     0:00  \_ [ksoftirqd/0]
    5 ?        S<     0:00  \_ [migration/1]
    6 ?        S<     0:00  \_ [ksoftirqd/1]
    7 ?        S<     0:00  \_ [migration/2]
    8 ?        S<     0:00  \_ [ksoftirqd/2]
    9 ?        S<     0:00  \_ [migration/3]
   10 ?        S<     0:00  \_ [ksoftirqd/3]
   11 ?        S<     0:00  \_ [events/0]
   12 ?        S<     0:00  \_ [events/1]
   13 ?        S<     0:00  \_ [events/2]
   14 ?        S<     0:00  \_ [events/3]
   15 ?        S<     0:00  \_ [khelper]
...
15162 ?        S<     0:00  \_ [jfsCommit]
15163 ?        S<     0:00  \_ [jfsSync]
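To give an idea of how such threads come into existence, here is a minimal, hypothetical kernel module sketch (not from the book) that starts a kernel thread through the standard kthread interface; the names demo-worker and worker_fn are invented for the example. The thread appears in ps with its name in brackets and, unlike code running in interrupt context, is allowed to sleep:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    while (!kthread_should_stop()) {
        printk(KERN_INFO "demo-worker: still running\n");
        msleep(1000);                 /* kernel threads may sleep */
    }
    return 0;
}

static int __init demo_init(void)
{
    worker = kthread_run(worker_fn, NULL, "demo-worker");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(worker);             /* makes kthread_should_stop() true */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");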
Figure 1-5: Execution in kernel and user mode. Most of the time, the CPU executes code in userspace. When the application performs a system call, a switch to kernel mode is employed, and the kernel fulfills the request. During this, it may access the user portion of the virtual address space. After the system call completes, the CPU switches back to user mode. A hardware interrupt also triggers a switch to kernel mode, but this time, the userspace portion must not be accessed by the kernel.

On multiprocessor systems, many threads are started on a per-CPU basis and are restricted to run on only one specific processor. This is represented by a slash and the number of the CPU that are appended to the name of the kernel thread.

Virtual and Physical Address Spaces

In most cases, a single virtual address space is bigger than the physical RAM available to the system. And the situation does not improve when each process has its own virtual address space. The kernel and CPU must therefore consider how the physical memory actually available can be mapped onto virtual address areas.

The preferred method is to use page tables to allocate virtual addresses to physical addresses. Whereas virtual addresses relate to the combined user and kernel space of a process, physical addresses are used to address the RAM actually available. This principle is illustrated in Figure 1-6.

The virtual address spaces of both processes shown in the figure are divided into portions of equal size by the kernel. These portions are known as pages. Physical memory is also divided into pages of the same size.

Figure 1-6: Virtual and physical addresses.

The arrows in Figure 1-6 indicate how the pages in the virtual address spaces are distributed across the physical pages. For example, virtual page 1 of process A is mapped to physical page 4, while virtual page 1 of process B is mapped to the fifth physical page. This shows that virtual addresses change their meaning from process to process.

Physical pages are often called page frames. In contrast, the term page is reserved for pages in virtual address space.

Mapping between virtual address spaces and physical memory also enables the otherwise strict separation between processes to be lifted. Our example includes a page frame explicitly shared by both processes. Page 5 of A and page 1 of B both point to the physical page frame 5. This is possible because entries in both virtual address spaces (albeit at different positions) point to the same page. Since the kernel is responsible for mapping virtual address space to physical address space, it is able to decide which memory areas are to be shared between processes and which are not.
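The shared-page-frame situation can be reproduced with a small userspace sketch (not from the book): parent and child map the same anonymous page with MAP_SHARED, so a write by the child is visible to the parent because both virtual mappings refer to the same page frame.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    /* One page of anonymous memory, shared between parent and child. */
    char *page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    if (fork() == 0) {
        strcpy(page, "written by the child");
        return 0;
    }

    wait(NULL);
    printf("parent reads: %s\n", page);   /* sees the child's data */
    return 0;
}

Had the mapping been created with MAP_PRIVATE instead, the child's write would have triggered copy on write and remained invisible to the parent.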
The figure also shows that not all pages of the virtual address spaces are linked with a page frame. This may be either because the pages are not used or because data have not been loaded into memory since they are not yet needed. It may also be that the page has been swapped out onto hard disk and will be swapped back in when needed.

Finally, notice that there are two equivalent terms to address the applications that run on behalf of the user. One of them is userland, and this is the nomenclature typically preferred by the BSD community for all things that do not belong to the kernel. The alternative is to say that an application runs in userspace. It should be noted that the term userland will always mean applications as such, whereas the term userspace can not only denote applications, but also the portion of the virtual address space in which they are executed, in contrast to kernel space.

1.3.4 Page Tables

Data structures known as page tables are used to map virtual address space to physical address space. The easiest way of implementing the association between both would be to use an array containing an entry for each page in virtual address space. This entry would point to the associated page frame. But there is a problem. The IA-32 architecture uses, for example, 4 KiB pages; given a virtual address space of 4 GiB, this would produce an array with a million entries. On 64-bit architectures, the situation is much worse. Because each process needs its own page tables, this approach is impractical because the entire RAM of the system would be needed to hold the page tables.

As most areas of virtual address spaces are not used and are therefore not associated with page frames, a far less memory-intensive model that fulfills the same purpose can be used: multilevel paging.

To reduce the size of page tables and to allow unneeded areas to be ignored, the architectures split each virtual address into multiple parts, as shown in Figure 1-7 (the bit positions at which the address is split differ according to architecture, but this is of no relevance here). In the example, I use a split of the virtual address into four components, and this leads to a three-level page table. This is what most architectures offer. However, some employ four-level page tables, and Linux also adopts four levels of indirection. To simplify the picture, I stick to a three-level variant here.

Figure 1-7: Splitting a virtual address.

The first part of the virtual address is referred to as a page global directory or PGD. It is used as an index into an array that exists exactly once for each process. Its entries are pointers to the start of further arrays called page middle directories or PMD.

Once the corresponding array has been found by reference to the PGD and its contents, the PMD is used as an index into this array. The page middle directory likewise consists of pointers to further arrays known as page tables or page directories.

The PTE (or page table entry) part of the virtual address is used as an index into the page table. Mapping between virtual pages and page frames is achieved because the page table entries point to page frames.

The last part of the virtual address is known as an offset. It is used to specify a byte position within the page; after all, each address points to a uniquely defined byte in address space.

A particular feature of page tables is that no page middle tables or page tables need be created for areas of virtual address space that are not needed. This saves a great deal of RAM as compared to the single-array method.
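As a purely illustrative sketch (not from the book), the following program decomposes a 32-bit virtual address into the PGD, PMD, PTE, and offset components of such a three-level scheme; the bit widths chosen here are sample values, not those of any particular architecture:

#include <stdint.h>
#include <stdio.h>

/* Sample split: 12 offset bits (4 KiB pages), 9 bits each for PTE and PMD,
 * and the remaining 2 bits for the PGD. Real widths differ per architecture. */
#define OFFSET_BITS 12
#define PTE_BITS     9
#define PMD_BITS     9

int main(void)
{
    uint32_t vaddr = 0xC17FF123;

    uint32_t offset = vaddr & ((1 << OFFSET_BITS) - 1);
    uint32_t pte    = (vaddr >> OFFSET_BITS) & ((1 << PTE_BITS) - 1);
    uint32_t pmd    = (vaddr >> (OFFSET_BITS + PTE_BITS)) & ((1 << PMD_BITS) - 1);
    uint32_t pgd    =  vaddr >> (OFFSET_BITS + PTE_BITS + PMD_BITS);

    printf("PGD index %u, PMD index %u, PTE index %u, offset %u\n",
           pgd, pmd, pte, offset);
    return 0;
}

Each extracted index selects one entry in the corresponding directory, and the offset finally selects the byte within the page frame found in the last level.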
Of course, this method also has a downside. Each time memory is accessed, it is necessary to run through the entire chain to obtain the physical address from the virtual address. CPUs try to speed up this process in two ways:

1. A special part of the CPU known as a memory management unit (MMU) is optimized to perform referencing operations.

2. The addresses that occur most frequently in address translation are held in a fast CPU cache called a Translation Lookaside Buffer (TLB). Translation is accelerated because the address data in the cache are immediately available without needing to access the page tables and therefore the RAM.

While these caches are operated transparently on many architectures, some require special attention from the kernel, which especially implies that their contents must be invalidated whenever the contents of the page tables have been changed. Corresponding calls must be present in every part of the kernel that manipulates page tables. If the kernel is compiled for an architecture that does not require such operations, it automatically ensures that the calls are represented by do-nothing operations.

Interaction with the CPU

The IA-32 architecture uses a two-level-only method to map virtual addresses to physical addresses. The size of the address space in 64-bit architectures (Alpha, Sparc64, IA-64, etc.) mandates a three-level or four-level method, and the architecture-independent part of the kernel always assumes a four-level page table.

The architecture-dependent code of the kernel for two- and three-level CPUs must therefore emulate the missing levels by dummy page tables. Consequently, the remaining memory management code can be implemented independently of the CPU used.

Memory Mappings

Memory mappings are an important means of abstraction. They are used at many points in the kernel and are also available to user applications. Mapping is the method by which data from an arbitrary source are transferred into the virtual address space of a process. The address space areas in which mapping takes place can be processed using normal methods in the same way as regular memory. However, any changes made are transferred automatically to the original data source. This makes it possible to use identical functions to process totally different things. For example, the contents of a file can be mapped into memory. A process then need only read the contents of memory to access the contents of the file, or write changes to memory in order to modify the contents of the file. The kernel automatically ensures that any changes made are implemented in the file.

Mappings are also used directly in the kernel when implementing device drivers. The input and output areas of peripheral devices can be mapped into virtual address space; reads and writes to these areas are then redirected to the devices by the system, thus greatly simplifying driver implementation.
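The file-mapping case can be sketched with a few lines of userspace code (not from the book). The file name demo.txt is made up, and the example assumes the file exists and is not empty; a write through the mapping modifies the file itself:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.txt", O_RDWR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    data[0] = '#';          /* the kernel writes this change back to the file */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}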
1.3.5 Allocation of Physical Memory

When it allocates RAM, the kernel must keep track of which pages have already been allocated and which are still free in order to prevent two processes from using the same areas in RAM. Because memory allocation and release are very frequent tasks, the kernel must also ensure that they are completed as quickly as possible. The kernel can allocate only whole page frames. Dividing memory into smaller portions is delegated to the standard library in userspace. This library splits the page frames received from the kernel into smaller areas and allocates memory to the processes.

The Buddy System

Numerous allocation requests in the kernel must be fulfilled by a contiguous range of pages. To quickly detect where in memory such ranges are still available, the kernel employs an old, but proven technique: the buddy system.

Free memory blocks in the system are always grouped as two buddies. The buddies can be allocated independently of each other; if, however, both remain unused at the same time, the kernel merges them into a larger pair that serves as a buddy on the next level. Figure 1-8 demonstrates this using an example of a buddy pair consisting initially of two blocks of 8 pages.

Figure 1-8: The buddy system.

All buddies of the same size (1, 2, 4, 8, 16, . . . pages) are managed by the kernel in a special list. The buddy pair with two times 8 (16) pages is also in this list.

If the system now requires 8 page frames, it splits the block consisting of 16 page frames into two buddies. While one of the blocks is passed to the application that requested memory, the remaining 8 page frames are placed in the list for 8-page memory blocks.

If the next request requires only 2 contiguous page frames, the block consisting of 8 page frames is split into 2 buddies, each comprising 4 page frames. One of the blocks is put back into the buddy lists, while the other is again split into 2 buddies consisting of 2 pages each. One is returned to the buddy system, while the other is passed to the application.

When memory is returned by the application, the kernel can easily see by reference to the addresses whether a buddy pair is reunited and can then merge it into a larger unit that is put back into the buddy list, exactly the reverse of the splitting process. This increases the likelihood that larger memory blocks are available.

When systems run for longer periods (it is not unusual for servers to run for several weeks or even months, and many desktop systems also tend to reach long uptimes), a memory management problem known as fragmentation occurs. The frequent allocation and release of page frames may lead to a situation in which several page frames are free in the system but they are scattered throughout physical address space; in other words, there are no larger contiguous blocks of page frames, as would be desirable for performance reasons. This effect is reduced to some extent by the buddy system but not completely eliminated. Single reserved pages that sit in the middle of an otherwise large contiguous free range can prevent coalescing of this range very effectively. During the development of kernel 2.6.24, some effective measures were added to prevent memory fragmentation, and I discuss the underlying mechanisms in more detail in Chapter 3.
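Inside the kernel, a request to the buddy system for contiguous page frames could look like the following hypothetical fragment (not from the book); an allocation order of 3 asks for 2^3 = 8 contiguous page frames, matching the example above:

#include <linux/gfp.h>
#include <linux/mm.h>

static void buddy_demo(void)
{
    /* Ask the buddy system for 2^3 = 8 contiguous page frames. */
    struct page *block = alloc_pages(GFP_KERNEL, 3);

    if (!block)
        return;

    /* The memory could now be used, e.g. via page_address(block) ... */

    /* Return the block; the kernel may merge it with its buddy again. */
    __free_pages(block, 3);
}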
The Slab Cache

Often the kernel itself needs memory blocks much smaller than a whole page frame. Because it cannot use the functions of the standard library, it must define its own, additional layer of memory management that builds on the buddy system and divides the pages supplied by the buddy system into smaller portions. The method used not only performs allocation but also implements a generic cache for frequently used small objects; this cache is known as a slab cache. It can be used to allocate memory in two ways (see the sketch after this list):

1. For frequently used objects, the kernel defines its own cache that contains only instances of the desired type. Each time one of the objects is required, it can be quickly removed from the cache (and returned there after use); the slab cache automatically takes care of interaction with the buddy system and requests new page frames when the existing caches are full.

2. For the general allocation of smaller memory blocks, the kernel defines a set of slab caches for various object sizes that it can access using the same functions with which we are familiar from userspace programming; a prefixed k indicates that these functions are associated with the kernel: kmalloc and kfree.

While the slab allocator provides good performance across a wide range of workloads, some scalability problems with it have arisen on really large supercomputers. At the other end of the scale, the overhead of the slab allocator may be too much for really tiny embedded systems. The kernel comes with two drop-in replacements for the slab allocator that provide better performance in these use cases but offer the same interface to the rest of the kernel, such that it need not be concerned with which low-level allocator is actually compiled in. Since slab allocation is still the standard method of the kernel, I will, however, not discuss these alternatives in detail. Figure 1-9 summarizes the connections between the buddy system, the slab allocator, and the rest of the kernel.

Figure 1-9: Page frame allocation is performed by the buddy system, while the slab allocator is responsible for small-sized allocations and generic kernel caches.

Swapping and Page Reclaim

Swapping enables available RAM to be enlarged virtually by using disk space as extended memory. Infrequently used pages can be written to hard disk when the kernel requires more RAM. Once the data are actually needed, the kernel swaps them back into memory. The concept of page faults is used to make this operation transparent to applications. Swapped-out pages are identified by a special entry in the page table. When a process attempts to access a page of this kind, the CPU initiates a page fault that is intercepted by the kernel. The kernel then has the opportunity to swap the data on disk into RAM. The user process then resumes. Because it is unaware of the page fault, swapping in and out of the page is totally invisible to the process.

Page reclaim is used to synchronize modified mappings with underlying block devices; for this reason, it is sometimes referred to simply as writing back data. Once data have been flushed, the page frame can be used by the kernel for other purposes (as with swapping). After all, the kernel data structures contain all the information needed to find the corresponding data on the hard disk when they are again required.
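Both usage modes can be illustrated with a hypothetical kernel fragment (not from the book); the structure demo_object and the cache name are made up, and error handling is kept minimal:

#include <linux/slab.h>
#include <linux/errno.h>

struct demo_object {
    int id;
    char name[32];
};

static struct kmem_cache *demo_cache;

static int slab_demo(void)
{
    struct demo_object *obj;
    char *buf;

    /* 1. A dedicated cache holding only objects of one type. */
    demo_cache = kmem_cache_create("demo_object",
                                   sizeof(struct demo_object), 0, 0, NULL);
    if (!demo_cache)
        return -ENOMEM;
    obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);

    /* 2. Generic small allocations, analogous to malloc/free in userspace. */
    buf = kmalloc(128, GFP_KERNEL);

    /* ... the objects would be used here ... */

    kfree(buf);                          /* kfree(NULL) is a no-op */
    if (obj)
        kmem_cache_free(demo_cache, obj);
    kmem_cache_destroy(demo_cache);
    return 0;
}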
1.3.6 Timing

The kernel must be capable of measuring time and time differences at various points, when scheduling processes, for example. Jiffies are one possible time base. A global variable named jiffies_64 and its 32-bit counterpart jiffies are incremented periodically at constant time intervals. The various timer mechanisms of the underlying architectures are used to perform these updates; each computer architecture provides some means of executing periodic actions, usually in the form of timer interrupts. Depending on the architecture, jiffies is incremented with a frequency determined by the central constant HZ of the kernel. This is usually in the range between 100 and 1,000; in other words, the value of jiffies is incremented between 100 and 1,000 times per second.

Timing based on jiffies is relatively coarse-grained because 1,000 Hz is not an excessively large frequency nowadays. With high-resolution timers, the kernel provides additional means that allow for keeping time in the regime of nanosecond precision and resolution, depending on the capabilities of the underlying hardware.

It is possible to make the periodic tick dynamic. When there is little to do and no need for frequent periodic actions, it does not make sense to periodically generate timer interrupts that prevent the processor from powering down into deep sleep states. This is helpful in systems where power is scarce, for instance, laptops and embedded systems.

1.3.7 System Calls

System calls are the classical method of enabling user processes to interact with the kernel. The POSIX standard defines a number of system calls and their effect as implemented on all POSIX-compliant systems including Linux. Traditional system calls are grouped into various categories:

Process Management: creating new tasks, querying information, debugging
Signals: sending signals, timers, handling mechanisms
Files: creating, opening, and closing files, reading from and writing to files, querying information and status
Directories and Filesystem: creating, deleting, and renaming directories, querying information, links, changing directories
Protection Mechanisms: reading and changing UIDs/GIDs, and namespace handling
Timer Functions: timer functions and statistical information

Demands are placed on the kernel in all these functions. They cannot be implemented in a normal user library because special protection mechanisms are needed to ensure that system stability and/or security are not endangered. In addition, many calls are reliant on kernel-internal structures or functions to yield desired data or results; this also dictates against implementation in userspace. When a system call is issued, the processor must change the privilege level and switch from user mode to kernel mode. There is no standardized way of doing this in Linux, as each hardware platform offers specific mechanisms. In some cases, different approaches are implemented on the same architecture but depend on processor type. Whereas Linux uses a special software interrupt to execute system calls on IA-32 processors, the software emulation (iBCS emulator) of other Unix systems on IA-32 employs a different method to execute binary programs (for assembly language aficionados: the lcall7 or lcall27 gate). Modern variants of IA-32 also have their own assembly language instruction for executing system calls; this was not available on old systems and cannot therefore be used on all machines. What all variants have in common is that system calls are the only way of enabling user processes to switch on their own initiative from user mode to kernel mode in order to delegate system-critical tasks.
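Seen from userspace, such a transition can be triggered explicitly; the following small sketch (not from the book) issues the same system call once through the usual libc wrapper and once through the generic syscall() interface:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    pid_t via_libc    = getpid();                 /* libc wrapper */
    long  via_syscall = syscall(SYS_getpid);      /* raw system call */

    printf("getpid(): %d, syscall(SYS_getpid): %ld\n",
           (int)via_libc, via_syscall);
    return 0;
}

Both calls end up in the same kernel code path; only the mechanism for crossing the user/kernel boundary differs per architecture, as described above.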
1.3.8 Device Drivers, Block and Character Devices

The role of device drivers is to communicate with I/O devices attached to the system, for example, hard disks, floppies, interfaces, sound cards, and so on. In accordance with the classical Unix maxim that everything is a file, access is performed using device files that usually reside in the /dev directory and can be processed by programs in the same way as regular files. The task of a device driver is to support application communication via device files; in other words, to enable data to be read from and written to a device in a suitable way.

Peripheral devices belong to one of the following two groups:

1. Character Devices: Deliver a continuous stream of data that applications read sequentially; generally, random access is not possible. Instead, such devices allow data to be read and written byte-by-byte or character-by-character. Modems are classical examples of character devices.

2. Block Devices: Allow applications to address their data randomly and to freely select the position at which they want to read data. Typical block devices are hard disks because applications can address any position on the disk from which to read data. Also, data can be read or written only in multiples of block units (usually 512 bytes); character-based addressing, as in character devices, is not possible.

Programming drivers for block devices is much more complicated than for character devices because extensive caching mechanisms are used to boost system performance.

1.3.9 Networks

Network cards are also controlled by device drivers but assume a special status in the kernel because they cannot be addressed using device files. This is because data are packed into various protocol layers during network communication. When data are received, the layers must be disassembled and analyzed by the kernel before the payload data are passed to the application. When data are sent, the kernel must first pack the data into the various protocol layers prior to dispatch.

However, to support work with network connections via the file interface (in the view of applications), Linux uses sockets from the BSD world; these act as agents between the application, the file interface, and the network implementation of the kernel.

1.3.10 Filesystems

Linux systems are made up of many thousands or even millions of files whose data are stored on hard disks or other block devices (e.g., ZIP drives, floppies, CD-ROMs, etc.). Hierarchical filesystems are used; these allow stored data to be organized into directory structures and also have the job of linking other meta-information (owners, access rights, etc.) with the actual data. Many different filesystem approaches are supported by Linux: the standard filesystems Ext2 and Ext3, ReiserFS, XFS, VFAT (for reasons of compatibility with DOS), and countless more. The concepts on which they build differ drastically in part. Ext2 is based on inodes; that is, it makes a separate management structure known as an inode available on disk for each file. The inode contains not only all meta-information but also pointers to the associated data blocks. Hierarchical structures are set up by representing directories as regular files whose data section includes pointers to the inodes of all files contained in the directory. In contrast, ReiserFS makes extensive use of tree structures to deliver the same functionality.

The kernel must provide an additional software layer to abstract the special features of the various low-level filesystems from the application layer (and also from the kernel itself). This layer is referred to as the VFS (virtual filesystem or virtual filesystem switch). It acts as an interface downward (this interface must be implemented by all filesystems) and upward (for system calls via which user processes are ultimately able to access filesystem functions). This is illustrated in Figure 1-10.

Figure 1-10: Overview of how the virtual filesystem layer, filesystem implementations, and the block layer interoperate.
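One piece of that downward interface can be sketched as follows; this hypothetical fragment (not from the book) shows the file_operations structure with which a filesystem or driver announces its open and read implementations to the VFS. The demo_ names are invented, and the function bodies are stubs:

#include <linux/fs.h>
#include <linux/module.h>

static ssize_t demo_read(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
    return 0;                       /* no data: behaves like an empty file */
}

static int demo_open(struct inode *inode, struct file *file)
{
    return 0;
}

static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .open  = demo_open,
    .read  = demo_read,
};

When an application calls read() on a file or device file served by this code, the VFS ends up invoking demo_read on its behalf.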
1.3.11 Modules and Hotplugging

Modules are used to dynamically add functionality to the kernel at run time: device drivers, filesystems, network protocols, practically any subsystem[3] of the kernel can be modularized. This removes one of the significant disadvantages of monolithic kernels as compared with microkernel variants. Modules can also be unloaded from the kernel at run time, a useful aspect when developing new kernel components.

[3] With the exception of basic functions, such as memory management, which are always needed.

Basically, modules are simply normal programs that execute in kernel space rather than in userspace. They must also provide certain sections that are executed when the module is initialized (and terminated) in order to register and de-register the module functions with the kernel. Otherwise, module code has the same rights (and obligations) as normal kernel code and can access all the same functions and data as code that is permanently compiled into the kernel.

Modules are an essential prerequisite for supporting hotplugging. Some buses (e.g., USB and FireWire) allow devices to be connected while the system is running without requiring a system reboot. When the system detects a new device, the requisite dri