Windows Kernel and Memory/IO Subsystem


Page 1: Windows kernel and memory io subsystem

Windows Kernel

Sisimon Soman

Page 2: Windows kernel and memory io subsystem

Lord of the Rings

• The x86 processor has four layers of protection, called Ring 0 through Ring 3.

• Privileged code (the kernel) runs in Ring 0. The processor ensures that privileged instructions (such as enabling/disabling interrupts) execute only in kernel mode.

• User applications run in Ring 3.

• Ring 1 is where the hypervisor lives (Rings 1 and 2 are otherwise rarely used).

Page 3: Windows kernel and memory io subsystem

Rings continued..

Page 4: Windows kernel and memory io subsystem

How system calls work

• User code cannot enter kernel space directly using a jmp or call instruction.

• When a program makes a system call (like CreateFile or ReadFile), the OS enters kernel mode (Ring 0) using the instruction int 2E (an interrupt gate).

• The code segment descriptor contains the 'Ring' at which the code may run. For kernel-mode modules it is always Ring 0. If a user-mode program tries to do 'jmp <kernel mode address>', it causes an access violation, because the segment descriptor flags say the processor must be in Ring 0.

• Because kernel mode is entered frequently (most Windows API calls end up in kernel mode), sysenter was introduced as an optimized instruction for entering kernel mode.

Page 5: Windows kernel and memory io subsystem

System Call continued..

• Windows maintains a system service dispatch table, which is similar to the IDT. Each entry in the system service table points to a kernel-mode system call routine.

• The int 2E handler probes and copies the parameters from the user-mode stack to the thread's kernel-mode stack, then looks up and executes the correct system call routine from the system service table.

• There are multiple system service tables: one for the NT native APIs, and one for the windowing and GDI services (win32k), etc.

Page 6: Windows kernel and memory io subsystem

System call mechanism..

Page 7: Windows kernel and memory io subsystem
Page 8: Windows kernel and memory io subsystem

Let's try it in WinDBG..

• NtWriteFile:
      mov  eax, 0x0E   ; build 2195 system service number for NtWriteFile
      mov  ebx, esp    ; point to parameters
      int  0x2E        ; execute system service trap
      ret  0x2C        ; pop parameters off stack and return to caller

Page 9: Windows kernel and memory io subsystem

Software Interrupt Request Levels (IRQLs)

• Windows has its own interrupt priority scheme, known as IRQL.

• IRQLs range from 0 to 31; a higher number means a higher-priority interrupt level.

• The HAL maps hardware interrupts to IRQL 3 (Device 1) through IRQL 31 (High).

• When a higher-priority interrupt occurs, it masks all lower-priority interrupts and the ISR for the higher-priority interrupt executes.

• After the ISR executes, the kernel lowers the interrupt level and runs the ISRs of any pending lower-priority interrupts.

• An ISR should do minimal work and defer the bulk of the processing to a Deferred Procedure Call (DPC), which runs at the lower IRQL 2.

Page 10: Windows kernel and memory io subsystem

Software Interrupt Request Levels (IRQLs)

Page 11: Windows kernel and memory io subsystem

IRQL and DPC

• The DPC concept is similar to other OSes; in Linux it is called the bottom half.

• DPC queues are per processor, so a dual-processor SMP box has two DPC queues.

• The ISR generally fetches data from the hardware and queues a DPC for further processing (see the sketch below).

• IRQL priority is different from thread scheduling priority.
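A minimal WDM-style sketch of this ISR-to-DPC handoff, assuming a fictitious device: the device extension, routine names, and latched value are hypothetical; KeInitializeDpc, KeInsertQueueDpc, and the ISR/DPC signatures are the real kernel interfaces.

    #include <wdm.h>

    typedef struct _MY_DEVICE_EXTENSION {      // hypothetical per-device data
        KDPC  Dpc;                             // DPC object, initialized once with KeInitializeDpc
        ULONG LatchedData;                     // data latched by the ISR for the DPC
    } MY_DEVICE_EXTENSION, *PMY_DEVICE_EXTENSION;

    // Runs at device IRQL: do the minimum, then hand the rest to a DPC.
    BOOLEAN MyInterruptService(PKINTERRUPT Interrupt, PVOID ServiceContext)
    {
        PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)ServiceContext;
        UNREFERENCED_PARAMETER(Interrupt);
        ext->LatchedData = 0x1234;                   // hypothetical: read/acknowledge the hardware
        KeInsertQueueDpc(&ext->Dpc, NULL, NULL);     // defer the bulk of the work to IRQL 2
        return TRUE;                                 // the interrupt was ours
    }

    // Runs at IRQL 2 (DISPATCH_LEVEL): the deferred processing.
    VOID MyDpcRoutine(PKDPC Dpc, PVOID DeferredContext, PVOID SysArg1, PVOID SysArg2)
    {
        PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)DeferredContext;
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(SysArg1);
        UNREFERENCED_PARAMETER(SysArg2);
        // ...process ext->LatchedData, complete IRPs, etc...
    }

    // At device initialization (PASSIVE_LEVEL):
    //     KeInitializeDpc(&ext->Dpc, MyDpcRoutine, ext);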

Page 12: Windows kernel and memory io subsystem

IRQL and DPC

• The scheduler (dispatcher) also runs at IRQL 2.

• So code that executes at or above IRQL 2 (dispatch level) cannot be preempted by the scheduler.

• From the diagram, only hardware interrupts and a few higher-priority interrupts such as clock and power fail are above IRQL 2.

• Most of the time the OS runs at IRQL 0 (passive level).

• All user programs and most kernel code execute at passive level only.

Page 13: Windows kernel and memory io subsystem

IRQL continued..

• The scheduler runs at IRQL 2, so what happens if my driver tries to wait at or above dispatch level?

• Simple: the system crashes with a 'Blue Screen', usually with the bug check IRQL_NOT_LESS_OR_EQUAL.

• Because if you wait at or above dispatch level, there is no one left to come in and switch the thread.

• What happens if you try to access paged pool at or above dispatch level?

• If the pages are out on disk, a page fault exception occurs; the current thread must wait while the page fault handler reads the pages from the page file into page frames in memory.

• If the page fault happens at or above dispatch level, there is no one to stop the current thread and schedule the page fault handler. Thus paged pool cannot be accessed at or above dispatch level.
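A minimal sketch of this rule in a driver, assuming a hypothetical helper name and pool tag; KeGetCurrentIrql, DISPATCH_LEVEL, and ExAllocatePoolWithTag are the real kernel APIs.

    #include <wdm.h>

    // Hypothetical helper: only touch paged pool where a page fault can be serviced.
    PVOID MyAllocatePagedBuffer(SIZE_T Size)
    {
        if (KeGetCurrentIrql() >= DISPATCH_LEVEL) {
            // A page fault here could not be handled (no scheduler to switch threads),
            // so the caller must use non-paged pool instead.
            return NULL;
        }
        return ExAllocatePoolWithTag(PagedPool, Size, 'fuBM');   // 'MBuf' tag, hypothetical
    }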

Page 14: Windows kernel and memory io subsystem

IRQL 1 - APCs

• Asynchronous Procedure Calls (APCs) run at IRQL 1.

• The main duty of an APC is to deliver data back into the context of the user thread.

• The APC queue is thread specific; each thread has its own APC queue.

• A user-space thread initiates a read operation on a device and either waits for it to finish or continues with other work.

• The IO may finish some time later; the buffer then needs to be delivered in the calling thread's process context. That is the job of the APC.

Page 15: Windows kernel and memory io subsystem

IO Manager

Page 16: Windows kernel and memory io subsystem

[Diagram] An app issues ReadFile in user land; NtReadFile crosses into kernel land, where the IO Manager creates an IRP and sends it down the driver stack: File System → Volume Manager → Disk Class Driver → Hardware Driver.

Page 17: Windows kernel and memory io subsystem

What is IO Request Packet (IRP)

• An IO operation passes through:
  – Different stages.
  – Different threads.
  – Different drivers.

• The IRP encapsulates the IO request.

• The IRP is thread independent.

Page 18: Windows kernel and memory io subsystem

IO Request Packet (IRP)

• When a thread initiates an IO operation, the IO Manager creates a data structure called the IO Request Packet (IRP).

• The IRP contains all the information about the request.

• The IO Manager sends the IRP to the top device in the driver stack.

• Demo: !irpfind to see all current IRPs.

• Demo: !irp <irp address> to see information about one IRP.

Page 19: Windows kernel and memory io subsystem

IRP Continued..

• Compare the IRP with the Windows message (MSG) structure.

• Each driver in the stack does its own part of the work and finally forwards the IRP to the lower driver in the stack.

• An IRP can be processed synchronously or asynchronously.

Page 20: Windows kernel and memory io subsystem

IRP Continued..

• Usually the lower-level hardware driver takes the most time. The H/W driver can mark the IRP as pending and return.

• When the hardware finishes the IO, the H/W driver completes the IRP by calling IoCompleteRequest().

• IoCompleteRequest() calls the IO completion routines set by the drivers in the stack and completes the IO (see the sketch below).
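A minimal sketch of the pend-then-complete pattern, assuming a hypothetical hardware driver; the dispatch and finish routines are made up, while IoMarkIrpPending, STATUS_PENDING, and IoCompleteRequest are the real pieces.

    #include <wdm.h>

    // Dispatch routine: the hardware will take a while, so pend the IRP and return.
    NTSTATUS MyDispatchRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        UNREFERENCED_PARAMETER(DeviceObject);
        IoMarkIrpPending(Irp);              // tell the IO Manager this IRP completes later
        // ...program the hardware to start the transfer...
        return STATUS_PENDING;
    }

    // Called later, when the hardware signals completion (for example from the DPC).
    VOID MyFinishIo(PIRP Irp, ULONG BytesTransferred)
    {
        Irp->IoStatus.Status = STATUS_SUCCESS;
        Irp->IoStatus.Information = BytesTransferred;   // bytes actually transferred
        IoCompleteRequest(Irp, IO_NO_INCREMENT);        // runs the stacked completion routines
    }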

Page 21: Windows kernel and memory io subsystem

Structure of IRP

• Fixed IRP Header

• Variable number of stack locations – one IO stack location per driver in the stack.

[Diagram] IRP layout: IRP Header, followed by Stack Location 1, Stack Location 2, Stack Location 3, …, Stack Location N.

Page 22: Windows kernel and memory io subsystem

Flow of IRP

[Diagram] An IRP for the storage stack (IRP Header plus Stack Locations 1–4) is forwarded down the stack: File System → Volume Manager → Disk Class Driver → Hardware Driver, each driver passing the IRP to the lower driver in the stack.

Page 23: Windows kernel and memory io subsystem

Flow of IRP Completion

[Diagram] Completion of the same IRP travels back up the storage stack: the Hardware Driver completes the IRP, and the completion routines of the Disk Class Driver, Volume Manager, and File System are called in turn while the IRP is being completed.

Page 24: Windows kernel and memory io subsystem

IRP Header

• IO buffer information.

• Flags
  – Paging IO flag
  – Non-cached IO flag

• IO status – on completion this is set to indicate the IO has completed.

• IRP cancel routine

Page 25: Windows kernel and memory io subsystem

IRP Stack Location

• The IO Manager obtains the driver count of the stack from the top device in the stack.

• While creating the IRP, the IO Manager allocates IO stack locations equal to the device count recorded in the top device object (see the sketch below).
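A minimal sketch, assuming a driver that builds its own IRP for a lower stack; the function name is hypothetical, while IoAllocateIrp and the device object's StackSize field are real.

    #include <wdm.h>

    // Allocate an IRP with one IO stack location per driver below us, as recorded
    // in the StackSize field of the top device object of the target stack.
    PIRP MyBuildIrp(PDEVICE_OBJECT TopOfStack)
    {
        PIRP irp = IoAllocateIrp(TopOfStack->StackSize, FALSE);
        if (irp == NULL) {
            return NULL;                                   // out of resources
        }
        // ...fill the first stack location via IoGetNextIrpStackLocation(irp)...
        return irp;
    }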

Page 26: Windows kernel and memory io subsystem

Contents of IO Stack Location

• IO Completion routine specific to the driver.

• File object specific to the request.

Page 27: Windows kernel and memory io subsystem

Asynchronous IO

• CreateFile(…, FILE_FLAG_OVERLAPPED, …), ReadFile(…, LPOVERLAPPED)

• When the IO operation completes, the IO Manager signals the event in the LPOVERLAPPED structure (see the user-mode sketch below).
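A minimal user-mode sketch of this pattern; the file path is hypothetical, and CreateFile, ReadFile, WaitForSingleObject, and GetOverlappedResult are the real Win32 calls.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        // Open the file for overlapped (asynchronous) IO; the path is made up.
        HANDLE h = CreateFileA("C:\\temp\\data.bin", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        OVERLAPPED ov = {0};
        ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);   // signaled on completion
        char buf[4096];
        DWORD bytes = 0;

        if (!ReadFile(h, buf, sizeof(buf), NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING) {
            return 1;                                        // immediate failure
        }

        // Do other work here, then wait for the IO Manager to signal the event.
        WaitForSingleObject(ov.hEvent, INFINITE);
        GetOverlappedResult(h, &ov, &bytes, FALSE);          // fetch the final byte count
        printf("read %lu bytes\n", bytes);

        CloseHandle(ov.hEvent);
        CloseHandle(h);
        return 0;
    }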

Page 28: Windows kernel and memory io subsystem

How Async IO work in Kernel

• The lower-layer driver completes the IRP in an arbitrary thread context.

• The IO Manager calls the IO completion routines in reverse order.

• If the operation is asynchronous, the IO Manager queues an APC targeted at the initiating thread.

• The APC carries the complete buffer and size information.

• The APC executes later in the context of the initiating thread, where it copies the buffer to user space and triggers the event set by the app.

Page 29: Windows kernel and memory io subsystem

Common issues related to IRPs

• After forwarding the IRP down the stack, do not touch it (except from the IO completion routine).

• If a lower driver marks the IRP as pending, all drivers above it should do the same.

• If a middle-level driver needs to keep the IRP for further processing after the lower driver has completed it, it can return STATUS_MORE_PROCESSING_REQUIRED from its completion routine (see the sketch below).

• The middle-layer driver must then complete the IRP itself later.

• See the ReactOS source code (instead of reading a 20-page doc).

• FastIO – concepts.
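A minimal sketch of these rules in a middle-layer (filter-style) driver; the routine names are hypothetical, while IoCopyCurrentIrpStackLocationToNext, IoSetCompletionRoutine, IoCallDriver, IoMarkIrpPending, and STATUS_MORE_PROCESSING_REQUIRED are the real pieces.

    #include <wdm.h>

    // Completion routine of a middle-layer driver that wants to keep the IRP.
    NTSTATUS MyCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
    {
        UNREFERENCED_PARAMETER(DeviceObject);
        UNREFERENCED_PARAMETER(Context);
        if (Irp->PendingReturned) {
            IoMarkIrpPending(Irp);          // propagate the pending status up the stack
        }
        // Keep the IRP for further processing; this driver must complete it later
        // with IoCompleteRequest once it is done.
        return STATUS_MORE_PROCESSING_REQUIRED;
    }

    // Forwarding an IRP down the stack with the completion routine attached.
    NTSTATUS MyForwardIrp(PDEVICE_OBJECT LowerDevice, PIRP Irp)
    {
        IoCopyCurrentIrpStackLocationToNext(Irp);
        IoSetCompletionRoutine(Irp, MyCompletion, NULL, TRUE, TRUE, TRUE);
        return IoCallDriver(LowerDevice, Irp);   // do not touch the IRP after this call
    }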

Page 30: Windows kernel and memory io subsystem

Memory and Cache Manager

Page 31: Windows kernel and memory io subsystem

Locality Theory

• If page/cluster n is accessed, there is a high probability that blocks near n will be accessed soon.

• All memory-based computing systems work on this principle.

• Windows has registry keys to configure how many blocks/pages to prefetch.

• Application-specific memory managers (databases, multimedia workloads) do application-aware prefetching.

Page 32: Windows kernel and memory io subsystem

Virtual Memory Manager (VMM)

• Apps feel memory is unlimited – magic done by the VMM.

• Multiple apps run concurrently without interfering with each other's data.

• Each app feels the entire resource is its own.

• Protects OS memory from apps.

• Advanced apps may need to share memory; the VMM provides an easy memory-sharing mechanism.

Page 33: Windows kernel and memory io subsystem

VMM Continued..

• The VMM reserves a certain amount of the address space for the kernel.

• On a 32-bit box, 2 GB for the kernel and 2 GB for user apps.

• A specific area of kernel memory reserved to store process-specific data such as the PDE and PTEs is called hyperspace.

Page 34: Windows kernel and memory io subsystem

Segmentation and Paging

• The x86 processor supports both segmentation and paging.

• Paging can be enabled or disabled, but segmentation is always enabled.

• Windows uses paging.

• Since segmentation cannot be disabled, Windows defines segments that span the entire address space (also called 'flat segments').

Page 35: Windows kernel and memory io subsystem

Paging

• The entire physical memory is divided into equal-size pages (4 KB on x86 platforms). These are called 'page frames', and the list describing them is the 'page frame database' (PF DB).

• The PF DB also contains flags such as read/write underway, shared page, etc.

Page 36: Windows kernel and memory io subsystem

VMM Continued..

• The upper 2 GB kernel space is common to all processes.

• What does that mean? Half of the PDE is common to all processes!

• Experiment – look at the PDEs of two processes and verify that half of the entries are the same.

Page 37: Windows kernel and memory io subsystem

Physical to Virtual address translation

• Address translation works in both directions – when a page frame is written to the page file, the VMM must update the corresponding PDE/PTE to record that the page is on disk.

• Done by:
  – The Memory Management Unit (MMU) of the processor.
  – The VMM, which assists the MMU.

• The VMM keeps the PDE/PTE information and passes it to the MMU during a process context switch.

• The MMU translates virtual addresses to physical addresses.

• The CR3 register holds the physical address of the current process's page directory.

Page 38: Windows kernel and memory io subsystem

Translation Lookaside Buffer (TLB)

• Address translation is a costly operation.

• It happens frequently – whenever virtual memory is touched.

• The TLB keeps a list of the most frequently used address translations.

• The list is tagged by process ID.

• The TLB is a generic OS concept – the implementation is architecture dependent.

• Before doing an address translation, the MMU searches the TLB for the page frame.

Page 39: Windows kernel and memory io subsystem

Address Translation

• In a 32-bit x86 address, the 10 most significant bits select an entry in the page directory (PDE). The page directory therefore has 1024 entries (4 KB, at 4 bytes per entry).

• The next 10 bits select the PTE within that page table, which gives the starting address of the page frame. Each page table likewise has 1024 entries (4 KB).

• The remaining 12 bits address the byte within the page frame. The page size is therefore 4 KB (see the breakdown below).
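A minimal sketch of that 10/10/12 split for a sample address (non-PAE x86; the address value is just an example):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t va = 0x12345678;                 // arbitrary 32-bit virtual address

        uint32_t pde_index = va >> 22;            // top 10 bits: index into the page directory
        uint32_t pte_index = (va >> 12) & 0x3FF;  // next 10 bits: index into the page table
        uint32_t offset    = va & 0xFFF;          // low 12 bits: byte offset inside the 4 KB page

        printf("PDE index %u, PTE index %u, offset 0x%03X\n", pde_index, pte_index, offset);
        return 0;
    }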

Page 40: Windows kernel and memory io subsystem

What is a Zero Page

• Page frames are not specific to an app.

• If App1 writes sensitive data to page frame 1, and later the VMM pushes the page to the page file and attaches page frame 1 to App2, App2 could see the sensitive data.

• That would be a big security flaw, so the VMM keeps a zero page list.

• Pages cannot be zeroed at the moment memory is freed – that would be a performance problem.

• The VMM has a dedicated thread that activates when the system is in a low-memory situation, picks page frames from the free list, zeroes them, and pushes them to the zero page list.

• The VMM allocates memory from the zero page list.

Page 41: Windows kernel and memory io subsystem

Arbitrary Thread Context

• The top layer of the driver stack gets the request (IRP) in the context of the requesting process.

• Middle- or lower-layer drivers MAY get the request in any thread context (e.g. at IO completion) – whatever thread happens to be running.

• A user address in the IRP is only meaningful under the PDE/PTEs of the original process context.

Page 42: Windows kernel and memory io subsystem

Arbitrary Thread Context continued..

• How do we solve the issue?

• Note that half of the PDE (the kernel area) is common to all processes.

• If the buffer is somehow mapped into kernel memory (the upper half of the PDE), it is accessible from every process.

Page 43: Windows kernel and memory io subsystem

Mapping buffer to Kernel space

• Allocate kernel pool in the calling process context and copy the user buffer into that kernel space.

• Memory Descriptor List (MDL) – the most commonly used mechanism to keep the data reachable from kernel space (see the sketch below).
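A minimal sketch of the MDL approach, assuming a driver that locks down a user buffer; the wrapper name is hypothetical, while IoAllocateMdl, MmProbeAndLockPages, MmGetSystemAddressForMdlSafe, and IoFreeMdl are the real kernel APIs.

    #include <wdm.h>

    // Lock a user buffer and obtain a kernel (system) address for it, so the data
    // can be reached later from an arbitrary thread context.
    PVOID MyMapUserBuffer(PVOID UserBuffer, ULONG Length, PMDL *MdlOut)
    {
        PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
        if (mdl == NULL) {
            return NULL;
        }
        __try {
            // Must run in the context of the process that owns the buffer.
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return NULL;
        }
        *MdlOut = mdl;
        // The system address stays valid in any process context until
        // MmUnlockPages/IoFreeMdl are called.
        return MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    }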

Page 44: Windows kernel and memory io subsystem

Standby list

• To reclaim pages from a process, the VMM first moves the pages to the standby list.

• The VMM keeps them there for a predefined number of ticks.

• If the process references the same page, the VMM removes it from the standby list and gives it back to the process.

• The VMM frees pages from the standby list after the timeout expires.

• Pages on the standby list are neither free nor owned by a process.

• The VMM keeps minimum and maximum values for the free and standby page counts. If the counts go outside these limits, the appropriate events are signaled and the lists are adjusted.

Page 45: Windows kernel and memory io subsystem

Miscellaneous VMM Terms

• Paged Pool

• Non Paged Pool

• Copy on write (COW)

Page 46: Windows kernel and memory io subsystem

Cache Manager

Page 47: Windows kernel and memory io subsystem

Cache Manager concepts

• If disk heads ran at the speed of supersonic jets, a Cache Manager would not be required.

• Disk access is the main bottleneck that reduces system performance: CPUs and memory keep getting faster, but the disk is still in the stone age.

• A common concept in operating systems; Unix flavors call it the 'buffer cache'.

Page 48: Windows kernel and memory io subsystem

What Cache Manager does

• Keeps a system-wide cache of frequently used secondary-storage blocks.

• Provides read-ahead and write-back to improve overall system performance.

• With write-back, the Cache Manager combines multiple write requests and issues a single write to improve performance. There is a risk associated with write-back (buffered data can be lost if the system crashes before it is flushed).

Page 49: Windows kernel and memory io subsystem

How Cache Manager works

• The Cache Manager implements caching using memory mapping.

• The concept is similar to an app using a memory-mapped file.

• CreateFile(…, dwFlagsAndAttributes, …)

• dwFlagsAndAttributes == FILE_FLAG_NO_BUFFERING means "I don't want the Cache Manager" (see the user-mode sketch below).
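A minimal user-mode sketch of opting out of the Cache Manager; the path is hypothetical, and FILE_FLAG_NO_BUFFERING requires the buffer address, transfer size, and file offset to be sector aligned (VirtualAlloc returns page-aligned memory, which satisfies this).

    #include <windows.h>

    int main(void)
    {
        // Bypass the Cache Manager: reads go straight down the storage stack.
        HANDLE h = CreateFileA("C:\\temp\\raw.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        // Sector-aligned buffer; 4096 bytes covers common sector sizes.
        void *buf = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        DWORD bytes = 0;
        ReadFile(h, buf, 4096, &bytes, NULL);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }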

Page 50: Windows kernel and memory io subsystem

How Cache Manager works..

• The Cache Manager reserves an area in the upper 2 GB (x86 platform) system space.

• The number of pages reserved by the Cache Manager adjusts to the system's memory requirements.

• If the system runs many IO-intensive tasks, the cache size is dynamically increased.

• If the system is in a low-memory situation, the cache size is reduced.

Page 51: Windows kernel and memory io subsystem

How cached read operation works

[Diagram] Cached read path: (1) the app's cached read enters the File System from user space; (2) the File System gets the pages from the Cache Manager; (3) the Cache Manager sets up the memory mapping with the VMM; (4) touching the mapping raises a page fault; (5) the VMM reads the blocks from the disk stack (SCSI/Fibre Channel).

Page 52: Windows kernel and memory io subsystem

How cached write operation works

[Diagram] Cached write path: (1) the app's cached write enters the File System from user space; (2) the File System copies the pages to the Cache Manager; (3) the Cache Manager sets up the memory mapping and copies the data into VMM pages; (4) the VMM's modified page writer thread writes the data out later; (5) the blocks are written to the disk stack (SCSI/Fibre Channel).

Page 53: Windows kernel and memory io subsystem

Storage Stack Comparison – Windows vs. Linux

Windows storage stack:
  File System (NTFS)
  Cache Mgr
  Volume Manager
  Class Driver (disk.sys)
  Port Driver (ex: storport)
  MiniPort (Emulex HBA)

Linux storage stack:
  VFS
  File System (ext2, ext3, …)
  Cache Mgr
  Block Layer (LVM, RAID)
  Upper SCSI (Disk, CD)
  IO Scheduler
  SCSI Mid layer
  SCSI lower layer (HW)

Page 54: Windows kernel and memory io subsystem

Questions ?