number seven of a series

Post on 31-Dec-2015


04/19/2023 1Out-of-the-Box Computing Patents pending

Number seven of a series

Drinking from the Firehose

Defense against malice and error - security and reliability in the Mill™ CPU Architecture

Naughty, naughty! Bad program, mustn’t do that!


Talks in this series

1. Encoding
2. The Belt
3. Memory
4. Prediction
5. Metadata and speculation
6. Execution
7. Security and reliability (you are here)
8. Specification
9. Software pipelines

Slides and videos of other talks are at:

http://ootbcomp.com/docs


The Mill CPU

The Mill is a new general-purpose commercial CPU family.

The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite.

This talk will explain:

• the Mill memory and security models
• how calls can cross security boundaries safely
• how to replace task switches – and save >1000X
• how to make most exploits impossible

Not all, mind you!


Caution!

Gross over-simplification!

This talk tries to convey an intuitive understanding to the non-specialist.

The reality is more complicated.

(we try not to over-simplify, but sometimes…)


Motivating example – buggy drivers

Device drivers need access to special parts of memory to make the device work – MMIO, on-device buffers, etc.

They shouldn’t have access to the OS or application state.

Ideally, each driver should be its own process, with relevant device-specific memory regions mapped in.

[Diagram: application, OS, driver, device]

Clean, simple – and too expensive


Mechanism vs. policy

This talk is about mechanism – how Mill security works.

It is not about policy – how the mechanism is used.

The Mill is a general-purpose CPU architecture.

It is not a Unix machine.
It is not a Windows, …, machine.
It is not a C machine.
It is not a Java, …, machine.

It is a platform in which each of those can implement their own security model.

To the extent that they have one.


Some philosophy

Security must be unobtrusive, unavoidable, and cheap

or it won’t be used.


All must have equal security, none more equal than others

No pigs on this farm.


The Mill protection model

You can see only what I give you

I can see only what you give me

Fast, cheap, no third-parties


What about the OS?

The operating system is an application, like any other.

There are no privileged operations.

There is no Supervisor Mode.

All protection is by memory address.

Byte address.


A review

protection vs. translation

No longer coupled


[Diagram: memory hierarchy. CPU core with decode, load/store FUs and retire stations; instruction caches I$0e, I$0f, I$1e, I$1f and data cache D$1 (Harvard level 1); iPLB and dPLB; L$2 (shared level 2); TLB; DRAM, ROM, MMIO; device controllers; devices]

View is representative. Actual hierarchy is configured in each chip specification.

The Mill uses virtual caching and the single address space model.

Memory hierarchy from 40,000 ft.

[Diagram: the same memory hierarchy, annotated. The caches and PLBs above the TLB use virtual addresses; DRAM, ROM, MMIO and the device controllers below the TLB use physical addresses]


Memory model

Traditional:

Program addresses must be translated to physical addresses before being looked up in cache.

[Diagram: load operation -> virtual address -> TLB (translation/protection, may fault; the bottleneck) -> physical address -> cache lines -> data -> CPU regs]

Mill:

All tasks use the same virtual addresses, no aliasing or translation across tasks or OS.

[Diagram: load operation -> virtual address -> cache lines -> data -> CPU belt; the PLB does the protection check and may fault]


Why put translation in front of the cache?

[Diagram (traditional): load operation -> virtual address -> TLB (translation/protection, may fault; the bottleneck) -> physical address -> cache lines -> data -> CPU regs]

To fit in 32-bit memory, different programs must overlap addresses (aliasing). Translation gives each program private memory, even while using the same bit patterns as pointers.

The cost: the TLB is on the critical path, so it must be very fast, which makes it small, power-hungry, and frequently multilevel. Big programs can see 20% or more TLB overhead.

Why put translation after the cache?

Mill:

All tasks use the same virtual addresses, no aliasing or translation across tasks or OS.

[Diagram (Mill): load operation -> virtual address -> cache lines -> data -> CPU belt; the PLB does the protection check and may fault]

TLB out of critical path, only referenced on cache misses and evicts; can be big, single-level, and low power.

Pointers can be passed to OS or other tasks without translation; simplifies sharing and protection for apps.

Protection checking done in parallel with cache access.
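A minimal sketch of why the TLB drops out of the critical path, assuming a toy virtually addressed cache. The class name, the 4096-byte page size and the dictionary-based TLB and DRAM are illustrative assumptions, not Mill details; the protection check is left out.

```python
PAGE = 4096  # assumed page size, for illustration only


class VirtualCache:
    """Toy virtually addressed cache: the TLB is consulted only on a miss."""

    def __init__(self, tlb, dram):
        self.lines = {}        # virtual address -> cached data
        self.tlb = tlb         # virtual page -> physical page (hypothetical)
        self.dram = dram       # physical address -> data
        self.tlb_lookups = 0

    def load(self, vaddr):
        if vaddr in self.lines:               # hit: no translation at all
            return self.lines[vaddr]
        self.tlb_lookups += 1                 # miss: translate, then fill
        page, offset = divmod(vaddr, PAGE)
        paddr = self.tlb[page] * PAGE + offset
        self.lines[vaddr] = self.dram[paddr]
        return self.lines[vaddr]
```

Repeated loads of the same address touch the TLB once, on the first miss.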


The address space

The address space is 60 bits. The other four bits in a pointer are not part of the address.

Regions are parts of the address space, not of memory. Regions may overlap.

A region has byte granularity.

The whole potential data space of a program, including unallocated heap, may be one region.

[Diagram: the address space from 0 to max, with one region marked by its bounds LWB and UPB]


Region descriptors

Regions have descriptors, kept in OS tables and cached in the PLB.

region desc: LWB, UPB, rights, IDs

A descriptor gives:

• location
• access rights (read, write, execute, portal, …)
• identifications

A user matching the identifications can reference the location in the way indicated by the rights.


Turf – a collection of regions

[Diagram: several regions of the address space grouped into one turf]

region desc: LWB, UPB, rights, turf ID

Each region descriptor carries a turf id.

A turf comprises all regions with descriptors carrying the turf id.

A turf has a non-forgeable, globally unique id.

A region descriptor contains only one turf id. But the same region can have several descriptors with different turf ids.

Region descriptor turf ids may be wild-carded.
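The access rule can be sketched as follows. This is an illustration under stated assumptions, not the hardware check: the names `RegionDescriptor`, `allowed` and `WILDCARD`, the use of `None` for a wild-carded id, and the exclusive upper bound are all hypothetical.

```python
from dataclasses import dataclass

WILDCARD = None  # hypothetical encoding of a wild-carded turf id


@dataclass(frozen=True)
class RegionDescriptor:
    lwb: int               # lower bound, inclusive byte address
    upb: int               # upper bound, exclusive (an assumption)
    rights: frozenset      # subset of {"r", "w", "x", "portal"}
    turf: object = WILDCARD


def allowed(descriptors, turf, addr, right):
    """True if any descriptor covers addr, carries the right, and matches the turf."""
    return any(
        d.lwb <= addr < d.upb
        and right in d.rights
        and d.turf in (WILDCARD, turf)
        for d in descriptors
    )
```

Overlapping descriptors simply add up: an access is permitted if at least one matching descriptor grants it.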


Threads – lines of execution

A thread runs in a turf – one turf at a time, but can change.

[Diagram: one thread moving between turf 5 and turf 17]

Note that the descriptors of a turf can describe overlapping regions, possibly with different rights.

Note that the descriptors of two different turfs can describe the same region, possibly with different rights.

A thread can see and use the regions of turf 5 while running in turf 5, and the regions of turf 17 while running in turf 17.

A register holds the current turf ID for the thread.

Many threads can be in the same turf concurrently.

[Diagram: regions in the address space; region desc: LWB, UPB, rights, turf ID, thread ID]

A thread also has a unique non-forgeable global id.

At power-up, hardware starts an initial thread in the All region, the whole 60-bit address space with all rights.

Your vision increases as you approach the All.
Swami Suchananda

A region belongs to a thread if the thread id is in the descriptor.

Region descriptor thread ids may be wild-carded.


Granting

Each thread runs in a turf, and has the rights of every region of that turf, as well as the thread’s own rights.

A thread can grant a subset of one of its regions to another turf or thread, with a subset of its rights.

owned desc: LWB, UPB, R/W, turf 17, thread *

granted desc: LWB, UPB, R, turf 22, thread 5

Granted region descriptors are pushed to the PLB.

Grant is a hardware operation.
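A sketch of the subset rule for grants, assuming a software model of what the slide describes as a hardware operation. `Desc`, `grant` and the use of `None` for a wild-carded id are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Desc:
    lwb: int
    upb: int
    rights: frozenset
    turf: object      # None stands for a wild-carded id (assumption)
    thread: object


def grant(owned, lwb, upb, rights, turf, thread):
    """Build a granted descriptor, refusing anything beyond the owned one."""
    rights = frozenset(rights)
    if not (owned.lwb <= lwb <= upb <= owned.upb):
        raise PermissionError("granted range must lie inside the owned region")
    if not rights <= owned.rights:
        raise PermissionError("granted rights must be a subset of the owned rights")
    return Desc(lwb, upb, rights, turf, thread)
```

With the descriptors shown on the slide, turf 17 holding R/W could hand read-only access to thread 5 in turf 22, but could not hand out execute rights it does not hold.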


The Region Table

Region descriptors are kept in the Region Table in memory and cached in the PLB. The table is an Augmented Interval Tree (Cormen 2001) searched by address range. Insertion, deletion and search are O(log N).

[Diagram: descriptors moving between the PLB and the Region Table]

Newly granted region descriptors have the Novel bit set in the PLB.


Evicted novel descriptors are copied to the Table.


Novel bit is not set in descriptors loaded from the table.


Evicted non-novel descriptors are discarded.


Revocation

Granted regions may be revoked, implicitly or explicitly.


Region descriptors pushed to the PLB have the Novel bit set.


Revoked novel descriptors are simply discarded.


Descriptors loaded from the table have the Novel bit clear.


Revoked non-novel descriptors are discarded in the PLB and lazily removed from the table.

By use of the Novel bit, the great majority of transient grants exist only in the PLB and never go to the Table.
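The Novel-bit bookkeeping can be sketched with a toy model. Everything here is an assumption made for illustration: the class name, the oldest-first eviction, and the eager table removal on revoke (the slides say the hardware removes lazily).

```python
class ProtectionCache:
    """Toy PLB in front of a Region Table, tracking the Novel bit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.plb = {}            # descriptor key -> Novel bit (insertion ordered)
        self.table = set()       # descriptors that have reached the table
        self.table_writes = 0

    def _make_room(self):
        if len(self.plb) < self.capacity:
            return
        victim = next(iter(self.plb))        # oldest entry
        if self.plb.pop(victim):             # novel: copy it to the table
            self.table.add(victim)
            self.table_writes += 1
        # non-novel: the table already holds it, so it is just dropped

    def grant(self, key):
        self._make_room()
        self.plb[key] = True                 # pushed to the PLB with Novel set

    def load_from_table(self, key):
        if key not in self.table:
            raise KeyError(key)
        self._make_room()
        self.plb[key] = False                # loaded with Novel clear

    def revoke(self, key):
        novel = self.plb.pop(key, False)
        if not novel:
            self.table.discard(key)          # eager here, lazy on the slides
```

A grant that is revoked before it is evicted never causes a table write, which is the point of the last slide in this group.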


Avoiding the PLB

Well Known Regions


Every turf has three Well Known Region descriptors held in registers, not in the PLB: code, data, and constant pool.

[Diagram: a load module mapped in memory as binary code, constants and initialized data; registers cpReg, cppReg and dpReg]

Well Known Regions are created by the loader.


Every thread has two Well Known Region descriptors held in registers, not the PLB: stack and Thread Local.

[Diagram: the stack as a sequence of frames between base and limit, with fpReg and spReg marking the current frame; load(ptr,,,) operations are checked against the data and stack regions]

The stack region covers only between base and spReg.

The stack region dynamically adjusts to track call/return.
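A small sketch of a stack region that tracks call and return. It assumes the stack grows upward from base toward limit and that frames are sized in bytes; `StackRegion` and its method names are hypothetical.

```python
class StackRegion:
    """Toy stack Well Known Region: only [base, sp) is accessible."""

    def __init__(self, base, limit):
        self.base = base
        self.limit = limit
        self.sp = base                       # empty stack

    def call(self, frame_size):
        if self.sp + frame_size > self.limit:
            raise MemoryError("stack overflow")
        self.sp += frame_size                # region grows with the new frame

    def ret(self, frame_size):
        self.sp -= frame_size                # region shrinks again

    def accessible(self, addr):
        return self.base <= addr < self.sp
```

After a return, the addresses of the exited frame fall outside the region, so leftover data there cannot be read.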

Hardware initializes every new frame to zero (see http://ootbcomp.com/docs/Memory).

Beyond the top is inaccessible.

You cannot browse in stack rubble.

Nor can anyone else.


Smash and grab

stack protection


Return-oriented programming is an exploit that permits an attacker to execute arbitrary code, even if all code is in a ROM and the hardware prevents execution of data.

It works by smashing the stack (typically a buffer overrun) and then changing the return addresses saved on the stack to point to the desired instructions already in memory.

The target instruction(s) must be followed by a return instruction, which follows another modified address on the stack to the next instructions the attacker wants to execute.

Various defenses make these attacks harder to do

None make them impossible.


Mill spiller stack

The Mill has a stack for application data

[Diagram: the application's stack region holds only data frames; the spiller engine in the core saves state to a separate spiller space]

Mill program state is not kept on the data stack.

Return addresses and other state are in spiller space, not in the app.

Return-oriented exploits are impossible on a Mill.
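The separation can be illustrated with a toy model in which application stores can reach only the data stack. `ToyMill`, the 64-byte stack and the list standing in for spiller space are assumptions for the sketch, not a description of the spiller.

```python
class ToyMill:
    """Toy model: data stack is app-visible, return addresses are not."""

    def __init__(self):
        self.data_stack = bytearray(64)      # what application stores can reach
        self._spiller = []                   # stands in for spiller space

    def call(self, return_address):
        self._spiller.append(return_address)

    def store(self, offset, payload):
        """Application store: bounded by the stack region."""
        end = offset + len(payload)
        if offset < 0 or end > len(self.data_stack):
            raise MemoryError("store outside the stack region")
        self.data_stack[offset:end] = payload

    def ret(self):
        return self._spiller.pop()
```

Even a store that fills the whole data stack leaves the saved return address untouched, because there is no address the application can use to reach it.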


How about debuggers?

Apps cannot see the call chain. Whence a backtrace?

[Diagram: the Application and its stack in app space; the Trace Service in trace space; the spiller in spiller space]

The Trace Service is a callable API that has read rights in Spiller space.

Trace will return spill state information about a frame to anyone who has read rights to the frame.


Service-oriented programming

services


A service is a secure, stateful, callable behavior provider.

A service is secure: you can’t tromp on it; it can’t tromp on you.

A service is stateful: it remembers what it was doing for you. It may still be working for you while you’re gone.

A service is callable: you reach it by a normal function call, not a task switch.

The cost is two cache loads per call.


Service access

A service function is accessed via a portal. A pointer to a portal can be called like any other function pointer.

Portal layout: entry, turf id, data, code, pool, …

The portal is one I$1 cache line, and one fetch to access. The whole line must have Portal permission.

A portal call:

• Spiller saves the Well Known Region descriptors
• Loads the WKR descriptors from the portal
• Switches the turf ID register to the new turf
• Calls to the entry address normally
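The four steps of a portal call can be sketched as below. The `Portal` and `Thread` classes are hypothetical, and saving the caller's turf id alongside the WKRs is an assumption made so the sketch can restore it on return.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Portal:
    entry: Callable[..., object]   # entry address, modelled as a function
    turf_id: int
    wkr: dict                      # e.g. {"code": ..., "data": ..., "pool": ...}


class Thread:
    def __init__(self, turf_id, wkr):
        self.turf_id = turf_id
        self.wkr = wkr
        self.spiller = []          # saved state, not visible to applications

    def portal_call(self, portal, *args):
        self.spiller.append((self.turf_id, self.wkr))   # 1. spiller saves WKRs
        self.wkr = portal.wkr                           # 2. load WKRs from portal
        self.turf_id = portal.turf_id                   # 3. switch turf ID register
        try:
            return portal.entry(self, *args)            # 4. call entry normally
        finally:
            self.turf_id, self.wkr = self.spiller.pop() # return restores caller
```

The same thread object runs on both sides of the call; only its turf id and Well Known Regions change.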


Service access

A portal call is not a process switch or thread switch.

[Diagram: thread 17 calls from the application through a portal into a service. The application and the service are in different turfs (turf 5 and turf 9), each with its own code, state and WKRs; the portal is reached through a PLB entry]

After a portal call, the same thread is running service code in the service environment, with no old access.


But that doesn’t quite work…

[Diagram: application and service frames interleaved on one stack]

You can get a fragmented stack if an application and a service call each other back, or services cross-call.

Lots of little regions give poor PLB performance.

Also: what happens on stack overflow?

A service should not be faulted just because a caller was close to its limit.


Stacklets

A thread in a service needs its own stack.

[Diagram: one logical stack made of separate stacklets for the application, service A and service B, linked by portal calls]

The logical stack of each thread is a chain of stacklets, one for each turf entered by a nested portal call.

A stacklet per turf solves the callback problem.

[Diagram, in three steps: the application portal-calls the service; the service back-calls the application; the application re-calls the service (nested). Each new frame goes into the stacklet of the turf being entered, and the stack WKR switches with it]

All frames of a turf/thread combination are adjacent in the stacklet; only one stack-WKR needed.

But – how can you allocate a stacklet in the middle of a portal call?


Stacklet allocation

Stacklets are universally allocated at computed addresses.

One sixteenth of the address space is reserved for stacklets.

Stacklets are laid out as a two-dimensional array indexed by turf and thread ID.

[Diagram: grid of stacklets with thread IDs 14 to 19 along one axis and turf IDs 34 to 37 along the other; the highlighted cell is {thread 17 in turf 36}]

Stacklet address layout (field boundaries at bits 63, 59, 55, 33, 11, 0): 0 | 0xf | thread ID | turf ID | 0

Example with 4KB stacklets: thread 17 in turf 5 has stacklet address 0xf00004400005000; after a portal call to turf 9, the stacklet address is 0xf00004400009000.

A portal call allocates address space. The space is implicitly zero.

http://ootbcomp.com/docs/memory
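The computed address can be reproduced from the two example values on the slide. The field positions below (thread ID at bit 34, turf ID at bit 12, 22 bits each, 0xf in the top nibble of the 60-bit address) are a reading of the slide's bit labels and examples, and the constant and function names are hypothetical.

```python
STACKLET_TAG = 0xF     # top nibble of the 60-bit address (bits 59..56)
THREAD_SHIFT = 34      # thread ID read as occupying bits 55..34
TURF_SHIFT = 12        # turf ID read as occupying bits 33..12
ID_BITS = 22           # width implied by those boundaries


def stacklet_address(thread_id, turf_id):
    """Compute the stacklet address for a thread/turf pair (4KB stacklets)."""
    if not (0 <= thread_id < 1 << ID_BITS and 0 <= turf_id < 1 << ID_BITS):
        raise ValueError("id does not fit in its field")
    return (
        (STACKLET_TAG << 56)
        | (thread_id << THREAD_SHIFT)
        | (turf_id << TURF_SHIFT)
    )
```

Because the address is a pure function of the two ids, no allocator has to run in the middle of a portal call.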


What about callbacks?

If every portal call started a new stack at the thread/turf address, then a callback would put its stack on top of the previous stack:

[Diagram: thread 17 in turf 5 makes a portal call to turf 9, which makes a portal call back to turf 5. Oops!]


The stacklet info block

Associated with each stacklet, and also at a computed address, is a cache-line sized info block with metadata.

Info block fields: TOS, base, limit.

The values are offsets from the computed stacklet address, biased so that all are zero for an unused stacklet.

A portal call writes the current stack WKR to the info block and fetches the new info block to update the stack WKR.

Because “empty” is all zero, and unbacked loads are implicitly zero, an unused stacklet is empty.

A portal call costs two fetches: the portal and the info block.
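One way the biasing could work is sketched below. The slides do not say how the bias is chosen, so the defaults (TOS and base at the stacklet address, limit one 4KB stacklet above it) and the function names are assumptions.

```python
STACKLET_SIZE = 4096   # assumed, matching the 4KB example
DEFAULTS = {"tos": 0, "base": 0, "limit": STACKLET_SIZE}  # assumed biases


def decode_info_block(stored, stacklet_addr):
    """Turn stored (biased) offsets into absolute stack WKR values.

    A missing field reads as zero, like an unbacked load.
    """
    return {
        field: stacklet_addr + bias + stored.get(field, 0)
        for field, bias in DEFAULTS.items()
    }


def encode_info_block(wkr, stacklet_addr):
    """Turn absolute stack WKR values back into biased offsets."""
    return {
        field: wkr[field] - stacklet_addr - bias
        for field, bias in DEFAULTS.items()
    }
```

An all-zero block decodes to an empty stack at the computed address, and a callback into a turf that is already on the chain picks up the saved TOS instead of starting over.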


Arguments

Some function arguments are passed in memory. The pass operation lets the callee see private caller data.

[Diagram: the app, in the app turf, calls write(fd, ptr, len) on the file service in the service turf; pass(ptr, len, r); makes the app's buffer visible to the service]

pass() pushes a transient thread permission to the PLB.

pass() regions are removed by the matching return().
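The lifetime of a passed region can be sketched as follows. `TransientGrants` is a hypothetical model: it assumes a pass() belongs to the next call made and is dropped when that call returns.

```python
class TransientGrants:
    """Toy model of pass(): permissions that live for one call."""

    def __init__(self):
        self.plb = []        # (lwb, upb, rights) entries currently in force
        self.pending = []    # passes made since the last call
        self.frames = []     # per-call lists of entries to drop on return

    def pass_(self, ptr, length, rights):
        entry = (ptr, ptr + length, frozenset(rights))
        self.plb.append(entry)               # pushed to the PLB at once
        self.pending.append(entry)

    def call(self):
        self.frames.append(self.pending)     # tie the passes to this call
        self.pending = []

    def ret(self):
        for entry in self.frames.pop():      # matching return removes them
            self.plb.remove(entry)

    def can(self, addr, right):
        return any(lo <= addr < hi and right in r for lo, hi, r in self.plb)
```

The service can read exactly the passed bytes during the call and nothing of the caller's afterwards.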


pass() considered nuisance

Most pass() calls will be buried in APIs.

program says:
    write(fd, ptr, len);        // calls RTS

RTS says:
    pass(ptr, len, r);
    writeSvc(fd, ptr, len);     // portal calls service

If many independent pieces must be passed (like a graph):

1. Allocate an arena
2. Build graph in arena
3. Pass whole arena


Implicit pass()

Explicit use of pass() works when the passed region is explicit in the source, but sometimes the callee must access implicitly passed data too: structs passed or returned by value; excess arguments; VARARGS, etc.

The args() operation reserves a portion of the frame for implicit arguments, setting the outp register.

    args({size in bytes});
    // initialization
    portal call(…);

The portal call implicitly passes the region between inp and the old sp.

[Diagram: Mill frame structure, showing fp, sp, outp and inp]

Motivating example – buggy drivers

Ideally, each driver should be its own process, with relevant device-specific memory regions mapped in.

[Diagram: application, OS, driver and device; successive builds label the pieces as services in separate turfs, reached by service portal calls]

Simple, clean – and cheap


A caution

Mill security regions are very big – and few. They secure entire data spaces and programs, not objects and functions.

Fine-granularity security, in which individual objects and functions can be isolated, requires a different model.

We wish the Mill could support object-level security, but that would require non-standard, non-commodity memory, and would break nearly every C program.

’Tis true, ’tis pity. And pity ’tis, ’tis true.
Wm. Shakespeare

We wouldn’t sell any.


Summary #1:

The Mill:

• The Mill uses virtual caching and the Single Address Space model.

• There are no privileged operations and no supervisor state; all protection is by address.

• Security environments, called turfs, can include arbitrary regions of the address space, with arbitrary rights.

• A thread can grant a region to a turf or another thread with a subset of its own rights.


Summary #2:

The Mill:

• Grants may be implicit or explicit, and can be revoked.

• Protection management uses OS tables and a hardware cache of region descriptors.

• Nearly all access has protection checked using one of five Well Known Regions held in hardware registers, rather than the more expensive general mechanism.

• A call across protection domains (portal call) costs two fetches more than a regular call.


Summary #3:

The Mill:

• Both caller and callee are protected from each other.

• The return addresses of calls are safe from stack smashing exploits.

• The OS is yet another service.

• It costs nothing to remove device drivers from privileged state and the OS.


Shameless plug

For technical info about the Mill CPU architecture:

http://ootbcomp.com/docs

To sign up for future announcements, white papers etc.:

http://ootbcomp.com/mailing-list