number seven of a series


Upload: quincy-hooper

Post on 31-Dec-2015


Page 1: Number seven of a series

04/19/2023 1Out-of-the-Box Computing Patents pending

Number seven of a series

Drinking from the Firehose

Defense against malice and error - security and reliability in the Mill™ CPU Architecture

Naughty, naughty! Bad program, mustn’t do that!

Page 2: Number seven of a series


Talks in this series

1. Encoding
2. The Belt
3. Memory
4. Prediction
5. Metadata and speculation
6. Execution
7. Security and reliability (you are here)
8. Specification
9. Software pipelines

Slides and videos of other talks are at:

http://ootbcomp.com/docs

Page 3: Number seven of a series


The Mill CPU

The Mill is a new general-purpose commercial CPU family.

The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite.

This talk will explain:
• the Mill memory and security models
• how calls can cross security boundaries safely
• how to replace task switches, and save >1000X
• how to make most exploits impossible

Not all, mind you!

Page 4: Number seven of a series


Caution!

Gross over-simplification!

This talk tries to convey an intuitive understanding to the non-specialist.

The reality is more complicated.

(we try not to over-simplify, but sometimes…)

Page 5: Number seven of a series


Motivating example – buggy drivers

Device drivers need access to special parts of memory to make the device work – MMIO, on-device buffers, etc.

They shouldn’t have access to the OS or application state.

Ideally, each driver should be its own process, with relevant device-specific memory regions mapped in.

[Diagram: application, OS, driver, device.]

Clean, simple – and too expensive

Page 6: Number seven of a series


Mechanism vs. policy

This talk is about mechanism – how Mill security works.

It is not about policy – how the mechanism is used.

The Mill is a general-purpose CPU architecture.

It is not a Unix machine.
It is not a Windows, …, machine.
It is not a C machine.
It is not a Java, …, machine.

It is a platform in which each of those can implement their own security model.

To the extent that they have one.

Page 7: Number seven of a series


Some philosophy

Security must be unobtrusive, unavoidable, and cheap

or it won’t be used.

Page 8: Number seven of a series


Some philosophy

All must have equal security, none more equal than others

No pigs on this farm.

Page 9: Number seven of a series


The Mill protection model

You can see only what I give you

I can see only what you give me

Fast, cheap, no third-parties

Page 10: Number seven of a series


What about the OS?

The operating system is an application, like any other.

There are no privileged operations.

There is no Supervisor Mode.

All protection is by memory address.

Byte address.

Page 11: Number seven of a series


A review

protection vs. translation

No longer coupled

Page 12: Number seven of a series

Memory hierarchy from 40,000 ft.

[Diagram: the CPU core (decode, load/store FUs, retire stations) sits above the instruction caches (I$0e, I$0f, I$1e, I$1f) and the data cache (D$1), with the iPLB and dPLB alongside. Level 1 is Harvard; level 2 (L$2) is shared. Below L$2 is the TLB, then DRAM, ROM, and MMIO with its device controllers and devices.]

View is representative. Actual hierarchy is configured in each chip specification.

The Mill uses virtual caching and the single address space model.

Page 13: Number seven of a series

[Diagram: the memory hierarchy again, annotated. The caches and PLBs above the TLB use virtual addresses; DRAM, ROM and MMIO below the TLB use physical addresses.]

The Mill uses virtual caching and the single address space model.

Page 14: Number seven of a series


Memory model

Traditional:

Program addresses must be translated to physical addresses before being looked up in cache.

[Diagram (traditional): load operation, virtual address, TLB (translation/protection, may fault), physical address, cache, data lines to the CPU regs. The TLB is marked as the bottleneck.]

Mill:

All tasks use the same virtual addresses, no aliasing or translation across tasks or OS.

[Diagram (Mill): load operation, virtual address, cache, data lines to the belt. The PLB checks protection alongside the cache and may fault.]

Page 15: Number seven of a series


Why put translation in front of the cache?

[Diagram (traditional): load operation, virtual address, TLB (translation/protection, may fault), physical address, cache, data lines to the CPU regs. The TLB is marked as the bottleneck.]

To fit in 32-bit memory, different programs must overlap addresses (aliasing). Translation gives each program private memory, even while using the same bit patterns as pointers.

The cost: on the critical path, TLBs must be very fast, small, and power-hungry, and are frequently multilevel. Big programs can see 20% or more TLB overhead.

Page 16: Number seven of a series


Why put translation after the cache?

Mill:

All tasks use the same virtual addresses, no aliasing or translation across tasks or OS.

[Diagram (Mill): load operation, virtual address, cache, data lines to the belt. The PLB checks protection alongside the cache and may fault.]

TLB out of critical path, only referenced on cache misses and evicts; can be big, single-level, and low power.

Pointers can be passed to OS or other tasks without translation; simplifies sharing and protection for apps.

Protection checking done in parallel with cache access.
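A toy model may make the difference concrete. The sketch below is not Mill code: it simply counts TLB lookups for a stream of loads under the two arrangements, assuming a 64-byte cache line, a cache that never evicts, and a protection check modelled as a plain predicate. All names are illustrative.

```python
LINE = 64  # assumed cache-line size in bytes


def traditional_loads(addresses):
    """Translation precedes the cache, so every load consults the TLB."""
    tlb_lookups = 0
    for _va in addresses:
        tlb_lookups += 1            # translation is on the critical path
    return tlb_lookups


def mill_loads(addresses, allowed):
    """The cache is indexed by virtual address; the TLB is used only on a miss.

    Evictions (which also reference the TLB) are not modelled here.
    """
    cache, tlb_lookups = set(), 0
    for va in addresses:
        if not allowed(va):         # stand-in for the PLB protection check
            raise PermissionError(hex(va))
        line = va // LINE
        if line not in cache:       # miss: translate, then fill the line
            tlb_lookups += 1
            cache.add(line)
    return tlb_lookups
```

For 256 consecutive byte loads starting at a line boundary, the first function reports 256 lookups and the second reports 4, one per cache line touched.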

Page 17: Number seven of a series


The address space

The address space is 60 bits. The other four bits in a pointer are not part of the address.

Regions are parts of the address space, not of memory. Regions may overlap.

A region has byte granularity, running from a lower bound (LWB) to an upper bound (UPB).

The whole potential data space of a program, including unallocated heap, may be one region.

[Diagram: the address space from 0 to max, with a region marked by its LWB and UPB.]

Page 18: Number seven of a series


Region descriptors

Regions have descriptors, kept in OS tables and cached in the PLB.

A descriptor gives:
• location (LWB, UPB)
• access rights (read, write, execute, portal, …)
• identifications (IDs)

A user matching the identifications can reference the location in the way indicated by the rights.

[Diagram: a region desc with fields LWB, UPB, rights and IDs, describing part of the address space from 0 to max.]

Page 19: Number seven of a series


Turf – a collection of regions

A turf has a non-forgeable, globally unique id.

[Diagram: several regions of the address space grouped into a turf; each region desc has the fields LWB, UPB, rights and turf ID.]

Region descriptor turf ids may be wild-carded.

A region descriptor contains only one turf id. But the same region can have several descriptors with different turf ids.

Each region descriptor carries a turf id

A turf comprises all regions with descriptors carrying the turf id.

Page 20: Number seven of a series


Threads – lines of execution

turf 5

A thread runs in a turf – one turf at a time, but can change.

Page 21: Number seven of a series

[Diagram: turf 5 and turf 17, with overlapping regions.]

Note that the descriptors of a turf can describe overlapping regions, possibly with different rights.

Note that the descriptors of two different turfs can describe the same region, possibly with different rights.

Page 22: Number seven of a series

[Diagram: what a thread can see and use while running in turf 5, and what it can see and use while running in turf 17.]

A register holds the current turf ID for the thread.

Many threads can be in the same turf concurrently.

Page 23: Number seven of a series


Threads – lines of execution

[Diagram: regions of the address space; each region desc now has the fields LWB, UPB, rights, turf ID and thread ID.]

A thread also has a unique non-forgeable global id.

At power-up, hardware starts an initial thread in the All region, the whole 60-bit address space with all rights.

Your vision increases as you approach the All. (Swami Suchananda)

A region belongs to a thread if the thread id is in the descriptor.

Region descriptor thread ids may be wild-carded.
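The descriptor check described over the last few slides can be sketched in a few lines. This is only an illustration: the rights encoding, the exclusive upper bound, the use of None for a wild-carded id, and the rule that both the turf and thread fields must match are all assumptions, not details given in the talk.

```python
from dataclasses import dataclass
from typing import Optional

READ, WRITE, EXECUTE, PORTAL = 1, 2, 4, 8   # illustrative rights encoding


@dataclass(frozen=True)
class RegionDescriptor:
    lwb: int                 # lower bound, inclusive (byte address)
    upb: int                 # upper bound, exclusive (assumed convention)
    rights: int              # bitwise OR of READ, WRITE, EXECUTE, PORTAL
    turf: Optional[int]      # None stands for a wild-carded turf id
    thread: Optional[int]    # None stands for a wild-carded thread id

    def permits(self, addr, length, want, turf, thread):
        """True if this descriptor allows `want` access to [addr, addr+length)."""
        in_bounds = self.lwb <= addr and addr + length <= self.upb
        id_match = self.turf in (None, turf) and self.thread in (None, thread)
        return in_bounds and id_match and (self.rights & want) == want


def check(descriptors, addr, length, want, turf, thread):
    """An access is allowed if any descriptor permits it; otherwise it faults."""
    return any(d.permits(addr, length, want, turf, thread) for d in descriptors)
```

Because bounds are plain byte addresses, an access that straddles the upper bound of a region is rejected even if its first byte is inside.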

Page 24: Number seven of a series


Granting

Each thread runs in a turf, and has the rights of every region of that turf, as well as the thread’s own rights.

A thread can grant a subset of one of its regions to another turf or thread, with a subset of its rights.

owned desc: LWB, UPB, R/W, turf 17, thread *

granted desc: LWB, UPB, R, turf 22, thread 5

Granted region descriptors are pushed to the PLB.

Grant is a hardware operation.
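The subset rule for grants can be sketched as follows. On the Mill the grant is a hardware operation; this Python function only models the rule that the granted range and rights must lie within what the granter holds. The Desc type, the string rights and the exclusive upper bound are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Desc:
    lwb: int
    upb: int                      # exclusive upper bound (assumed)
    rights: frozenset             # e.g. frozenset({"r", "w"})
    turf: Optional[int]           # None = wildcard
    thread: Optional[int]         # None = wildcard


def grant(owned, lwb, upb, rights, turf=None, thread=None):
    """Derive a descriptor for a grantee from one the granter already holds."""
    rights = frozenset(rights)
    if not (owned.lwb <= lwb < upb <= owned.upb):
        raise PermissionError("granted range must lie inside the owned region")
    if not rights <= owned.rights:
        raise PermissionError("cannot grant rights the granter lacks")
    return Desc(lwb, upb, rights, turf, thread)
```

With the slide's example, a descriptor owned as R/W by turf 17 (any thread) can yield a read-only descriptor for thread 5 in turf 22, but not one carrying execute rights.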

Page 25: Number seven of a series


The Region Table

Region descriptors are kept in the Region Table in memory and cached in the PLB. The table is an Augmented Interval Tree (Cormen 2001) searched by address range. Insertion, deletion and search are O(log N).

[Diagram: PLB and Region Table.]

Newly granted region descriptors have a Novel bit in the PLB.

Page 26: Number seven of a series


Evicted novel descriptors are copied to the Table.

Page 27: Number seven of a series


Novel bit is not set in descriptors loaded from the table.

Page 28: Number seven of a series


Evicted non-novel descriptors are discarded.

Page 29: Number seven of a series


Revocation

Granted regions may be revoked, implicitly or explicitly.

[Diagram: PLB and Region Table.]

Region descriptors pushed to the PLB have the Novel bit set.

Page 30: Number seven of a series


Revoked novel descriptors are simply discarded.

Page 31: Number seven of a series


Descriptors loaded from the table have the Novel bit clear.

Page 32: Number seven of a series


Non-novel descriptors are discarded in the PLB, and lazily removed from the table.

By use of the Novel bit, the great majority of transient grants exist only in the PLB and never go to the Table.
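The Novel-bit rules from the Region Table and Revocation slides fit in a small model. This is a sketch under stated assumptions: the PLB is a fixed-capacity FIFO, descriptors are reduced to opaque keys, and the class and method names are invented for illustration.

```python
class ProtectionCache:
    """Toy PLB in front of a Region Table, tracking a Novel bit per entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.plb = {}          # key -> novel flag; dict order doubles as FIFO order
        self.table = set()     # stand-in for the in-memory Region Table
        self.stale = set()     # table entries awaiting lazy removal

    def _insert(self, key, novel):
        if key not in self.plb and len(self.plb) >= self.capacity:
            victim = next(iter(self.plb))      # evict the oldest entry
            if self.plb.pop(victim):           # novel victims are copied out
                self.table.add(victim)
            # non-novel victims are dropped; the table already has them
        self.plb[key] = novel

    def grant(self, key):
        self._insert(key, novel=True)          # new grants start in the PLB only

    def load_from_table(self, key):
        if key not in self.table or key in self.stale:
            raise KeyError(key)                # no such grant: the access faults
        self._insert(key, novel=False)

    def revoke(self, key):
        novel = self.plb.pop(key, False)
        if not novel and key in self.table:
            self.stale.add(key)                # table copy is removed lazily

    def sweep(self):
        self.table -= self.stale               # the deferred table cleanup
        self.stale.clear()
```

A grant that is revoked before it is ever evicted never reaches the table, which is the case the slide describes as the great majority of transient grants.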

Page 33: Number seven of a series


Avoiding the PLB

Well Known Regions

Page 34: Number seven of a series


Avoiding the PLB

Every turf has three Well Known Region descriptors held in registers, not in the PLB: code, data, and constant pool

[Diagram: a load module (binary code, constants, initialized data) mapped in memory. The cpReg, cppReg and dpReg registers describe the code, constant pool and data regions.]

Well Known Regions are created by the loader.

Page 35: Number seven of a series


Avoiding the PLB

Every thread has two Well Known Region descriptors held in registers, not the PLB: stack and Thread Local

[Diagram: the stack as a sequence of frames between base and limit; fpReg and spReg mark the top frame.]

The stack region covers only between base and spReg.

The stack region dynamically adjusts to track call/return.
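A small model shows what it means for the stack region to track call and return. This is a sketch, not the hardware mechanism: the upward growth direction, the explicit frame sizes and the saved-sp list are assumptions made to keep the example short.

```python
class StackRegion:
    """Toy stack Well Known Region: valid addresses lie between base and sp."""

    def __init__(self, base, limit):
        self.base, self.limit = base, limit
        self.sp = base              # empty stack; assumed to grow upward
        self.saved = []             # model bookkeeping: sp before each call

    def call(self, frame_size):
        if self.sp + frame_size > self.limit:
            raise MemoryError("stack overflow")
        self.saved.append(self.sp)
        self.sp += frame_size       # region grows to cover the new frame

    def ret(self):
        self.sp = self.saved.pop()  # region shrinks; the old frame is outside it

    def accessible(self, addr):
        return self.base <= addr < self.sp
```

After a return, an address inside the exited frame fails the check, which is the sense in which nothing beyond the top of the stack can be browsed.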

[Diagram labels: two load(ptr,,,) operations, one marked data and one marked stack region.]

Page 36: Number seven of a series

[Diagram: the stack as a sequence of frames between base and limit; fpReg and spReg mark the top frame.]

Hardware initializes every new frame to zero (see http://ootbcomp.com/docs/Memory).

Beyond the top is inaccessible.

You cannot browse in stack rubble.

Nor can anyone else.

Page 37: Number seven of a series


Smash and grab

stack protection

Page 38: Number seven of a series


Smash and grab

Return-oriented programming is an exploit that permits an attacker to execute arbitrary code, even if all code is in a ROM and the hardware prevents execution of data.

It works by smashing the stack (typically a buffer overrun) and then changing the return addresses saved on the stack to point to the desired instructions already in memory.

The target instruction(s) must be followed by a return instruction, which follows another modified address on the stack to the next instructions the attacker wants to execute.

Various defenses make these attacks harder to do

None make them impossible.

Page 39: Number seven of a series


Mill spiller stack

The Mill has a stack for application data

[Diagram: the stack region, a sequence of frames.]

Page 40: Number seven of a series


Mill spiller space

Mill program state is not kept on the data stack.

[Diagram: the core sends data to the frames of the stack region, and sends state through the spiller engine to spiller space.]

Return addresses and other state are in spiller space, not in the app.

Return-oriented exploits are impossible on a Mill

Page 41: Number seven of a series


How about debuggers?

Apps cannot see the call chain. Whence a backtrace?

[Diagram: app space holds the Application and its stack; trace space holds the Trace Service; spiller space holds the spiller state.]

The Trace Service is a callable API that has read rights in Spiller space.

Trace will return spill state information about a frame to anyone who has read rights to the frame.

Page 42: Number seven of a series


Service-oriented programming

services

Page 43: Number seven of a series


Service-oriented programming

A service is a secure, stateful, callable behavior provider.

A service is secure.
You can’t tromp on it; it can’t tromp on you.

A service is stateful.
It remembers what it was doing for you.
It may still be working for you while you’re gone.

A service is callable.
You reach it by a normal function call, not a task switch.

The cost is two cache loads per call

Page 44: Number seven of a series


Service access

A service function is accessed via a portal. A pointer to a portal can be called like any other function pointer.

Portal layout: entry, turf id, data, code, pool, …

The portal is one I$1 cache line, and one fetch to access. The whole line must have Portal permission.

A portal call:

• Spiller saves the Well Known Region descriptors
• Loads the WKR descriptors from the portal
• Switches the turf ID register to the new turf
• Calls to the entry address normally
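The four steps of a portal call can be sketched as an ordinary function. This is a model only: Portal, Thread and portal_call are illustrative names, the WKR descriptors are reduced to a dictionary, and restoring the caller's state on return is an assumption implied by, but not spelled out on, the slide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Portal:
    entry: Callable          # stands in for the entry address
    turf: int                # turf id the service runs in
    wkrs: dict               # the service's code/data/constant-pool descriptors


@dataclass
class Thread:
    turf: int
    wkrs: dict


def portal_call(thread, portal, *args):
    saved = (thread.turf, thread.wkrs)       # the spiller saves the caller's WKRs
    thread.wkrs = dict(portal.wkrs)          # load the WKR descriptors from the portal
    thread.turf = portal.turf                # switch the turf ID register
    try:
        return portal.entry(thread, *args)   # then call the entry point normally
    finally:
        thread.turf, thread.wkrs = saved     # assumed: return restores the caller's view
```

The same Thread object runs on both sides of the call; only its turf and Well Known Regions change, which is why this is a call and not a task switch.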

Page 45: Number seven of a series


Service access

A portal call is not a process switch or thread switch.

[Diagram: thread 17 in the application, with its code, state, WKR and stack frames, issuing a call. Labels include turf 9.]

Page 46: Number seven of a series

[Diagram: thread 17 calls through a portal, reached via a PLB entry, from the application (turf 5) into the service (turf 9). Each side has its own code, state, WKR and stack frames.]

After a portal call, the same thread is running service code in the service environment, with no old access.

Page 47: Number seven of a series


But that doesn’t quite work…

[Diagram: a single stack in which application frames and service frames are interleaved.]

You can get a fragmented stack if an application and a service call each other back, or services cross-call.

Lots of little regions gives poor PLB performance.

Also: what happens on stack overflow?

A service should not be faulted just because a caller was close to its limit.

Page 48: Number seven of a series


Stacklets

A thread in a service needs its own stack.

[Diagram: separate stacks for the application, service A and service B, linked by portal calls.]

The logical stack of each thread is a chain of stacklets, one for each turf entered by a nested portal call.

But - how can you allocate a stacklet in the middle of a portal call?

Page 49: Number seven of a series


Stacklets

A stacklet per turf solves the callback problem.

[Diagram: application portal-calls service. The application and the service each have their own frames and stack WKR.]

Page 50: Number seven of a series

[Diagram: service back-calls application. The new application frame goes in the application's stacklet.]

Page 51: Number seven of a series

[Diagram: application re-calls service (nested). The new service frame goes in the service's stacklet.]

All frames of a turf/thread combination are adjacent in the stacklet; only one stack-WKR needed.

But – how can you allocate a stacklet in the middle of a portal call?

Page 52: Number seven of a series


Stacklet allocation

Stacklets are universally allocated at computed addresses.

One sixteenth of the address space is reserved for stacklets.

Stacklets are laid out as a two-dimensional array indexed by turf and thread ID.

[Diagram: a grid of stacklets with thread IDs 14 to 19 across and turf IDs 34 to 37 down; the highlighted cell is {thread 17 in turf 36}.]

Page 53: Number seven of a series

Stacklet allocation

Stacklets are universally allocated at computed addresses.

[Diagram: stacklet address layout, with bit positions 63, 59, 55, 33, 11 and 0 marked. The fields are 0xf, the thread ID, the turf ID, and a zero offset.]

Example: the stack of thread 17 in turf 5 is at stacklet address 0xf00004400005000. A portal call to turf 9 uses the 4KB stacklet at 0xf00004400009000.

A portal call allocates address space. The space is implicitly zero.

http://ootbcomp.com/docs/memory
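The two example addresses are consistent with a simple bit layout, sketched below. The field positions (tag 0xf in bits 59..56, thread ID in bits 55..34, turf ID in bits 33..12, a 12-bit offset for a 4KB stacklet) are read off the slide's bit markers and checked against its two examples; treat them as an inference, since actual widths may differ per chip specification.

```python
STACKLET_TAG = 0xF        # top four address bits select the stacklet sixteenth
TAG_SHIFT = 56            # bits 59..56
THREAD_SHIFT = 34         # bits 55..34 (22 bits, as read off the slide)
TURF_SHIFT = 12           # bits 33..12 (22 bits); the low 12 bits are the 4KB offset
ID_MASK = (1 << 22) - 1


def stacklet_address(thread_id, turf_id):
    """Compute the base address of the stacklet for a (thread, turf) pair."""
    if not (0 <= thread_id <= ID_MASK and 0 <= turf_id <= ID_MASK):
        raise ValueError("id does not fit the assumed 22-bit field")
    return ((STACKLET_TAG << TAG_SHIFT)
            | (thread_id << THREAD_SHIFT)
            | (turf_id << TURF_SHIFT))


print(hex(stacklet_address(17, 5)))   # 0xf00004400005000
print(hex(stacklet_address(17, 9)))   # 0xf00004400009000
```

Because the address is a pure function of the two ids, no allocator has to run in the middle of a portal call.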

Page 54: Number seven of a series


What about callbacks?

If every portal call started a new stack at the thread/turf address, then a callback would put its stack on top of the previous stack:

portal call to turf 9

thread 17 in turf 5

portal call back to turf 5

Oops!

Page 55: Number seven of a series


The stacklet info block

Associated with each stacklet, and also at a computed address, is a cache-line sized info block with metadata.

Info block fields: TOS, base, limit.

The values are offsets from the computed stacklet address, biased so that all are zero for an unused stacklet.

A portal call writes the current stack WKR to the info block and fetches the new info block to update the stack WKR.

Because “empty” is all zero, and unbacked loads are implicitly zero, an unused stacklet is empty.

A portal call costs two fetches: the portal and the info block.
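The info block is what lets a callback resume on top of the existing frames instead of overwriting them. The sketch below models it with a dictionary whose missing entries read as all zero, mirroring unbacked memory. The function name and the dictionary representation are illustrative, and returns are not modelled.

```python
from collections import defaultdict

# Info blocks live at computed addresses; unbacked memory reads as zero,
# so a never-used (thread, turf) pair yields an all-zero, i.e. empty, block.
info_blocks = defaultdict(lambda: {"tos": 0, "base": 0, "limit": 0})


def portal_switch(thread, from_turf, to_turf, stack_wkr):
    """Save the caller's stack WKR and return the callee's (toy model)."""
    info_blocks[(thread, from_turf)] = dict(stack_wkr)   # write the current WKR
    return dict(info_blocks[(thread, to_turf)])          # fetch the new one


# Thread 17 in turf 5 has used 0x120 bytes of its stacklet, then calls turf 9.
app = {"tos": 0x120, "base": 0, "limit": 0}
svc = portal_switch(17, 5, 9, app)        # first entry: turf 9's stacklet is empty
svc["tos"] = 0x40                         # the service pushes a frame
back = portal_switch(17, 9, 5, svc)       # callback into turf 5
print(hex(back["tos"]))                   # 0x120
```

The callback sees the saved top of stack, so its frames go above the application's existing ones.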

Page 56: Number seven of a series


Arguments

Some function arguments are passed in memory. The pass operation lets the callee see private caller data.

[Diagram: the app, in the app turf, calls write(fd, ptr, len) on the file service in the service turf. The app's buffer is made visible to the service by pass(ptr, len, r);]

pass() pushes a transient thread permission to the PLB

pass() regions are removed by the matching return()
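The lifetime of a passed region can be sketched with a context manager. This is a model of the behaviour described on the slide, not Mill code: the toy PLB is a set of tuples, and passed, write_svc and write are hypothetical names standing in for pass(), the file service and the run-time wrapper.

```python
from contextlib import contextmanager

plb = set()   # toy PLB: set of (thread, lwb, upb, rights) transient grants


@contextmanager
def passed(thread, ptr, length, rights):
    """Model pass(): grant for the duration of one call, revoke at return."""
    entry = (thread, ptr, ptr + length, rights)
    plb.add(entry)
    try:
        yield
    finally:
        plb.discard(entry)


def write_svc(thread, ptr, length):
    """Stand-in for the file service; it may read only what was passed."""
    ok = any(t == thread and lo <= ptr and ptr + length <= hi and "r" in r
             for t, lo, hi, r in plb)
    if not ok:
        raise PermissionError("buffer not passed")
    return length


def write(thread, ptr, length):
    with passed(thread, ptr, length, "r"):      # the wrapper hides pass()
        return write_svc(thread, ptr, length)
```

Once write() returns, the transient entry is gone, so a later attempt by the service to read the same buffer is refused.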

Page 57: Number seven of a series


pass() considered nuisance

Most pass() calls will be buried in APIs.

program says:
write(fd, ptr, len); // calls RTS

RTS says:
pass(ptr, len, r);
writeSvc(fd, ptr, len); // portal calls service

If many independent pieces must be passed (like a graph):

1. Allocate an arena
2. Build graph in arena
3. Pass whole arena

Page 58: Number seven of a series


Implicit pass()

[Diagram: Mill frame structure, with fp and sp marked.]

Explicit use of pass() works when the passed region is explicit in the source, but sometimes the callee must access implicitly passed data too: structs passed or returned by value; excess arguments; VARARGS, etc.

The args() operation reserves a portion of the frame for implicit arguments, setting the outp register.

args({size in bytes});
// initialization
portal call(…);

[Diagram labels: outp, inp.]

The portal call implicitly passes the region between inp and the old sp.

Page 59: Number seven of a series

Motivating example – buggy drivers

[Diagram: application, OS, driver, device. Labels: service, turfs.]

Page 60: Number seven of a series

[Diagram: application, OS, driver, device. Labels: service, service portal calls.]

Page 61: Number seven of a series

[Diagram: application, OS, driver, device. Label: service.]

Simple, clean – and cheap

Page 62: Number seven of a series


A caution

Mill security regions are very big – and few. They secure entire data spaces and programs, not objects and functions.

Fine-granularity security, in which individual objects and functions can be isolated, requires a different model.

We wish the Mill could support object-level security, but that would require non-standard, non-commodity memory, and would break nearly every C program.

We wouldn’t sell any.

’Tis true, ’tis pity. And pity ’tis, ’tis true. (Wm. Shakespeare)

Page 63: Number seven of a series


Summary #1:

The Mill:

• The Mill uses virtual caching and the Single Address Space model.

• There are no privileged operations and no supervisor state; all protection is by address.

• Security environments, called turfs, can include arbitrary regions of the address space, with arbitrary rights.

• A thread can grant a region to a turf or another thread with a subset of its own rights.

Page 64: Number seven of a series


Summary #2:

The Mill:

• Grants may be implicit or explicit, and can be revoked.

• Protection management uses OS tables and a hardware cache of region descriptors.

• Nearly all access has protection checked using one of five Well Known Regions held in hardware registers, rather than the more expensive general mechanism.

• A call across protection domains (portal call) costs two fetches more than a regular call.

Page 65: Number seven of a series


Summary #3:

The Mill:

• Both caller and callee are protected from each other.

• The return addresses of calls are safe from stack smashing exploits.

• The OS is yet another service.

• It costs nothing to remove device drivers from privileged state and the OS.

Page 66: Number seven of a series


Shameless plug

For technical info about the Mill CPU architecture:

http://ootbcomp.com/docs

To sign up for future announcements, white papers etc.:

http://ootbcomp.com/mailing-list