Structurally tackling input problems

Erik Poll

Digital Security

Radboud University Nijmegen

This week

Approaches to structurally root out / tackle input problems:

• LangSec (Language-theoretic Security)

• Using types to keep track of

– different kinds of data and

– different trust levels

As exemplified by Google’s Trusted Types initiative, which allows API hardening in the browser.

Two related, recurring themes:

• different languages/formats/protocols/notations/encodings/…

• parsing of such formats, notations, encodings, …


Story so far

Most security problems arise due to input:

• memory corruption (buffer overflows, NULL dereferencing, use-after-free, …)

• integer overflows

• race conditions aka TOCTOU aka non-atomic check & use

• ‘injection attacks’: SQL injection, path traversal, command injection, HTML injection, XSS, format string attacks, SSI injection, LDAP injection, XPath injection, XXE, deserialization attacks, Macros in Word and Excel, XML/Zip bombs, uploading .exe files, injecting PHP files, …

• …

Solutions so far (discussed last week)

We can prevent some input problems with

• Validation of inputs

• Sanitisation of inputs or outputs

• Making sure to do canonicalisation whenever we do this

• Better still: use safe(r) APIs

– eg Prepared Statements aka Parameterised Queries

Of course, in addition to trying to prevent problems,

we should also try to detect them and mitigate potential impact
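The point about canonicalisation can be made concrete with path traversal. Below is a minimal sketch, assuming a hypothetical upload directory BASE_DIR and a hypothetical helper resolve_upload: the path is first brought into canonical form and only then validated. The normalisation is purely lexical, so symbolic links are not covered by this sketch.

```python
import posixpath

BASE_DIR = "/srv/app/uploads"  # hypothetical directory the application may serve


def resolve_upload(user_path: str) -> str:
    """Canonicalise first, then validate the canonical form."""
    # normpath collapses "a/../b" and "//" lexically; it does not follow symlinks.
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, user_path))
    # Validate the canonical path, not the raw input.
    if posixpath.commonpath([BASE_DIR, candidate]) != BASE_DIR:
        raise ValueError("path escapes the upload directory")
    return candidate
```

Checking the raw input for a forbidden prefix would miss "../../etc/passwd", because the dangerous form only shows up after canonicalisation.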


Last week: validation ≠ sanitisation

• Sanitisation (eg replacing < with &lt;) is a compensation for a weakness in the design

– Need to sanitise comes from the choice to use certain technologies/APIs, eg SQL, XML, HTML, JavaScript

– Need is external to the use case, but intrinsic to the technologies/APIs used

• Validation (eg rejecting 31/11/2021 as a date) is not such a compensation

– Need to reject invalid data stems from the use case/application

• So validation of input is needed even if our APIs are immune to injection attacks

– Need is inherent to the use case, external to the software
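The difference can be sketched in a few lines. The function names validate_date and sanitise_for_html are hypothetical; the first rejects data that makes no sense for the use case, the second only compensates for the fact that the output happens to be HTML.

```python
import html
from datetime import date, datetime


def validate_date(text: str) -> date:
    """Validation: reject input that is not a real date (use-case requirement)."""
    # strptime raises ValueError for impossible dates such as 31/11/2021.
    return datetime.strptime(text, "%d/%m/%Y").date()


def sanitise_for_html(text: str) -> str:
    """Sanitisation: make text harmless for one specific technology, here HTML."""
    return html.escape(text)
```

A date field still needs validate_date even when every API downstream is immune to injection, while sanitise_for_html is only needed because of the choice to emit HTML.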


Two classes of input flaws


The I/O attacker model ( = ‘hacking’)

Garbage In, Garbage Out (GIGO) quickly becomes

Malicious Garbage In, Security Incident Out

Attacker goals:

• Remote code execution, DoS, or anything in between

Means: exploiting unwanted behaviour, which can be

– weird, buggy behaviour (eg buffer overflow overwriting return address, integer overflow, ROP, …)

– unwanted access to/triggering of normal behaviour (eg SQLi, XSS, Word Macros, …)

[Figure: malicious input enters the application via I/O]

Two types of security flaws: bugs vs features

1. Processing Flaws: a bug! Malicious input hits the application itself, eg buffer overflow in PDF viewer

2. Forwarding/Injection Flaws: (abuse of) a feature! Malicious input is forwarded by the application to a back-end service, eg as an SQL query

Two types of input problems

1. Buggy parsing & processing

– Bug in processing input causes application to go off the rails

– Classic example: buffer overflow in a PDF viewer, leading to remote code execution

This is unintended behaviour, introduced by mistake

2. Flawed forwarding (aka injection attacks)

– Input is forwarded to back-end service/system/API, to cause damage there

– Classic examples: SQL injection, path traversal, XSS, Word macros

This is intended behaviour of the back-end, introduced deliberately, but exposed by mistake by the front-end

Parsing is always involved

[Figure: 1. Processing Flaws: buggy parsing of malicious input by the application, eg. of a pdf file. 2. Forwarding/Injection Flaws: unwanted parsing by the back-end service, eg. of user input as SQL.]

[Figure: the application has a parser for its input; when forwarding it sometimes does unparsing, and the back-end service then sometimes does additional parsing.]

Theme: (not) treating user input as code

Key observation: safe APIs such as Parameterised Queries

• prevent treating user data as ‘commands’

• do not parse/interpret/execute user data as some form of code

– In a classic buffer overflow user data is treated as code, as user-supplied shell code is executed as binary code.

– Word macros are clearly commands, but unzipping is also a limited form of execution

ANTI-PATTERN: Treating user data as commands/CODE

More back-ends, more languages, more problems

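[Figure: malicious input reaches the web server and is forwarded to many back-ends: SQL injection into the SQL database, command injection into the OS, path traversal into the file system, format string attacks on the C library, and XSS in the web browser]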

LangSec (language-theoretic security)

• Interesting look at root causes of large class of security problems,

namely problems with input

• Useful suggestions for dos and don’ts

• The ‘Lang’ in ‘LangSec’ refers to input languages, not programming languages.

Though some input languages will include programming languages…

[Sergey Bratus & Meredith Patterson, ‘The science of insecurity’, CCC 2012]

Motivation: the never ending story where

• Attackers keep discovering

– new bugs or features to exploit

– new input possibilities or data flows to trigger these

– new ways to by-pass protection mechanisms

• Defenders keep

– patching bugs

– adding or tweaking protection mechanisms

• adding validation and sanitisation

• adjusting allow- & deny-lists used to validate or sanitise

• adding VPNs, firewalls, WAF, anti-virus, …

– adding features

Can’t we write code and use components, technologies, and APIs that are inherently robust & secure, by construction?


Example issue with sanitisation aka escaping aka encoding

Example: Chrome used to crash on the URL http://%%30%30

• %30 is the URL-encoding of the character 0

• So %%30%30 is the URL-encoding of %00

• %00 is the URL-encoding of null character

So %%30%30 is a double-encoded null character

Some code deep inside Chrome performed a second URL-decoding (as a well-intended ‘service’ to its client code?) and then some other code crashed on the resulting null character.

How could this bug have been detected, statically or dynamically?

Or prevented by better design?

Moral: having encoded data around makes validation harder!

Note that encoding is the opposite of canonicalisation:

it introduces different representations of the same data.
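The double-encoding effect is easy to reproduce. The sketch below uses Python's urllib.parse.unquote as a stand-in for the URL-decoding inside a browser, and a hypothetical helper decode_once that decodes exactly once and rejects data that would still change under a second decoding. This is one possible policy, not what Chrome does.

```python
from urllib.parse import unquote


def decode_once(encoded: str) -> str:
    """URL-decode exactly once; reject data that is still URL-encoded afterwards."""
    decoded = unquote(encoded)
    if unquote(decoded) != decoded:
        raise ValueError("input is encoded more than once")
    return decoded


once = unquote("%%30%30")   # "%00": a check for null characters still passes here
twice = unquote(once)       # "\x00": the null character only appears now
```

The policy has a cost: legitimate data that contains a literal %41 after decoding is rejected too. That is the price of insisting on a single representation.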

Example: surprising complexity & expressivity – file names

Windows supports many notations for file names

• classic MS-DOS notation C:\MyData\file.txt

• file URLs file:///C|/MyData/file.txt

• UNC (Uniform Naming Convention) \\192.1.1.1\MyData\file.txt

which can be combined in fun ways, eg file://///192.1.1.1/MyData/file.txt

Some notations cause unexpected behaviour by involving other protocols, eg

• UNC paths to remote servers are handled by SMB protocol

• SMB sends password hash to remote server to authenticate: pass the hash

This can be exploited by SMB relay attacks:

– CVE-2000-0834 in Windows telnet

– CVE-2008-4037 in Windows XP/Server/Vista

– CVE-2016-5166 in Chromium

– CVE-2017-3085 & CVE-2016-4271 in Adobe Flash

– ZDI-16-395 in Foxit PDF viewer

[Example thanks to Björn Ruytenberg, https://blog.bjornweb.nl]

Root cause: The Tower of Babel

A typical interaction with software, say on the Web, involves

many languages, formats, and protocols:

HTTP, HTML, CSS, JavaScript, URLs

DNS, TLS, X509 certificates, TCP/IP (IPv4 or IPv6),

jpeg, mpeg, mp4, png, gif, pdf, .docx,

user names, email addresses, .ics ,

file names, directories, OS commands,

SQL, LDAP, JSP, PHP, XML, JSON, …

ASCII, Unicode, UTF-8, …

Ethernet, Wifi, Bluetooth, GSM/3G/4G/5G, …

Some handled – parsed - by app or browser,

some by lower protocol layers,

some by external programs & services

This provides a HUGE attack surface of HUGE complexity

Data gets en/decoded or (un)parsed as it moves through the technology stack

[Figure: an App sends information, eg the Date ‘19th of November’, to a Server. In the App the Date is serialised/unparsed/pretty printed (eg with toString) and passed down through HTTP, TLS, TCP/IP and Wifi / 4G. At the Server it comes up through Ethernet, TCP/IP, TLS and HTTP and is de-serialised/parsed back into a Date, and may be passed on to the database, OS or file system. Components such as the HTML renderer, image library and pdf viewer parse data too. Every layer is attack surface, for eg buffer overflow or injection.]

Root causes identified by LangSec

1. (Too) many languages & formats

2. These languages are often complex & unclearly defined and combined

3. The code handles all these languages & formats in a sloppy way,

– as the success of fuzzing demonstrates

– as the prevalence of memory corruption & injection attacks shows

Processing input

Processing involves

1) parsing/lexing

2) interpreting/executing

Eg interpreting a string as a filename, URL, or email address, or executing an OS command, a piece of JavaScript, or an SQL statement

This relies on some language or format

Step 1) above relies on syntax of this language

Step 2) above relies also on semantics of this language


Processing input is dangerous!

Different ways for an attacker to abuse input

• wasting resources (e.g. a zip-bomb)

• crashing things (and causing DoS)

• abusing unwanted functionality that is accidentally exposed

Such functionality can be

– Normal functionality of eg. SQL database or the OS,

– Bizarre, buggy functionality caused by eg. a buffer overflow

Buggy processing of inputs provides a weird machine that the attacker can

‘program’ to abuse the system

Classic example: ROP (return oriented programming)


Example problem: combining languages

X509 certificates involve several languages & formats.

Differences in interpretation caused various security flaws:

• Multiple Common Names

Eg certificate for facebook.com, mafia.com

Handled differently in different browsers. Why is this even allowed?

• ASN.1 attacks

A null terminator in an ASN.1 BER-encoded string in a CommonName can trick a CA into issuing a certificate for unauthorized parties

• PKCS#10-tunneled SQL injection

You could have an SQL command inside a BMPString, UTF8String or UniversalString used as PKCS#10 Subject Name

[Dan Kaminsky, Meredith L. Patterson, and Len Sassaman,

PKI Layer Cake: New Collision Attacks Against the Global X.509 Infrastructure]


Anti-pattern: shotgun parsers

Handwritten code that incrementally parses & interprets input, in a

piecemeal fashion

An example shotgun parser: spot the defect

char buf1[MAX_SIZE], buf2[MAX_SIZE];
// make sure url is valid URL and fits in buf1 and buf2:
if (!isValid(url)) return;
if (strlen(url) > MAX_SIZE - 1) return;
// copy url excluding spaces, up to first separator, ie. first '/', into buf1
out = buf1;
do { // skip spaces
    if (*url != ' ') *out++ = *url;
} while (*url++ != '/');
strcpy(buf2, buf1);
...

[Code sample from presentation by Jon Pincus]

The defect: the loop fails to terminate for URLs without a '/', so it keeps reading past the end of url and may write past the end of buf1. Exploited by the Blaster worm.

Why not parse the url in one go into some URL object or datatype? Eg as part of the isValid() method?
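What "parse in one go into a URL object" could look like, as a minimal sketch: ParsedUrl and parse_url are hypothetical names, and Python's urllib.parse.urlsplit stands in for a properly specified parser. Recognition happens completely up front; later code only ever sees the structured result and never re-scans the raw string.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParsedUrl:
    """Structured result of parsing; later code never touches the raw string."""
    scheme: str
    host: str
    path: str


def parse_url(text: str) -> ParsedUrl:
    """Parse and validate in one step; reject anything that is not an http(s) URL."""
    parts = urlsplit(text)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("not an acceptable URL")
    return ParsedUrl(parts.scheme, parts.hostname, parts.path or "/")
```

A URL without any '/' after the host is no longer a special case that a copy loop has to survive: it either parses to a ParsedUrl with path "/" or is rejected.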

Root causes identified by LangSec (revisited)

Obstacles / anti-patterns in producing code without input

vulnerabilities

• Input languages that are too complex

– often caused by unchecked development of input languages,

with eg. standards evolving and adding new features over time

• Ad-hoc & imprecise notion of validity

– causing parser differentials

eg web-browsers parsing X509 certificates in different ways

• Mixing input recognition & processing in shotgun parsers

All this results in weird machines


LangSec principles to prevent input problems

No more hand-coded shotgun parsers, but

1. precisely defined input languages

eg with EBNF grammar

2. generated parser

3. complete parsing before processing

Also, don’t substitute strings & then parse,

but parse & then substitute in parse tree

(eg parameterised query instead of dynamic SQL)

4. keep the input language simple & clear

So that bugs are less likely

So that you give minimal processing power to attackers


Preventing input problems the LangSec way

[Figure: input → parser]

LangSec approach:

• Simple & clear language spec

• Generated parser code

• Complete parsing before processing

Weird machine = the strange functionality accidentally exposed by code that (incorrectly) processes input

Attackers can ‘program’ this weird machine with their

malicious input!

Minimise the resources & computing power that input handling gives

to attackers
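One way to limit the resources input handling gives to an attacker is to bound the work done per input. Below is a minimal sketch for the zip-bomb case, assuming a hypothetical limit MAX_OUTPUT and helper bounded_decompress, and using zlib streams rather than a full zip archive.

```python
import zlib

MAX_OUTPUT = 1_000_000  # hypothetical limit on decompressed size, in bytes


def bounded_decompress(data: bytes, limit: int = MAX_OUTPUT) -> bytes:
    """Decompress a zlib stream, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    # Ask for at most limit + 1 bytes: one extra byte is enough to detect overflow.
    output = decompressor.decompress(data, limit + 1)
    if len(output) > limit:
        raise ValueError("decompressed data exceeds limit")
    if not decompressor.eof:
        raise ValueError("truncated or incomplete stream")
    return output
```

A few kilobytes of compressed zeros can expand to many megabytes. With the bound in place the attacker's input can cost at most limit + 1 bytes of output.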


All parsers should be equivalent.

And parsers should be the exact inverse of the pretty printers /

unparsers
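The "exact inverse" requirement can be stated as a round-trip property and tested. Below is a minimal sketch for one tiny input language, dates written as YYYY-MM-DD; parse_date and unparse_date are hypothetical names.

```python
import re
from datetime import date

# ASCII digits only: \d would also accept other Unicode digits, which
# unparse_date could not reproduce, breaking the inverse property.
_ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date(text: str) -> date:
    """Accept exactly YYYY-MM-DD and nothing else."""
    match = _ISO_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a date: {text!r}")
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)  # raises ValueError for eg 2021-11-31


def unparse_date(value: date) -> str:
    """Pretty printer: every date has exactly one textual form."""
    return f"{value.year:04d}-{value.month:02d}-{value.day:02d}"
```

For every accepted text, unparse_date(parse_date(text)) gives back the same text, and for every date, parse_date(unparse_date(d)) gives back the same date. A parser that also accepted 2021-1-5 would lose the first property.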


Tackling injection attacks (aka forwarding flaws)

Recall forwarding flaws

Anti-patterns & patterns ?

[Figure: malicious input is forwarded by the application to a back-end service, eg as an SQL query: (abuse of) a feature!]

Anti-pattern: input escaping

• Input escaping, ie. processing inputs to escape dangerous meta-characters, is a bad idea

– at the point of input, the context in which inputs will be used (eg as path name, in SQL query, or as HTML) is unclear, and different contexts require different solutions

• Output escaping makes more sense, because there the context is known

– but there it can be unclear which data originates from input; we come back to that later

[Figure: application with a parser in front of a back-end service. Input validation: rejecting invalid input. Output sanitisation: aka escaping to make output harmless.]

Anti-pattern: string concatenation

• Standard recipe for security disaster:

1. concatenate several pieces of data, some of them user input,

2. pass the result to some API

• Classic example: SQL injection

• Note: string concatenation is inverse of parsing


Anti-pattern: strings

The use of strings in itself is already troublesome

be it char*, char[], String, string, StringBuilder, ...

• Strings are useful, because you use them to represent many things:

eg. name, file name, email address, URL, bit of SQL, HTML,…

• This also makes strings dangerous:

1. Strings are unstructured & unparsed data, and processing often involves some interpretation (incl. parsing)

2. The same string may be handled & interpreted in many

– possibly unexpected – ways

3. A single string parameter in an API call can – and often does –

hide a very expressive & powerful language


Remedy: Parameterised queries

Note that parameterised queries

• reduce the expressive power of the interface to the back-end

• avoid unparsing in front-end

– and hence avoid re-introducing parsing in back-end

• replace an API call that takes a single string as argument
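A minimal sketch of the difference, using Python's sqlite3 module with an in-memory database. The table, its contents and the two lookup functions are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def lookup_concatenated(name: str) -> list:
    """Anti-pattern: unparse into one string, which the back-end parses again."""
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def lookup_parameterised(name: str) -> list:
    """Remedy: the query is fixed; the user data is passed separately as a value."""
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
```

With payload as input, lookup_concatenated returns the secrets of all users, because the quote in the input ends the string literal and the rest is parsed as SQL. lookup_parameterised returns nothing, because the same input is only ever compared as a name.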


Remedy: Types (1) to distinguish languages

• Instead of using strings for everything,

use different types to distinguish different kinds of data

Eg different types for HTML, URLs, file names, user names, …

but also URL-encoded vs URL-decoded data,

HTML-encoded vs HTML-decoded data, etc.

• Advantages

– Types provide structured data

– No ambiguity about the intended use of data
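A minimal sketch of using types to tell kinds of data apart. PlainText, SafeHtml, escape and render_paragraph are hypothetical names. Python does not enforce type annotations at run time, so the sink checks the type explicitly; in a statically typed language the compiler would do this.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainText:
    """Text that has not been HTML-encoded."""
    value: str


@dataclass(frozen=True)
class SafeHtml:
    """HTML that is already encoded or otherwise known to be harmless."""
    value: str


def escape(text: PlainText) -> SafeHtml:
    """The only conversion from PlainText to SafeHtml in this sketch."""
    return SafeHtml(html.escape(text.value))


def render_paragraph(body: SafeHtml) -> SafeHtml:
    """Sink that only accepts SafeHtml, never a bare string."""
    if not isinstance(body, SafeHtml):
        raise TypeError("render_paragraph needs SafeHtml, not " + type(body).__name__)
    return SafeHtml("<p>" + body.value + "</p>")
```

Passing a bare string to render_paragraph fails instead of silently being treated as HTML. The sketch does not stop code from wrapping arbitrary strings in SafeHtml directly; a real design would restrict who may construct it.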


Remedy: Types (2) to distinguish trust levels

• Use information flow types to track the origins of data

and/or to control destinations

– Eg distinguish untrusted user input vs compile-time constants

The two uses of types, to distinguish (1) languages or (2) trust levels,

are orthogonal and can be combined.
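Tracking trust levels can be sketched in the same style. Untrusted, html_sink and render are hypothetical names, and the sketch assumes that every input source wraps its data in Untrusted, so that bare strings only ever come from constants in the program text.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Untrusted:
    """Data that originates from outside the program, eg user input."""
    raw: str


def html_sink(fragment: object) -> str:
    """Constants written by the programmer pass as-is; untrusted data is escaped."""
    if isinstance(fragment, Untrusted):
        return html.escape(fragment.raw)
    if isinstance(fragment, str):
        return fragment
    raise TypeError("unsupported fragment type: " + type(fragment).__name__)


def render(*fragments: object) -> str:
    """Build HTML from a mix of trusted constants and untrusted values."""
    return "".join(html_sink(fragment) for fragment in fragments)
```

For example, render("<p>Hello, ", Untrusted("<script>x</script>"), "</p>") keeps the markup from the constants and escapes the markup in the user data. The origin of each piece decides how it is treated, not the programmer's memory of where it came from.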


Example: fighting XSS


HTML injection & XSS

HTML injection: sneaking malicious payload into a victim’s browser & letting it be rendered as HTML

XSS: special case of HTML injection, where the payload is or contains JavaScript, which is executed by the JavaScript engine in the victim’s browser

For XSS it is normal that input is forwarded multiple times before it

does its damage, in both reflected and stored XSS attacks

• most XSS attacks are at least 2nd order


Reflected, stored & DOM-based XSS

• Reflected XSS

Malicious payload goes back & forth: from the victim’s machine to the web server and back to the JavaScript engine in the victim’s browser

• Stored XSS

Attacker lets the web server store their malicious payload; the server passes it on to the victim – another user of the same website – later

• DOM-based XSS

The malicious payload is passed around by JavaScript inside

browser/app, using DOM API and other APIs, to ultimately end up

in a place where it is rendered as HTML or executed as JavaScript


Reflected XSS attack

Attacker crafts malicious URL containing JavaScript

https://google.com/search?q=<script>...</script>

and tempts victim to click on this link

[Figure: the browser sends the malicious URL to the web server; the web server sends back an HTML response containing <script> </script> to the browser]

Stored XSS attack

Attacker injects HTML, incl. scripts, into a web site. It is stored at that web site and echoed back later when a victim visits the same site.

[Figure: malicious input → web server → database; later the web server sends HTML containing the malicious content to the browser of another user of the same website]

DOM-based XSS

Attacker somehow (reflected or stored) feeds malicious input as parameters into JavaScript code in the victim’s browser, where it ultimately ends up being interpreted as HTML / executed as JavaScript via the DOM API

[Figure: inside the browser, lots of JavaScript (library.js) passes data to the DOM API]

Background: common encodings on the web

• HTML encoding

< > & " ' replaced by &lt; &gt; &amp; &quot; &#39;

• URL encoding aka %-encoding

/ ? = % # replaced by %2F %3F %3D %25 %23

space replaced by %20 or +

Try this out at https://duckduckgo.com/?q=%2F+%3F%3D

• JavaScript string literal encoding

' replaced by \'

Eg 'this is a JS string with a \' in the middle'

• CSS encoding

• Unicode encoding

• base 64 encoding

• …
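The first two encodings can be tried out with Python's standard library, as a small illustration. Note that html.escape writes the single quote as &#x27;, which is the hexadecimal form of &#39;.

```python
import html
from urllib.parse import quote, quote_plus, unquote_plus

# HTML encoding of the five special characters < > & " '
html_encoded = html.escape("<>&\"'")

# URL encoding (%-encoding); safe="" means that "/" is encoded as well
url_encoded = quote("/?=%#", safe="")

# Space can be written as %20 or as +, depending on the function used
space_as_percent = quote("/ ?=", safe="")
space_as_plus = quote_plus("/ ?=")

# Decoding gives back the original data
round_trip = unquote_plus(space_as_plus)
```

Here space_as_plus is the query string from the DuckDuckGo URL above, %2F+%3F%3D.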


Example modern website code

Code from a photo sharing website, with the user-chosen album name in the variable name

var escapedName = goog.string.htmlEscape(name); // HTML-encoding

var jsEscapedName = goog.string.escapeString(escapedName); // JS string literal encoding

elem.innerHTML = '<a onclick="createAlbum(\' ' + jsEscapedName + '\')">' + escapedName + '</a>';

Idea:

• HTML-encoding with htmlEscape gets rid of scripts in malicious album name

• JavaScript string literal encoding with escapeString makes sure album name

John's Birthday works ok when fed as JS string to JS function createAlbum

[Example from Christoph Kern, Securing the Tangled Web, CACM 2014]

In the last line, escapedName is rendered as HTML, and jsEscapedName is the JS string argument to createAlbum( … )

Spot the XSS bug!

var escapedName = goog.string.htmlEscape(name); // HTML-encoding

var jsEscapedName = goog.string.escapeString(escapedName); // JS string literal encoding

elem.innerHTML = '<a onclick="createAlbum(\' ' + jsEscapedName + '\')">' + escapedName + '</a>';

Attack: malicious name ');attackScript();//

HTML-escaped this becomes &#39;);attackScript();//

JS-escaped this remains &#39;);attackScript();//

So innerHTML becomes

<a onclick="createAlbum(' &#39;);attackScript();// ')">&#39;);attackScript();//</a>

The browser HTML-unescapes value of onclick attribute before evaluation as JS

createAlbum(' ');attackScript();//')

so attackScript(); will be executed
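The attack can be replayed outside the browser. In the sketch below, html.escape and html.unescape stand in for goog.string.htmlEscape and the browser's decoding of attribute values, and js_string_escape is a simplified, hypothetical stand-in for goog.string.escapeString that only handles backslashes and single quotes.

```python
import html


def js_string_escape(text: str) -> str:
    """Simplified JS string-literal escaping: only backslash and single quote."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def onclick_as_seen_by_js(name: str, html_first: bool) -> str:
    """Return the JavaScript the browser evaluates for the onclick attribute."""
    if html_first:   # order used in the example: HTML-escape, then JS-escape
        encoded = js_string_escape(html.escape(name, quote=True))
    else:            # reverse order: JS-escape, then HTML-escape
        encoded = html.escape(js_string_escape(name), quote=True)
    attribute = "createAlbum('" + encoded + "')"
    return html.unescape(attribute)   # the browser HTML-decodes attribute values


attack = "');attackScript();//"
buggy = onclick_as_seen_by_js(attack, html_first=True)
fixed = onclick_as_seen_by_js(attack, html_first=False)
```

In buggy, the quote hidden inside the HTML entity reappears after decoding and closes the string literal, so attackScript() is a separate statement. In fixed, the quote is still preceded by a backslash after decoding, so it stays inside the string. Under the assumptions of this sketch, encodings have to be applied in the reverse order of the decodings that will follow.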

[Example from Christoph Kern, Securing the Tangled Web, CACM 2014]

Why XSS is so hard to prevent

• Many sources & sinks, with complex data flows between them

• Many different types of data

• URLs, URL parameters, HTML snippets, JavaScript snippets, JavaScript strings, …

with different trust levels

• safe HTML without scripts (ie. already validated and/or escaped)

• unsafe HTML, possibly with scripts

• HTML with scripts that we trust

and different forms of encoding (aka escaping)

• URL-encoding

• HTML-encoding

• JavaScript string-literal encoding

Moral of this example

• Complex data flows are impossible to keep track of:

– Which strings have been escaped/encoded/validated, and how?

– Which strings can be trusted and which can come from an

attacker?

• Programmer cannot keep track of this

• Compiler & type-checker could keep track of this

– but then it needs more type information


Trusted Types for DOM Manipulation

Google’s Trusted Types initiative [https://github.com/WICG/trusted-types]

replaces string-based DOM API with a typed API

– using types such as TrustedHTML, TrustedScript, TrustedScriptURL, …

– ‘hardened’ aka ‘safe’ APIs for back-ends which auto-escape inputs or reject untrusted inputs

Released as a Chrome browser feature

[https://developers.google.com/web/updates/2019/02/trusted-types]

Beyond types: extending programming language

Wyvern programming language by Jonathan Aldrich et al.

allows domain-specific extensions, eg

where HTML and SQL are ‘built-in’ types of the programming language

Added advantage over types: more convenient syntax

[D. Kurilova et al, Wyvern: Impacting Software Security via

Programming Language Design, PLATEAU 2014, ACM]


Conclusions

Many security problems arise in handling input:

• buggy parsing

• unintended parsing due to forwarding

Ironically, parsing is a well-understood area of computer science

Constructive remedies to tackle this

• Have clear, simple & well-specified input languages

• Generate parser code

• Don’t use strings

• Do use TYPES, to distinguish languages & trust levels

• Have ‘safe’ or ‘hardened’ APIs, that are immune to injection and

prevent untrusted input being used incorrectly


To read

• The LangSec manifesto LangSec: Recognition, Validation, and Compositional Correctness for Real World Security

• Wang et al., If It's Not Secure, It Should Not Compile: Preventing DOM-Based XSS in Large-Scale Web Development with API Hardening, ICSE'21, ACM/IEEE, 2021

• Poll, Strings considered harmful, USENIX ;login:, 2018
