A Deep Dive into Clojure's Data Structures - EuroClojure 2015

Post on 03-Aug-2015

What Lies Beneath

Mohit Thatte

EUROCLOJURE 2015, Barcelona

A Deep Dive into Clojure’s data structures

@mohitthatte @pastafari

A DAY IN THE LIFE

Image: User:Joonspoon Wikimedia Commons

Programs that use Maps

Map API

Map Implementation

Primitives (JVM, et al)

TOWERS OF ABSTRACTION

“Any sufficiently advanced data structure is indistinguishable from magic”

- Me

With apologies to Arthur Clarke

IMMUTABILITY IS GOOD

PERFORMANCE IS NECESSARY

By U.S. Navy photo [Public domain], via Wikimedia Commons

IMMUTABILITY

PERF

Image: Maj. Gen. William Anders, Apollo 8

“… functional programming’s stricture against destructive updates (assignments) is a staggering handicap, tantamount to confiscating a master chef’s knives.”

- Chris Okasaki

ABSTRACT DATA TYPE

enqueue - add an element to the end

head - first element

tail - remaining elements

QUEUE

INTERFACE INVARIANTS

NAME
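A minimal REPL sketch of the queue ADT above, using Clojure's built-in persistent queue. Mapping enqueue, head and tail onto conj, peek and pop is an assumption made here for illustration, and q is a hypothetical name.

```clojure
;; clojure.lang.PersistentQueue has no literal syntax, so start from EMPTY.
(def q (conj clojure.lang.PersistentQueue/EMPTY 1 2 3)) ; enqueue three elements

(peek q)       ; head: 1
(seq (pop q))  ; tail: (2 3)
(seq q)        ; q itself is unchanged: (1 2 3)
```

Every operation returns a new queue, so the invariants can be checked against values that never change underneath the caller.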

THE CHALLENGE

Correct

Performant

ImmutableX

CHALLENGE ACCEPTED

Structural Sharing

KEY IDEAS

Structural Bootstrapping

Hybrid Structures

STRUCTURAL SHARING

:a :b :c :d :e

(assoc v 2 :zz)

:a :b :zz

STRUCTURAL SHARING

:c

:a

:d

:f

:m

(assoc v 4 :zz)

:e :b

:d

:f

:zz
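Structural sharing can be observed from the REPL. The sketch below assumes a hypothetical two-level tree built from plain vectors (not the internal node classes): an update copies only the nodes on the path to the changed element and reuses everything else.

```clojure
;; tree is a hypothetical two-level structure: a root vector of leaf vectors.
(def tree [[:a :b] [:c :d] [:e :f]])

;; "Change" one leaf element; assoc-in copies only the nodes on the path.
(def tree2 (assoc-in tree [1 0] :zz))

tree                                     ; [[:a :b] [:c :d] [:e :f]]
tree2                                    ; [[:a :b] [:zz :d] [:e :f]]
(identical? (nth tree 0) (nth tree2 0))  ; true: the untouched leaf is shared
(identical? (nth tree 1) (nth tree2 1))  ; false: the leaf on the path was copied
```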

Image: Alan Levine

STRUCTURAL DECOMPOSITION

Image: Alan Chia (Lego Color Bricks)

HYBRID STRUCTURES

LET’S DIVE IN!

'(1 2 3) Lists: Code manipulation

[1 2 3] Vectors: All things sequential

{:a 1 :b 2} Maps: Structured Data

#{\a \e \i \o \u} Sets: Ermm, Sets

CLOJURE DATA STRUCTURES

MAPS

GET - get value for given key

ASSOC - add key, value to map

DISSOC - remove key, value from map

MERGE - merge two maps together

THE MAP INTERFACE
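The four operations as they look at the REPL. The map m and its keys are hypothetical example data.

```clojure
(def m {:a 1 :b 2})

(get m :a)               ; 1
(get m :z)               ; nil, the key is absent
(assoc m :c 3)           ; {:a 1, :b 2, :c 3}
(dissoc m :a)            ; {:b 2}
(merge m {:b 20 :d 4})   ; {:a 1, :b 20, :d 4}, the right-hand map wins on :b
m                        ; still {:a 1, :b 2}
```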

WHAT MAKES A GOOD MAP?

Constant time operations independent of number of keys

Efficient space utilization even with mutation

Objects as keys, Objects as values

IDEAS

ARRAYS

IDEA #1

:a 1 :b 2 :c 3

KEY VALUE PAIRS

NOT A GREAT MAP!

Time complexity O(n)

Space efficiency NO

Objects as keys YES
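A minimal sketch of why the flat key/value array is O(n): a lookup has to scan the pairs one by one. The function name pairs-get is hypothetical.

```clojure
;; kvs is a flat vector laid out as [k0 v0 k1 v1 ...].
(defn pairs-get [kvs k]
  (loop [i 0]
    (when (< i (count kvs))
      (if (= (nth kvs i) k)
        (nth kvs (inc i))        ; value sits right after its key
        (recur (+ i 2))))))      ; skip to the next key

(pairs-get [:a 1 :b 2 :c 3] :b)  ; 2
(pairs-get [:a 1 :b 2 :c 3] :z)  ; nil after scanning every pair
```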

HOW DO WE DO BETTER?

Image: www.pooktre.com

TREES TO THE RESCUE

Ramon Llull, Catalunya c. 1250

Arbol de ciencia

IDEA #2

BINARY SEARCH TREE

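Diagram: a binary search tree rooted at key 13, with keys 1, 6, 8, 11, 13, 15, 17, 22, 25, 27, each paired with a letter value; a second diagram shows the nodes 13, 17, 25 and 27.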

NOT A GREAT MAP!

Time complexity worst case O(n)

Space efficiency POSSIBLY

Objects as keys YES
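The O(n) worst case shows up when keys arrive in sorted order. The sketch below assumes a naive, unbalanced binary search tree built from plain maps; bst-insert and height are hypothetical helper names.

```clojure
(defn bst-insert [node k v]
  (cond
    (nil? node)       {:key k :val v :left nil :right nil}
    (< k (:key node)) (assoc node :left  (bst-insert (:left node) k v))
    (> k (:key node)) (assoc node :right (bst-insert (:right node) k v))
    :else             (assoc node :val v)))

(defn height [node]
  (if (nil? node)
    0
    (inc (max (height (:left node)) (height (:right node))))))

(defn build [ks]
  (reduce (fn [t k] (bst-insert t k k)) nil ks))

;; Same 15 keys, different insertion order.
(height (build (range 1 16)))                            ; 15, a linked list in disguise
(height (build [8 4 12 2 6 10 14 1 3 5 7 9 11 13 15]))   ; 4, perfectly balanced
```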

How do we keep our trees in ‘balance’?

IDEA #3

BALANCED BINARY SEARCH TREES

RED BLACK TREES

ALWAYS BALANCED, 100% MONEY BACK GUARANTEE

Guibas, Sedgewick 1978

RED BLACK TREES

Root is black

Every path from root to an empty node contains the same number of black nodes

Every node is colored red or black

No red node can have a red child

RED BLACK TREES

Okasaki ‘96

A PRETTY GOOD MAP!

Time complexity O(log_2 N)

Space efficiency YES

Objects as keys YES

Clojure’s sorted-maps are Red Black Trees

CONSTRAINTS

KEYS MUST BE COMPARABLE

KEYS ARE COMPARED AT EVERY NODE, THIS CAN BE EXPENSIVE
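Both points can be seen with sorted-map at the REPL. The keys and values below are hypothetical; the second expression mixes a keyword and a string, which have no common ordering.

```clojure
(def sm (sorted-map 13 :a 8 :f 17 :r))

(keys sm)    ; (8 13 17), entries come back in key order
(first sm)   ; [8 :f]

;; Keys that cannot be compared with each other are rejected.
(def result
  (try
    (sorted-map :a 1 "b" 2)
    (catch ClassCastException _ :keys-not-comparable)))

result       ; :keys-not-comparable
```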

IDEA #4

TRIE - SEARCH BY DIGIT

t a p

LEVEL 0

LEVEL 1

LEVEL 2

next(node, symbol)

FINITE STATE MACHINE

Symbols #{a..z}

Nodes, Edges

TRIE IMPLEMENTATIONS

Associate each symbol with an offset, e.g. a=0, b=1, …

LOOKUP TABLES

next = lookup(node, offset)

Fast and space efficient trie searches, Bagwell 2000
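A minimal sketch of a trie with lookup-table nodes, assuming lowercase a..z keys and a 26-slot vector per node. All names here (empty-node, offset, trie-add, trie-get) are hypothetical; the point is that every node pre-allocates a slot for every symbol, used or not.

```clojure
(def empty-node {:value nil :children (vec (repeat 26 nil))})

(defn offset [ch] (- (int ch) (int \a)))   ; a=0, b=1, ...

(defn trie-add [node word v]
  (if (empty? word)
    (assoc node :value v)
    (let [i     (offset (first word))
          child (or (get-in node [:children i]) empty-node)]
      (assoc-in node [:children i] (trie-add child (rest word) v)))))

(defn trie-get [node word]
  (cond
    (nil? node)   nil
    (empty? word) (:value node)
    :else         (recur (get-in node [:children (offset (first word))])
                         (rest word))))

(def t (-> empty-node (trie-add "tap" 1) (trie-add "tea" 2)))

(trie-get t "tap")                     ; 1
(trie-get t "te")                      ; nil, a prefix with no value
(count (remove nil? (:children t)))    ; 1, so 25 of the root's 26 slots are empty
```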

ADD

NOT A GREAT MAP!

Time complexity O(log_m N)

Space efficiency NO

Objects as keys NO

How do we avoid null nodes?

IDEA #4

BST + TRIE = TST

Bentley, Sedgewick 1998

Fast and space efficient trie searches, Bagwell 2000

ADD

A DECENT MAP

Time complexity ~O(log_2 N)

Space efficiency YES

Objects as keys NO

No null nodes, but can we do better than log_2 N?

CHALLENGE ACCEPTED

Fast and space efficient trie searches, Bagwell 2000

Array Mapped Trie

IDEA #5

Use bitmaps to determine presence or absence of a symbol

Let’s say we have 16 symbols, 0…15

0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0

USING BITMAPS

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Does the symbol with offset 6 exist?

mask = 1 << offset

bitmap & mask

0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

bitwise AND with a mask
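The presence check as a REPL sketch, using the 16-bit bitmap from the slide (bits 14, 10, 7, 6 and 5 are set). The name present? is hypothetical.

```clojure
(def bitmap 2r0100010011100000)   ; bit 15 on the left, bit 0 on the right

(defn present? [bitmap offset]
  (let [mask (bit-shift-left 1 offset)]    ; a single 1 at position offset
    (not (zero? (bit-and bitmap mask)))))

(present? bitmap 6)   ; true
(present? bitmap 4)   ; false
```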

There’s an array alongside that only contains entries

for the 1’s. NOT pre-allocated.

What offset in the dynamic array should I look at?

Image: Martin Fisch, flickr.com

USE THE 1’S AS TALLY MARKS

0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0

0 1 2 3 4

MapEntry, MapEntry, SubTrie Pointer, MapEntry, MapEntry

0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0

USING BITMAPS

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Where in the array is the entry for ‘6’?

Integer.bitCount(bitmap & mask)

0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1

Count tally marks to the ‘right’ of offset

How do I create a mask to do that?

mask = (1 << 6) - 1

0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
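The index computation as a REPL sketch over the same bitmap. The name array-index is hypothetical, and Long/bitCount is used here because Clojure integers are longs; it plays the role of the slide's Integer.bitCount.

```clojure
(def bitmap 2r0100010011100000)   ; bits 14, 10, 7, 6 and 5 are set

(defn array-index [bitmap offset]
  (let [mask (dec (bit-shift-left 1 offset))]   ; 1s in every position below offset
    (Long/bitCount (bit-and bitmap mask))))     ; count the tally marks to the right

(array-index bitmap 5)    ; 0, nothing is set below bit 5
(array-index bitmap 6)    ; 1, only bit 5 is below it
(array-index bitmap 14)   ; 4, bits 5, 6, 7 and 10 are below it
```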

What happens if I insert a new map entry?

0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0

0 1 2 3 4

MapEntry MapEntry MapEntry MapEntry MapEntry

0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0

0 1 2 3 4 5

Map Entry

Map Entry

SubTrie Pointer

Map Entry

Map Entry

Map Entry
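A sketch of the insertion step, assuming a node is modelled as a plain map with a :bitmap and an :entries vector (a simplification for illustration, not the layout of the real node class). The entry names and insert-entry are hypothetical, and the offset is assumed not to be present yet.

```clojure
(def node {:bitmap  2r0100010011100000
           :entries [:entry-5 :entry-6 :subtrie-7 :entry-10 :entry-14]})

(defn insert-entry [{:keys [bitmap entries]} offset entry]
  (let [bit (bit-shift-left 1 offset)
        idx (Long/bitCount (bit-and bitmap (dec bit)))]   ; slot for the new entry
    {:bitmap  (bit-or bitmap bit)                         ; set the new tally mark
     :entries (vec (concat (subvec entries 0 idx)
                           [entry]
                           (subvec entries idx)))}))      ; later entries shift right

(def node2 (insert-entry node 13 :entry-13))

(:entries node2)
;; [:entry-5 :entry-6 :subtrie-7 :entry-10 :entry-13 :entry-14]
```

The array grows from five slots to six, and the original node is left untouched.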

A DECENT MAP

Time complexity O(log_m N)

Space efficiency YES

Objects as keys NO

How do we support arbitrary

Objects as keys?

Ideal hash trees, Bagwell 2001

Hashing + AMT

IDEA #6

Ideal hash trees, Bagwell 2001

Use a good hash function to generate an integer key.

STEP 1

0010 1101 1011 1110 1100 1111 1111 1001

hasheq

STEP 2

7 20 21 3 5

Divide the 32 bit integer into ‘symbols’ 5 bits at a time.

00101 00111 10100 10101 00011 01001 01

11

Use the ‘symbols’ to walk down an AMT

t bits per symbol give 2^t symbols; 5 bits give 32, so a node’s bitmap fits in one 32-bit integer

Why 5 bits?

BIT JUGGLING!

Compute ‘symbols’ by shifting and masking

00111000110010110100101010100101

00 00000 00000 00000 00000 00000 11111

(hash >>> shift) & 0x01f

How to calculate nth digit?

Shift by 5*n and mask with 0x1f
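The shift-and-mask step as a REPL sketch. nth-symbol is a hypothetical name, and 123456 simply stands in for a hash value (it is not claimed to be the hash of any particular key). Because Clojure integers are 64-bit longs, the sketch first masks down to 32 bits so that negative hashes behave as they would with Java's int >>>.

```clojure
(defn nth-symbol [h n]
  (-> (bit-and h 0xffffffff)                  ; treat h as an unsigned 32-bit value
      (unsigned-bit-shift-right (* 5 n))      ; shift by 5*n
      (bit-and 0x1f)))                        ; keep the low 5 bits

(map #(nth-symbol 123456 %) (range 7))   ; (0 18 24 3 0 0 0)
(nth-symbol -1 6)                        ; 3, only 2 bits are left at the top
```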

BEST COMMENT EVER.

A persistent rendition of Phil Bagwell's Hash Array Mapped Trie

Hickey R., Grand C., Emerick C., Miller A., Fingerhut A.

Uses path copying for persistence
HashCollision leaves vs. extended hashing
Node polymorphism vs. conditionals
No sub-tree pools or root-resizing
Any errors are my own

PersistentHashMap.java:19

NODE POLYMORPHISM

ArrayNode - 32 wide pointers to sub-tries

BitmapIndexedNode - bitmap + dynamic array

HashCollisionNode - array for things that collide

EXAMPLE

(let [h (zipmap (range 1e6) (range 1e6))] (get h 123456))

10111 111001100101001 0001028259 223

0101100000110

shift = 0 ArrayNode

shift = 5 ArrayNode

shift = 10 ArrayNode

shift = 15 BitmapIndexedNode

… and then follow the AMT down

A GOOD MAP

Time complexity O(log_32 N)

Space efficiency YES

Objects as keys YES

Key compared only once

Bit juggling for great performance!

HAMT

~6 hops to a leaf node

NEED ROOT RESIZING

NOT AMENABLE TO STRUCTURAL SHARING

REGULAR HASH TABLE?

UPDATES?

Search for the key, clone leaf nodes and path to root

VECTORS

ArrayNodes all the way. Break ‘index’ into digits and walk down levels.

INTUITION

(let [arr (vec (range 1e6))] (nth arr 123456))

123456

00000 00000 00000 00011 11000 10010 00000

0 0 0 3 24 18 0

shift = 15 ArrayNode

shift = 10 ArrayNode

shift = 5 ArrayNode

shift = 0 ArrayNode
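The walk down the levels can be sketched with nested plain vectors standing in for ArrayNodes. This is an illustration under that assumption, not the PersistentVector implementation: the tree below has just two levels (32 leaves of 32 elements), and trie-nth is a hypothetical name.

```clojure
;; A root vector of 32 leaf vectors, holding the numbers 0..1023.
(def root (mapv vec (partition 32 (range 1024))))

(defn trie-nth [root shift i]
  (loop [node root, shift shift]
    (let [digit (bit-and (unsigned-bit-shift-right i shift) 0x1f)]
      (if (zero? shift)
        (nth node digit)                          ; last digit picks the element
        (recur (nth node digit) (- shift 5))))))  ; otherwise descend one level

;; 777 = 24 * 32 + 9, so the walk takes child 24 and then slot 9.
(trie-nth root 5 777)   ; 777
```

For the million-element vector in the example, the same loop would start at shift 15 and follow the digits 3, 24, 18 and 0.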

THE TAIL OPTIMIZATION

PersistentVector

count shift root tail

RIGHT TOOL FOR THE JOB

By Schnobby (Own work) [CC BY-SA 3.0], via Wikimedia Commons

HashMaps do not merge efficiently

MAP CATENATION

data.int-map

Okasaki & Gill’s “Fast Mergeable Integer Maps”

Zach Tellman

Vectors do not concat efficiently

Vectors do not subvec efficiently

VECTOR CATENATION

Based on Bagwell and Rompf, “RRB-Trees: Efficient Immutable Vectors”

logarithmic catenation and slicing

Michal Marczyk

core.rrb-vector

TODO: benchmarks

CTRIES

Michał Marczyk

Tomorrow at 0850

1959 de la Briandais, Fredkin Trie

1960 Windley, Booth, Colin, Hibbard Binary Search Trees

1962 Adelson-Velsky, Landis AVL Trees

1978 Guibas, Sedgewick Red Black Trees

1985 Sleator, Tarjan Splay Trees

1996 Okasaki Purely Functional Data Structures

1998 Sedgewick Ternary Search Trees

2000 Phil Bagwell AMT

2001 Phil Bagwell HAMT

2007 Rich Hickey Clojure!

Reading List

Ideal Hash Trees, Bagwell 2001

Fast and Space Efficient Trie Searches, Bagwell 2000

Fast Mergeable Integer Maps, Okasaki & Gill, 1998

The World’s Fastest Scrabble Program, Appel & Jacobson, 1988

File Searching Using Variable Length Keys, de la Briandais, 1959

Purely Functional Data Structures, Okasaki 1996

Polymatheia: Jean Niklas L’Orange

QUESTIONS?

Ask Michal or Zach or Jean Niklas :)

THANK YOU
