Fast lookups in R Joseph Adler April 13 2010

Page 1: R meetup talk

Fast lookups in R

Joseph Adler

April 13 2010

Page 2: R meetup talk

About me

Relevant work

• Tasks

– Computer security research

– Credit risk modeling

– Pricing strategy

– Direct marketing

• Places

– American Express

– Johnson and Johnson

– DoubleClick

– VeriSign

– LinkedIn (now)

Page 3: R meetup talk

About me

Books

Page 4: R meetup talk

Today’s talk

What I wrote

If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value in a list with n elements can take O(n) time. Looking up the value in an environment object takes O(1) time on average.

Page 5: R meetup talk

Today’s talk

What I read after the book was printed

Re: [R] beginner Q: hashtable or dictionary?

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon 30 Jan 2006 - 18:37:00 EST

On Sun, 29 Jan 2006, hadleywickham wrote:

>> use a 'list':
>
> Is a list O(1) for setting and getting?

Can you elaborate? R is a vector language, and normally you create a list in one pass, and you can retrieve multiple elements at once.

Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used. Does the following item from ONEWS answer your question?

    Indexing a vector by a character vector was slow if both the vector and index were long (say 10,000). Now hashing is used and the time should be linear in the longer of the lengths (but more memory is used).

Indexing by number is O(1) except where replacement causes the list vector to be copied. There is always the option to use match() to convert to numeric indexing.

-- Brian D. Ripley,
Professor of Applied Statistics,
University of Oxford

Retrieving elements by name from a

long vector (including a list) is very

fast, as an internal hash table is used.

Professor Brian D. Ripley
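The match() suggestion in the email can be sketched as follows. The table `scores` and the keys in `wanted` are invented for illustration; this is only a minimal example of converting names to positions once and then indexing by number.

```r
# Hypothetical lookup table and keys, invented for illustration.
scores <- c(Joe = 1, Bob = 2, Jim = 3)
wanted <- c("Jim", "Joe", "Ann")

# match() returns the position of each key in names(scores),
# or NA when the key is absent.
positions <- match(wanted, names(scores))

# Index by position; unname() drops the names so only values remain.
values <- unname(scores[positions])
print(positions)
print(values)
```

Keys that are not in the table come back as NA rather than raising an error, so the result can be checked with is.na().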

Page 6: R meetup talk

Today’s talk

• A short introduction to objects in R

• Looking up values in R

– How lookup tables are implemented in R

– Measuring lookup speed

– Optimizing lookup speed

Page 7: R meetup talk

Objects in R

Everything in R is an object. Here are some

examples of objects.

Numeric Vector:

> onehalf <- 1/2
> class(onehalf)
[1] "numeric"

Page 8: R meetup talk

Objects in R

Integer Vector:

> four <- as.integer(4)
> four
[1] 4
> class(four)
[1] "integer"

Page 9: R meetup talk

Objects in R

Character vector:

> zero <- "zero"
> class(zero)
[1] "character"

Page 10: R meetup talk

Objects in R

Logical vector:

> this.is.interesting <- FALSE
> class(this.is.interesting)

[1] "logical"

Page 11: R meetup talk

Objects in R

Vectors can have multiple elements

> one.to.five <- 1:5
> class(one.to.five)
[1] "integer"
> six.to.ten <- c(6, 7, 8, 9, 10)
> class(six.to.ten)

[1] "numeric"

Page 12: R meetup talk

Objects in R

Lists contain heterogeneous collections of objects

> stuff <- list(3.14, "hat", FALSE)
> class(stuff)

[1] "list"

Page 13: R meetup talk

Objects in R

Functions are also objects in R:

> f <- function(x, y) {
+   x + y
+ }
> f
function(x, y) {
  x + y
}
> class(f)

[1] "function"

Page 14: R meetup talk

Objects in R

Environments map names to objects. They are

used within R itself to map variable names to

objects. You can access these environment

objects, or create your own.

> one <- 1
> two <- 2
> three <- 3
> objects()
[1] "one" "three" "two"
> e <- .GlobalEnv
> class(e)
[1] "environment"
> objects(e)

[1] "e" "one" "three" "two"
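Creating your own environment might look like the sketch below. The variable `ages` and its contents are invented for illustration; the calls used (new.env, assign, get, ls) are standard base R functions.

```r
# Create a fresh environment, separate from the global environment.
ages <- new.env()

# Bind names to values inside it (names and values are invented).
assign("Joe", 34, envir = ages)
ages[["Bob"]] <- 41

# List the names it holds, then fetch one value by name.
held <- ls(ages)
joe <- get("Joe", envir = ages)
print(held)
print(joe)
```

Unlike a vector or list, an environment is not copied when it is passed to a function, so changes made through any reference to `ages` are visible everywhere.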

Page 15: R meetup talk

Lookups

You can look up an item in a vector, list, or array

within R

– Let's define a vector:

> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> a
 [1]  1  2  3  4  5  6  7  8  9 10

– You can refer to elements by index:

> a[3]

[1] 3

Page 16: R meetup talk

Lookups

It's also possible to name elements in a vector, then refer to them by name:

> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2

This can be very convenient: you can use every vector in R

as a table. You can access the name vector through the

names function:

> names(b)

[1] "Joe" "Bob" "Jim"

Page 17: R meetup talk

Lookups

Named vectors in R are implemented using two

different arrays:

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Page 18: R meetup talk

Lookups

The name lookup algorithm works roughly like this:

function(vector, name) {
  # scan the names one at a time until a match is found
  for (i in seq_along(vector)) {
    if (names(vector)[i] == name)
      return(vector[i])
  }
  # no name matched
  return(NA)
}

Page 19: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Page 20: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Current comparison: names(a.20)[1]

Page 21: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Current comparison: names(a.20)[2]

Page 22: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Current comparison: names(a.20)[4]

Page 23: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Current comparison: names(a.20)[4]

Page 24: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Current comparison: names(a.20)[5]

Page 25: R meetup talk

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Current comparison: names(a.20)[5]

Page 26: R meetup talk

Lookups

In vectors,

– Looking up a value by index takes a constant amount

of time.

– Looking up a value by name (potentially) requires

looking at every name in the names array. (This

means that lookup times scale linearly with the

number of items in the table.)

Page 27: R meetup talk

Lookups

Environments store (and fetch) data using a

different structure. They use hash tables.

Hash tables rely on a hash function to map labels

to indices.

Page 28: R meetup talk

Lookups

Simple hash table implementation

Example: store 15 ¾ for "Joe"

1. Calculate h("Joe")
2. Store 15 ¾ in the table in slot h("Joe")

h("Joe") = 4

Slot  Value
1
2
3
4     15 ¾
5
6

Page 29: R meetup talk

Lookups

If you carefully choose the size of the hash table and the hash function, you can store and look up values in constant time (on average) in hash tables.
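The idea can be sketched with a toy hash table written in R. This is an illustration only, not how R implements environments: the hash function `h`, the helpers `store` and `fetch`, and the six-slot table are all made up for this example, and this `h` need not place "Joe" in the same slot as the slide does.

```r
# Toy hash table with a fixed number of slots (illustration only).
n.slots <- 6
slots <- vector("list", n.slots)

# Hypothetical hash function: sum of character codes, mapped to 1..n.slots.
h <- function(label) sum(utf8ToInt(label)) %% n.slots + 1

# Store: append a (label, value) pair to the bucket chosen by h().
store <- function(slots, label, value) {
  k <- h(label)
  slots[[k]] <- c(slots[[k]], list(list(label = label, value = value)))
  slots
}

# Fetch: scan only the bucket chosen by h(), not the whole table.
fetch <- function(slots, label) {
  for (entry in slots[[h(label)]]) {
    if (entry$label == label) return(entry$value)
  }
  NA
}

slots <- store(slots, "Joe", 15.75)
slots <- store(slots, "Bob", 2)
print(fetch(slots, "Joe"))
```

Each bucket holds a short list, so two labels that hash to the same slot can coexist. Lookups stay fast as long as the table is large enough that buckets stay short.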

Page 30: R meetup talk

Measuring Lookup Speed

In theory, looking up values in environments

should be faster than looking up values in vectors.

In practice, how much difference does this make?

Let’s measure how much time it takes to look up

values in vectors and environments, using different

lookup methods

Page 31: R meetup talk

Measuring Lookup Speed

Let's build a large, labeled vector for testing:

labeled.array <- function(n) {
  a <- 1:n
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    names(a)[i] <- chartr(from, to, i)
  }
  a
}

Here's an example of the output of this function:

> a.20 <- labeled.array(20)
> a.20
 A  B  C  D  E  F  G  H  I AJ AA AB AC AD AE AF AG AH AI BJ
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Page 32: R meetup talk

Measuring Lookup Speed

Let's also create environment objects for testing:

labeled.environment <- function(n) {
  e <- new.env(hash=TRUE, size=n)
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    assign(x=chartr(from, to, i), value=i, envir=e)
  }
  e
}

Here’s an example of the output of this function:

> e.20 <- labeled.environment(20)

> e.20

<environment: 0x143756c>

Page 33: R meetup talk

Measuring Lookup Speed

You can fetch values from an environment object with the get function

> get("A", envir=e.20)
[1] 1
> get("BJ", envir=e.20)
[1] 20

You can also fetch values from an environment with the double bracket operator

> e.20[["A"]]
[1] 1
> e.20[["BJ"]]
[1] 20
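Two related base R functions, exists() and mget(), are often useful with lookup environments. The sketch below builds its own small environment `e` with invented labels, so it does not depend on e.20.

```r
# Small environment built for this example (labels are invented).
e <- new.env(hash = TRUE)
assign("A", 1, envir = e)
assign("B", 2, envir = e)

# exists() tests for a name; inherits = FALSE keeps the search inside e
# instead of continuing into enclosing environments.
has.A <- exists("A", envir = e, inherits = FALSE)
has.Z <- exists("Z", envir = e, inherits = FALSE)

# mget() fetches several names in one call and returns a named list.
vals <- mget(c("A", "B"), envir = e)

# get() searches enclosing environments by default, so it finds base R's
# sum() here; the double bracket operator looks only inside e.
found.by.get <- is.function(get("sum", envir = e))
found.by.bracket <- e[["sum"]]
print(vals)
```

For a pure lookup table, passing inherits = FALSE to get() (or using double brackets) avoids accidentally picking up objects defined outside the table.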

Page 34: R meetup talk

Measuring Lookup Speed

• Creating examples for testing

arrays <- list()

for (i in 10:15) {

arrays[[as.character(2 ** i)]] <-

labeled.array(2 ** i)

}

environments <- list()

for (i in 10:15) {

environments[[as.character(2 ** i)]] <-

labeled.environment(2 ** i)

}

Page 35: R meetup talk

Measuring Lookup Speed

• Using the test function:

test_expressions("first element, by index:",
  function(d, l, r) {
    s <- 0
    for (v in 1:r) {
      s <- s + d[1]
    }
  },
  arrays, 1024)

• Output:

first element, by index:
 1024  2048  4096  8192 16384 32768
0.010 0.003 0.004 0.003 0.005 0.004

Page 36: R meetup talk

Measuring Lookup Speed

• Results for 1024 lookups (user time in seconds, by number of elements in the table):

Lookup method                                  1024   2048   4096   8192  16384  32768
Array index First                             0.010  0.003  0.004  0.003  0.005  0.004
Array index Last                              0.010  0.004  0.004  0.004  0.003  0.004
Array Label Single Bracket First              0.268  0.282  0.588  1.439  2.728  5.397
Array Label Single Bracket Last               0.173  0.278  0.582  1.517  2.713  5.266
Array Label Double Bracket Exact First        0.002  0.002  0.002  0.002  0.003  0.002
Array Label Double Bracket Exact Last         0.036  0.070  0.136  0.273  0.549  1.107
Array Label Double Bracket Not exact First    0.010  0.003  0.003  0.002  0.003  0.003
Array Label Double Bracket Not exact Last     0.042  0.069  0.137  0.275  0.551  1.112
Environment Label First                       0.012  0.005  0.006  0.006  0.005  0.005
Environment Label Last                        0.012  0.005  0.006  0.005  0.006  0.005

Page 37: R meetup talk

Measuring Lookup Speed

• Results for 1024 lookups (same table as on page 36):

Notice that the times for label lookups with single brackets, and for double-bracket label lookups of the last element, increase roughly linearly with the number of elements in the array

Page 38: R meetup talk

Measuring Lookup Speed

• Results for 1024 lookups (same table as on page 36):

Let’s focus on the results for the largest arrays (which are the

most precise)

Page 39: R meetup talk

Measuring Lookup Speed

• Results for 1024 lookups, 32768 elements:

Lookup method                                 Time (s)
Array index First                                0.004
Array index Last                                 0.004
Array Label Single Bracket First                 5.397
Array Label Single Bracket Last                  5.266
Array Label Double Bracket Exact First           0.002
Array Label Double Bracket Exact Last            1.107
Array Label Double Bracket Not exact First       0.003
Array Label Double Bracket Not exact Last        1.112
Environment Label First                          0.005
Environment Label Last                           0.005
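The "Exact" and "Not exact" rows presumably refer to the exact argument of the double bracket operator, which controls partial matching of names. A minimal sketch, using an invented named list `b`:

```r
# Hypothetical named list, invented for illustration.
b <- list(Joe = 1, Bob = 2, Jim = 3)

# Exact matching (the default) requires the full name.
full <- b[["Bob"]]

# With exact = FALSE, a unique prefix of a name is accepted.
partial <- b[["Bo", exact = FALSE]]

# With exact matching, the prefix alone finds nothing in a list.
missing.entry <- b[["Bo"]]
print(c(full, partial))
```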

Page 40: R meetup talk

Optimizing Lookup Speed

How to write efficient code:

1. Write code for clarity, not speed.

2. Check to see if the code is fast enough. If it is fast enough, stop.

3. Test your code to find where time is being spent.

4. Fix the parts of your code that are taking the most time.

5. Go to step 2.

Page 41: R meetup talk

Optimizing Lookup Speed

• How do you make lookups fast?

– Lookups by position are fastest

– If you have to look up single values by name, write your code with double brackets

• Double-bracket lookups are a little faster than single-bracket lookups

• If you discover that your code is too slow, you can easily

change from vectors to environments
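One way the switch from a named vector to an environment might look is sketched below. The vector `prices` is invented for illustration; list2env() is a base R function that copies the elements of a named list into an environment.

```r
# Hypothetical named vector standing in for an existing lookup table.
prices <- c(apple = 0.5, pear = 0.75, plum = 0.25)

# as.list() keeps the names; list2env() copies each element into the
# given environment under its name.
price.env <- list2env(as.list(prices), envir = new.env(hash = TRUE))

# Double-bracket lookups read the same against either structure.
from.vector <- prices[["pear"]]
from.env <- price.env[["pear"]]
print(c(from.vector, from.env))
```

Because the lookup expression is identical in both cases, code written with double brackets needs no changes at the call sites when the table is swapped.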

Page 42: R meetup talk

Optimizing Lookup Speed

• What if

– Your code is too slow

– You need to look up values by name

– It would be hard to change your code to use double-bracket notation

• Define a bracket operator for environments!

Page 43: R meetup talk

Optimizing Lookup Speed

Remember that every operation in R is a function call, even lookup operators.

Example code:

> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]

Bob

2

Page 44: R meetup talk

Optimizing Lookup Speed

Translation of the example code:

> b["Bob"]
Bob
  2
> as.list(quote(b["Bob"]))

[[1]]

`[`

[[2]]

b

[[3]]

[1] "Bob"

Page 45: R meetup talk

Optimizing Lookup Speed

R translates

b["B"]

to

`[`(b, "B")
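A quick way to check this equivalence, and one practical use of it, is sketched below with the same kind of named vector used earlier in the talk.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# The operator form and the function-call form give the same result.
same <- identical(b["Bob"], `[`(b, "Bob"))

# Because `[` is an ordinary function, it can be passed to other
# functions. Here sapply() takes the first element of each vector.
firsts <- sapply(list(1:3, 4:6), `[`, 1)
print(same)
print(firsts)
```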

Page 46: R meetup talk

Optimizing Lookup Speed

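Here is the code for our new subset function

`[` <- function(x, ...) {
  if (identical(class(x), "environment")) {
    # ..1 is the first subscript: the name to look up
    get(x=..1, envir=x)
  } else {
    # forward all subscripts unchanged, so b["Bob"] stays a
    # one-subscript call and drop= is passed through as given
    .Primitive("[")(x, ...)
  }
}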

Page 47: R meetup talk

Optimizing Lookup Speed

Assignments through bracket notation are a little

funny. For example, R evaluates

x[3:5] <- 13:15

as if this code had been executed:

`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)

rm(`*tmp*`)
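The same mechanism can be seen by calling the replacement function directly, and by defining a replacement function of your own. In the sketch below, `second<-` is a made-up function name used only for illustration.

```r
# Assignment form.
x <- c(1, 2, 3, 4, 5, 6)
x[3:5] <- c(13, 14, 15)

# Function-call form, applied to a fresh vector with the same contents.
y <- `[<-`(c(1, 2, 3, 4, 5, 6), 3:5, value = c(13, 14, 15))
same <- identical(x, y)

# A user-defined replacement function: its name ends in "<-" and its
# last argument is called value. It returns the modified object.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
z <- c(1, 2, 3)
second(z) <- 9
print(same)
print(z)
```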

Page 48: R meetup talk

Optimizing Lookup Speed

Here is the code for our new subset assignment function

`[<-` <- function(x, ..., value) {
  if (identical(class(x), "environment")) {
    # ..1 is the first subscript: the name to assign
    assign(x=..1, value=value, envir=x)
    # the assign statement returns value,
    # but we want to return the environment:
    x
  } else {
    # forward all subscripts unchanged and pass value by name
    .Primitive("[<-")(x, ..., value=value)
  }
}

Page 49: R meetup talk

How to reach me

twitter: @jadler

http://www.linkedin.com/in/josephadler

[email protected]

Page 50: R meetup talk

Backup Slides

Page 51: R meetup talk

• A function to test the performance of a lookup function on an object:

test_expressions <- function(description, fun, data, reps) {
  cat(paste(description, "\n"))
  results <- vector()
  for (n in names(data)) {
    results[[n]] <- system.time(
      fun(data[[n]], as.integer(n), reps)
    )[["user.self"]]
  }
  print(results)
}

Page 52: R meetup talk

To figure out the full argument list for the bracket operator, use the getGeneric function:

> getGeneric("[")

standardGeneric for "[" defined from package "base"

function (x, i, j, ..., drop = TRUE)

standardGeneric("[", .Primitive("["))

<environment: 0x11a6828>

Methods may be defined for arguments: x, i, j, drop

Use showMethods("[") for currently available ones.

Page 53: R meetup talk

In general, you should set new methods with the setMethod function. Example:

setClass("myenv", representation(e="environment"))
setMethod("[",
  signature(x="myenv", i="character", j="missing"),
  function(x, i, j, ..., drop=TRUE) {
    get(x=i, envir=x@e)
  }
)

Unfortunately, R doesn’t let you redefine these operators for environments, so we have to do something trickier.
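A different route, not the one taken in this talk, is to tag an environment with an S3 class and define bracket methods for that class. The sketch below assumes a made-up class name, "lookup", and a made-up constructor, new.lookup; it affects only environments created through that constructor, not plain environments.

```r
# Sketch only: tag an environment with a made-up S3 class, "lookup".
new.lookup <- function() {
  e <- new.env(hash = TRUE)
  class(e) <- "lookup"
  e
}

# Single-bracket read: fetch one name from the environment itself.
`[.lookup` <- function(x, i, ...) {
  get(i, envir = x, inherits = FALSE)
}

# Single-bracket write: bind the name, then return the environment.
`[<-.lookup` <- function(x, i, ..., value) {
  assign(i, value, envir = x)
  x
}

tbl <- new.lookup()
tbl["Joe"] <- 15.75
joe <- tbl["Joe"]
print(joe)
```

This leaves the base bracket operators untouched for every other object, at the cost of having to create lookup tables through the constructor.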