Reference classes: a case study with the poweRlaw package

Colin Gillespie
Newcastle University, UK
http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/

Upload: colin-gillespie

Post on 08-Jul-2015

DESCRIPTION

Power-law distributions have been used extensively to characterise many disparate scenarios, inter alia, the sizes of moon craters and annual incomes. Recently, power laws have even been used to characterise terrorist attacks and interstate wars. However, for every correct claim that a particular process obeys a power law, there are many systems that have been incorrectly labelled as scale-free. Part of the reason for these miscategorisations is the lack of easy-to-use software. The poweRlaw package aims to tackle this problem by allowing multiple heavy-tailed distributions to be fitted within a standard framework. Within the package, different distributions are represented using reference classes, which enables a consistent interface for plotting and parameter inference. This talk will describe the advantages (and disadvantages) of using reference classes, in particular how they can be leveraged to allow fast, efficient computation via parameter caching. The talk will also touch upon potential difficulties, such as combining reference classes with parallel computation.

TRANSCRIPT

Page 1: Reference classes: a case study with the poweRlaw package

Reference classes: a case study with the poweRlaw package

Colin Gillespie
Newcastle University, UK

http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/

Page 2: Reference classes: a case study with the poweRlaw package

The power law distribution

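Name         f(x)                                                              Notes
Power law    $x^{-\alpha}$                                                     Pareto distribution
Log-normal   $\frac{1}{x}\exp\left(-\frac{(\ln(x)-\mu)^2}{2\sigma^2}\right)$
Exponential  $e^{-\lambda x}$
Power law    $x^{-\alpha}$                                                     Zeta distribution
Power law    $x^{-\alpha}$                                                     $x = 1, \ldots, n$; Zipf's distribution
Yule         $\frac{\Gamma(x)}{\Gamma(x+\alpha)}$
Poisson      $\lambda^x / x!$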

Page 3: Reference classes: a case study with the poweRlaw package

Alleged power-law phenomena

The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville

The numbers of customers affected in electrical blackouts in the United States between 1984 and 2002

The number of links to web sites found in a 1997 web crawl of about 200 million web pages

The number of hits on web pages

The number of papers scientists write

The number of citations received by papers

Annual incomes

Sales of books, music; in fact anything that can be sold

Page 5: Reference classes: a case study with the poweRlaw package

Zipf plots

[Figure: six log-log panels (Blackouts, Fires, Flares, Moby Dick, Terrorism, Web links) plotting 1 − P(x) against x; x-axis from 10^0 to 10^6, y-axis from 10^-8 to 10^0]

Page 6: Reference classes: a case study with the poweRlaw package

The power law distribution

The power-law distribution is

$$p(x) \propto x^{-\alpha}$$

where α, the scaling parameter, is constant

The scaling parameter typically lies in the range 2 < α < 3, although there are occasional exceptions

When α < 2, the mean and all higher moments are infinite

Typically, the entire process doesn’t obey a power law

Instead, the power law applies only for values greater than some minimum xmin
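
For the continuous case, the restriction to values above xmin can be made concrete with a minimal base R sketch. The helper names `dpl` and `rpl` are hypothetical (they are not poweRlaw functions); the density is the standard normalised form for α > 1, and the sampler simply inverts its CDF.

```r
# Hypothetical helpers, not part of poweRlaw.
# Normalised continuous power-law density on [xmin, Inf), valid for alpha > 1:
#   p(x) = (alpha - 1) / xmin * (x / xmin)^(-alpha)
dpl <- function(x, xmin, alpha) {
  ifelse(x >= xmin, (alpha - 1) / xmin * (x / xmin)^(-alpha), 0)
}

# Inverse-transform sampler: the CDF is F(x) = 1 - (x / xmin)^(1 - alpha),
# so solving u = F(x) for x gives xmin * (1 - u)^(-1 / (alpha - 1)).
rpl <- function(n, xmin, alpha) {
  xmin * (1 - runif(n))^(-1 / (alpha - 1))
}

set.seed(1)
x <- rpl(1e5, xmin = 5, alpha = 2.5)

# The density integrates to one above xmin
total <- integrate(dpl, lower = 5, upper = Inf, xmin = 5, alpha = 2.5)$value

# Theoretical median, xmin * 2^(1 / (alpha - 1)), for comparison with median(x)
median_theory <- 5 * 2^(1 / 1.5)
```

Every simulated value is at least xmin, which mirrors the point above: the model describes only the tail, and says nothing about data below the cut-off.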

Page 8: Reference classes: a case study with the poweRlaw package

Power law: PMF & CMF

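For the discrete power law, the PMF is

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$$

where $\alpha > 1$, $x_{\min} \geq 1$ and

$$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$

is the generalised zeta function

When $x_{\min} = 1$, $\zeta(\alpha, 1)$ is the standard zeta function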

[Figure: two panels, PDF and CDF, of the discrete power law for x from 0 to 50, with α ranging from 1.50 to 2.50; vertical axes from 0 to 1]
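
The PMF above can be evaluated directly, as a rough sketch. Base R has no zeta function, so this example truncates the infinite sum; `hzeta` and `ddpl` are hypothetical names, and the truncation point `n_max` is an assumption chosen for illustration, not something taken from the package.

```r
# Hypothetical helpers, not part of poweRlaw.
# Truncated generalised zeta function: sum over n = 0..n_max of (n + xmin)^(-alpha).
# The neglected tail is roughly n_max^(1 - alpha) / (alpha - 1), so the
# truncation is only adequate when alpha is comfortably above 1.
hzeta <- function(alpha, xmin, n_max = 1e6) {
  sum((0:n_max + xmin)^(-alpha))
}

# Discrete power-law PMF: p(x) = x^(-alpha) / zeta(alpha, xmin) for x >= xmin
ddpl <- function(x, xmin, alpha) {
  ifelse(x >= xmin, x^(-alpha) / hzeta(alpha, xmin), 0)
}

hzeta(2, 1)                # close to pi^2 / 6, the standard zeta function at 2
sum(ddpl(1:1e4, 1, 2.5))   # close to 1
```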

Page 9: Reference classes: a case study with the poweRlaw package

Fitting power laws

The main technique for fitting power laws comes from Clauset et al. (2009)

This paper gets around ten new citations a week

Estimating α given xmin is straightforward: just use the maximum likelihood estimator (MLE)

The lower cut-off, xmin, is estimated using a Kolmogorov-Smirnov approach
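
The two steps can be sketched in base R for the continuous case, where the MLE has the closed form 1 + n / Σ log(x_i / xmin) given in Clauset et al. (2009); the discrete case has no closed form and needs numerical optimisation. The function names, the synthetic data and the candidate grid are all assumptions made for illustration, and this is not how poweRlaw itself is implemented.

```r
# Hypothetical helpers for the continuous case, not poweRlaw functions.
fit_tail <- function(x, xmin) {
  tail_x <- sort(x[x >= xmin])
  n <- length(tail_x)
  # Closed-form MLE of the scaling parameter given xmin
  alpha <- 1 + n / sum(log(tail_x / xmin))
  # Kolmogorov-Smirnov distance between the fitted and empirical CDFs,
  # checked on both sides of each jump of the empirical CDF
  fitted <- 1 - (tail_x / xmin)^(1 - alpha)
  ks <- max(abs(fitted - seq_len(n) / n), abs(fitted - (seq_len(n) - 1) / n))
  c(xmin = xmin, alpha = alpha, ks = ks)
}

# Scan candidate cut-offs and keep the one with the smallest KS distance
scan_xmin <- function(x, candidates) {
  fits <- t(sapply(candidates, function(v) fit_tail(x, v)))
  fits[which.min(fits[, "ks"]), ]
}

set.seed(42)
# Synthetic data: uniform noise below 2, a power law (alpha = 2.5) above 2
body_x <- runif(3000, min = 0.1, max = 2)
tail_x <- 2 * (1 - runif(3000))^(-1 / 1.5)
x <- c(body_x, tail_x)

pure <- fit_tail(tail_x, xmin = 2)
best <- scan_xmin(x, candidates = c(0.5, 1, 2, 4, 8))
```

Cut-offs below the true value pull non-power-law data into the fit and inflate the KS distance, which is why minimising it tends to reject them.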

Page 10: Reference classes: a case study with the poweRlaw package

The poweRlaw package

The package is available on CRAN and at

https://github.com/csgillespie/poweRlaw

Makes power laws easy to fit

Crucially, it makes fitting (to the tails of) the log-normal, exponential and Poisson distributions equally easy

Consistent interface between distributions

Estimate parameter uncertainty

Compare distributions (statistically and visually)

Page 11: Reference classes: a case study with the poweRlaw package

Case study: Moby Dick

R> m_pl = displ$new(moby)

Page 12: Reference classes: a case study with the poweRlaw package

Case study: Moby Dick

R> m_pl = displ$new(moby)

R> plot(m_pl)

[Figure: word frequencies ("Words") against "CDF" on log-log axes; x-axis from 10^0 to 10^4, y-axis from 10^-4 to 10^0]

Page 13: Reference classes: a case study with the poweRlaw package

Case study: Moby Dick

R> m_pl = displ$new(moby)

R> (est = estimate_xmin(m_pl))

$KS

[1] 0.009229

$xmin

[1] 7

$pars

[1] 1.95

attr(,"class")

[1] "estimate_xmin"

[Figure: word frequencies ("Words") against "CDF" on log-log axes; x-axis from 10^0 to 10^4, y-axis from 10^-4 to 10^0]

Page 14: Reference classes: a case study with the poweRlaw package

Case study: Moby Dick

R> m_pl = displ$new(moby)

R> est = estimate_xmin(m_pl)

R> m_pl$setXmin(est)

[Figure: word frequencies ("Words") against "CDF" on log-log axes; x-axis from 10^0 to 10^4, y-axis from 10^-4 to 10^0]

Page 15: Reference classes: a case study with the poweRlaw package

Case study: Moby Dick

R> m_pl = displ$new(moby)

R> est = estimate_xmin(m_pl)

R> m_pl$setXmin(est)

R> lines(m_pl)

[Figure: word frequencies ("Words") against "CDF" on log-log axes, with the line added by lines(m_pl); x-axis from 10^0 to 10^4, y-axis from 10^-4 to 10^0]

Page 16: Reference classes: a case study with the poweRlaw package

Case study: Moby Dick

R> m_pl = displ$new(moby)

R> est = estimate_xmin(m_pl)

R> m_pl$setXmin(est)

R> lines(m_pl)

R> m_ln = dislnorm$new(moby)

R> est = estimate_xmin(m_ln)

R> m_ln$setXmin(est)

R> lines(m_ln)

[Figure: word frequencies ("Words") against "CDF" on log-log axes, with the lines added by lines(m_pl) and lines(m_ln); x-axis from 10^0 to 10^4, y-axis from 10^-4 to 10^0]

Page 17: Reference classes: a case study with the poweRlaw package

Why use objects?

Each distribution is represented by an object:
  Parent class: distribution
  Power-law: displ, log-normal: dislnorm, . . .

Method dispatch on object class:
  dist_pdf(m) returns the probability density function based on the class of m

Consistent interface:
  Bootstrapping:

R> bootstrap(m)

  Model selection:

R> compare_distributions(m1, m2)

Simple interface that enables easy addition of new distributions (currently there are seven available distributions to fit)
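
The pattern described above (a parent class, child classes per distribution, and one generic that dispatches on the class) might look roughly like the following. All class and function names here are hypothetical stand-ins chosen for the sketch; they are not poweRlaw's own definitions.

```r
library(methods)

# Hypothetical class and generic names; a sketch of the pattern, not poweRlaw's code.
# Parent class holding what every distribution object needs.
SketchDist <- setRefClass("SketchDist",
  fields = list(xmin = "numeric", pars = "numeric"))

# Child classes: continuous power law and (shifted) exponential tail
SketchPL <- setRefClass("SketchPL", contains = "SketchDist")
SketchExp <- setRefClass("SketchExp", contains = "SketchDist")

# One generic, with the method chosen from the class of m
setGeneric("tail_pdf", function(m, q) standardGeneric("tail_pdf"))
setMethod("tail_pdf", "SketchPL", function(m, q) {
  (m$pars - 1) / m$xmin * (q / m$xmin)^(-m$pars)
})
setMethod("tail_pdf", "SketchExp", function(m, q) {
  m$pars * exp(-m$pars * (q - m$xmin))
})

# Code written against the generic works for every distribution class
total_mass <- function(m) {
  integrate(function(q) tail_pdf(m, q), lower = m$xmin, upper = Inf)$value
}

pl <- SketchPL$new(xmin = 1, pars = 2.5)
ex <- SketchExp$new(xmin = 1, pars = 0.5)
total_mass(pl)  # close to 1
total_mass(ex)  # close to 1
```

Adding a new distribution under this design means adding one class and one method; helpers such as `total_mass` need no changes.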

Page 18: Reference classes: a case study with the poweRlaw package

Reference classes

Reference classes behave like classes in C++, Python and many other languages, not like standard R classes

You can use these classes with ordinary R expressions and functions

An extension to core R (October, 2010)

Big difference - mutable state

Page 19: Reference classes: a case study with the poweRlaw package

Mutable states

R> displ = setRefClass("displ", fields = "xmin")

R> d1 = displ$new(xmin = 1)

R> d1$xmin

[1] 1

R> d2 = d1

R> d2$xmin = 100

R> d2$xmin

[1] 100

R> d1$xmin

[1] 100
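
When an independent object is wanted instead of a second name for the same one, reference class objects provide a `copy()` method. A minimal sketch, using a hypothetical class name so that it does not clash with the `displ` defined above:

```r
library(methods)

# Hypothetical class, used only to illustrate aliasing versus copying
Cutoff <- setRefClass("Cutoff", fields = list(xmin = "numeric"))

d1 <- Cutoff$new(xmin = 1)
d2 <- d1          # a second name for the same object
d3 <- d1$copy()   # an independent copy

d2$xmin <- 100
d1$xmin  # 100: d1 and d2 share state
d3$xmin  # 1: the copy is unaffected
```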

Page 22: Reference classes: a case study with the poweRlaw package

Mutable states

When estimating xmin, a naive implementation makes this calculation slow

Efficient caching speeds up calculations 100-fold

For example, using the call

R> m_pl$setXmin(10)

updates internal variables that make future calculations quicker

On creation of a distribution object, we make "multiple copies" of the data

R> x

R> cumsum(log(x))

Using reference classes avoids constant copying and speeds up calculations

R> pl_ref$xmin = 10

R> pl_s4@xmin = 10
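
The caching idea can be sketched as follows. This is an illustration under assumptions, not poweRlaw's internals: the class, its fields and its methods are hypothetical, and the estimator is the continuous-case MLE. The data are sorted and their cumulative log-sum stored once at creation; changing xmin then updates a little state in place. By contrast, assigning to a slot of an ordinary S4 object (as in `pl_s4@xmin = 10`) follows R's usual copy-on-modify semantics.

```r
library(methods)

# Hypothetical class; a sketch of the caching idea, not poweRlaw's internals.
CachedPL <- setRefClass("CachedPL",
  fields = list(sorted = "numeric", cum_log = "numeric",
                xmin = "numeric", n_tail = "numeric"),
  methods = list(
    initialize = function(x = numeric(0), ...) {
      # Done once: sort largest-first and cache the running sum of logs
      sorted <<- sort(x, decreasing = TRUE)
      cum_log <<- cumsum(log(sorted))
      callSuper(...)
    },
    setXmin = function(value) {
      # Cheap update of internal state; the data are never re-sorted or copied
      xmin <<- value
      n_tail <<- sum(sorted >= value)
      invisible(.self)
    },
    alpha_mle = function() {
      # sum(log(x / xmin)) over the tail, read straight from the cache
      1 + n_tail / (cum_log[n_tail] - n_tail * log(xmin))
    }
  ))

x <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
m <- CachedPL$new(x)
m$setXmin(3)
m$alpha_mle()
```

Scanning many candidate cut-offs then costs one `setXmin` call and a couple of arithmetic operations each, instead of re-subsetting and re-summing the data every time.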

Page 24: Reference classes: a case study with the poweRlaw package

Comments

Reference classes are still new
  Code has now broken twice with R upgrades
  roxygen2 and reference classes didn’t play well together

Very few questions on Stack Overflow about reference classes

Structuring code and files

Care has to be taken when using them with parallel computing
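
One plausible reason for that care, offered as an assumption about the kind of pitfall meant here and not something the slides spell out: an object sent to a worker process is serialised, so the worker operates on its own copy and in-place updates made there never reach the original. The sketch below emulates that round trip within a single session using `serialize()` and `unserialize()`; the class name is hypothetical.

```r
library(methods)

# Hypothetical class, used only to illustrate the round trip
SharedState <- setRefClass("SharedState", fields = list(xmin = "numeric"))

m <- SharedState$new(xmin = 1)

# Emulate shipping the object to a worker: serialise, then rebuild it
worker_copy <- unserialize(serialize(m, connection = NULL))

# An in-place update on the worker's side...
worker_copy$xmin <- 50

# ...does not propagate back to the original object
m$xmin            # 1
worker_copy$xmin  # 50
```

Under this assumption, results computed in parallel have to be returned explicitly and written back to the original object in the parent process.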

Page 25: Reference classes: a case study with the poweRlaw package

References

Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review 51.4 (2009): 661–703.

poweRlaw package

https://github.com/csgillespie/poweRlaw