Reference classes: a case study with the poweRlaw package

Colin Gillespie, Newcastle University, UK

http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/

The power law distribution

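Name          f(x)                                                                  Notes

Power law     $x^{-\alpha}$                                                         Pareto distribution

Log-normal    $\frac{1}{x} \exp\left(-\frac{(\ln(x) - \mu)^2}{2\sigma^2}\right)$

Exponential   $e^{-\lambda x}$

Power law     $x^{-\alpha}$                                                         Zeta distribution

Power law     $x^{-\alpha}$                                                         $x = 1, \ldots, n$; Zipf's distribution

Yule          $\Gamma(x) / \Gamma(x + \alpha)$

Poisson       $\lambda^x / x!$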

Alleged power-law phenomena

The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville

The numbers of customers affected in electrical blackouts in the United States between 1984 and 2002

The number of links to web sites found in a 1997 web crawl of about 200 million web pages

The number of hits on web pages

The number of papers scientists write

The number of citations received by papers

Annual incomes

Sales of books, music; in fact anything that can be sold

Zipf plots

[Figure: six log-log panels (Blackouts, Fires, Flares, Moby Dick, Terrorism, Web links) plotting 1 − P(x) against x]

The power law distribution

The power-law distribution is

$$p(x) \propto x^{-\alpha}$$

where $\alpha$, the scaling parameter, is constant

The scaling parameter typically lies in the range $2 < \alpha < 3$, although there are occasional exceptions

When $\alpha < 2$, the mean and all higher moments are infinite

Typically, the entire process doesn't obey a power law

Instead, the power law applies only for values greater than some minimum $x_{\min}$

Power law: PMF & CMF

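For the discrete power law, the PMF is

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$$

where $\alpha > 1$, $x_{\min} \ge 1$ and

$$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$

is the generalised zeta function

When $x_{\min} = 1$, $\zeta(\alpha, 1)$ is the standard zeta function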
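The PMF above can be evaluated numerically. The following is a minimal sketch in base R, not the poweRlaw implementation: `hurwitz_zeta` and `dpl_discrete` are hypothetical helper names, and the infinite sum is approximated by a truncated sum plus an Euler-Maclaurin tail correction, which is an assumption made here because base R has no built-in generalised zeta function.

```r
# Approximate the generalised zeta function: sum the first n_terms terms
# directly, then add an integral-based correction for the remaining tail.
hurwitz_zeta <- function(alpha, xmin, n_terms = 1e4) {
  stopifnot(alpha > 1, xmin >= 1)
  n <- 0:(n_terms - 1)
  head_sum <- sum((n + xmin)^(-alpha))
  a <- n_terms + xmin
  tail_sum <- a^(1 - alpha) / (alpha - 1) + 0.5 * a^(-alpha)
  head_sum + tail_sum
}

# Discrete power-law PMF: x^(-alpha) / zeta(alpha, xmin) for x >= xmin.
dpl_discrete <- function(x, alpha, xmin) {
  ifelse(x >= xmin, x^(-alpha) / hurwitz_zeta(alpha, xmin), 0)
}

# With xmin = 1 this is the standard zeta function, so the value at
# alpha = 2 should be close to pi^2 / 6.
hurwitz_zeta(2, 1)

# Probabilities for the first few values above the cut-off.
dpl_discrete(2:6, alpha = 2.5, xmin = 2)
```

Increasing `n_terms` trades speed for accuracy; for the values of $\alpha$ discussed here the default should be ample.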

[Figure: PDF and CDF of the discrete power law for x from 0 to 50, with $\alpha$ between 1.50 and 2.50]

Fitting power laws

The main technique for fitting power laws comes from Clauset et al. (2009)

This paper gets around ten new citations a week

Estimating $\alpha$ given $x_{\min}$ is straightforward: just use the maximum likelihood estimator (MLE)

The lower cut-off, $x_{\min}$, is estimated using a Kolmogorov-Smirnov approach
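The two steps can be sketched in base R. This is an illustration under stated assumptions rather than the package's code: it uses the continuous power law, whose MLE has the closed form $\hat\alpha = 1 + n / \sum_i \ln(x_i / x_{\min})$, whereas discrete data such as word counts need the zeta-based likelihood instead. The function names `fit_alpha`, `ks_distance` and `choose_xmin` are hypothetical.

```r
# MLE of alpha for a continuous power law, using observations >= xmin.
fit_alpha <- function(x, xmin) {
  tail_x <- x[x >= xmin]
  1 + length(tail_x) / sum(log(tail_x / xmin))
}

# Kolmogorov-Smirnov distance between the empirical CDF of the tail
# and the fitted power-law CDF, 1 - (x / xmin)^(1 - alpha).
ks_distance <- function(x, xmin) {
  tail_x <- sort(x[x >= xmin])
  n <- length(tail_x)
  alpha <- fit_alpha(tail_x, xmin)
  fitted_cdf <- 1 - (tail_x / xmin)^(1 - alpha)
  empirical_cdf <- seq_len(n) / n
  max(abs(empirical_cdf - fitted_cdf))
}

# Pick the candidate cut-off with the smallest KS distance.
choose_xmin <- function(x, candidates) {
  distances <- vapply(candidates, function(cand) ks_distance(x, cand),
                      numeric(1))
  candidates[which.min(distances)]
}

# Simulated power-law data (alpha = 2.5, xmin = 2) via inverse transform.
set.seed(1)
u <- runif(5000)
x <- 2 * (1 - u)^(-1 / (2.5 - 1))

alpha_hat <- fit_alpha(x, xmin = 2)
xmin_hat <- choose_xmin(x, candidates = c(2, 3, 4, 5))
```

Note that every candidate cut-off triggers a fresh fit of $\alpha$, which is why a naive search over many candidates can be slow.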

The poweRlaw package

The package is available on CRAN and at

https://github.com/csgillespie/poweRlaw

Makes power laws easy to fit

Crucially, it makes fitting (to the tails of) the log-normal, exponential and Poisson distributions equally easy

Consistent interface between distributions

Estimate parameter uncertainty

Compare distributions (statistically and visually)


Case study: Moby Dick

R> library("poweRlaw")

R> data("moby")

R> m_pl = displ$new(moby)



R> plot(m_pl)

[Figure: CDF of the Moby Dick word frequencies on log-log axes; x axis "Words" from 10^0 to 10^4, y axis "CDF" from 10^-4 to 10^0]



R> (est = estimate_xmin(m_pl))

$KS

[1] 0.009229

$xmin

[1] 7

$pars

[1] 1.95

attr(,"class")

[1] "estimate_xmin"




R> est = estimate_xmin(m_pl)

R> m_pl$setXmin(est)






R> lines(m_pl)

[Figure: the same CDF plot of the Moby Dick data ("Words" against "CDF", log-log axes) with the fitted power-law line added]





R> lines(m_pl)

R> m_ln = dislnorm$new(moby)

R> est = estimate_xmin(m_ln)

R> m_ln$setXmin(est)

R> lines(m_ln)

[Figure: the same CDF plot of the Moby Dick data ("Words" against "CDF", log-log axes) with the fitted power-law and log-normal lines]

Why use objects?

Each distribution is represented by an object:

Parent class: distribution

Power-law: displ, log-normal: dislnorm, . . .

Method dispatch on object class:

dist_pdf(m) returns the probability density function based on the class of m

Consistent interface:

Bootstrapping:

R> bootstrap(m)

Model selection:

R> compare_distributions(m1, m2)

Simple interface that enables easy addition of new distributions (currently there are seven available distributions to fit)
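The design can be illustrated with a toy example in base R. This is a sketch only: the class names (`toy_distribution`, `toy_pl`, `toy_exp`), the generic `toy_pdf` and the field layout are hypothetical stand-ins, not the definitions used inside poweRlaw. It shows a parent reference class, two child classes, and one generic whose behaviour depends on the class of the object passed in.

```r
library(methods)

# Hypothetical parent class holding the fields every distribution shares.
toy_distribution <- setRefClass(
  "toy_distribution",
  fields = list(xmin = "numeric", pars = "numeric")
)

# Child classes: one per distribution, each inheriting the parent's fields.
toy_pl <- setRefClass("toy_pl", contains = "toy_distribution")
toy_exp <- setRefClass("toy_exp", contains = "toy_distribution")

# One generic, with a method per class.
setGeneric("toy_pdf", function(m, q) standardGeneric("toy_pdf"))

setMethod("toy_pdf", "toy_pl", function(m, q) {
  # Continuous power-law density for q >= xmin.
  (m$pars - 1) / m$xmin * (q / m$xmin)^(-m$pars)
})

setMethod("toy_pdf", "toy_exp", function(m, q) {
  # Exponential density shifted to start at xmin.
  m$pars * exp(-m$pars * (q - m$xmin))
})

m1 <- toy_pl$new(xmin = 1, pars = 2.5)
m2 <- toy_exp$new(xmin = 1, pars = 0.5)

# The same call works for both objects; dispatch picks the method.
toy_pdf(m1, 2)
toy_pdf(m2, 2)
```

Adding a new distribution in this scheme means defining one more child class and its methods; code written against the generic does not need to change.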

Reference classes

Reference classes behave like classes in C++, Python and many other languages, not like standard R classes

You can use these classes with ordinary R expressions and functions

An extension to core R (October, 2010)

Big difference - mutable state

Mutable states

R> displ = setRefClass("displ", fields = "xmin")

R> d1 = displ$new(xmin = 1)

R> d1$xmin

[1] 1

R> d2 = d1

R> d2$xmin = 100

R> d2$xmin

[1] 100

R> d1$xmin

[1] 100
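For contrast, here is a short sketch of how to get an independent object when shared state is not wanted. It uses the `copy()` method that reference class objects provide, and compares the result with an ordinary R list, which follows the usual copy-on-modify rules. The class name `displ_demo` is hypothetical and chosen to avoid clashing with the package's `displ`.

```r
library(methods)

displ_demo <- setRefClass("displ_demo", fields = list(xmin = "numeric"))

# copy() creates a separate object, so changing d2 leaves d1 alone.
d1 <- displ_demo$new(xmin = 1)
d2 <- d1$copy()
d2$xmin <- 100
d1$xmin
d2$xmin

# An ordinary list behaves like the copied object: assignment makes
# l2 independent of l1 as soon as l2 is modified.
l1 <- list(xmin = 1)
l2 <- l1
l2$xmin <- 100
l1$xmin
```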

Mutable states

When estimating $x_{\min}$, a naive implementation recomputes everything for each candidate value, which makes the calculation slow

Efficient caching speeds up calculations 100-fold

For example, using the call

R> m_pl$setXmin(10)

updates internal variables that make future calculations quicker

On creation of a distribution object, we make "multiple copies" of the data

R> x

R> cumsum(log(x))

Using reference classes avoids constant copying and speeds up calculations

R> pl_ref$xmin = 10  # reference class: the object is modified in place

R> pl_s4@xmin = 10   # S4 class: copy-on-modify semantics apply
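The caching idea can be sketched with a small reference class. This is a hedged illustration, not the poweRlaw internals: the class `cached_pl`, its fields and its methods `set_xmin` and `alpha_hat` are hypothetical, and the estimator is the continuous-case MLE. The data are sorted and `cumsum(log(x))` is computed once at creation; changing the cut-off then only updates a counter.

```r
library(methods)

# Hypothetical class for illustration; names are not poweRlaw's.
cached_pl <- setRefClass(
  "cached_pl",
  fields = list(dat = "numeric", log_cumsum = "numeric",
                xmin = "numeric", n_tail = "numeric"),
  methods = list(
    initialize = function(x = numeric(0), ...) {
      # Sort once (largest first) and cache the running sum of logs.
      sorted <- sort(x, decreasing = TRUE)
      dat <<- sorted
      log_cumsum <<- cumsum(log(sorted))
      callSuper(...)
    },
    set_xmin = function(value) {
      # Only the cut-off and the tail size change; nothing is copied.
      xmin <<- value
      n_tail <<- as.numeric(sum(dat >= value))
    },
    alpha_hat = function() {
      # The sum of log(x) over the tail is read from the cache.
      1 + n_tail / (log_cumsum[n_tail] - n_tail * log(xmin))
    }
  )
)

m <- cached_pl$new(x = c(1, 2, 3, 4, 5, 10, 20))
m$set_xmin(3)
m$alpha_hat()
```

Because the object is mutable, `set_xmin` can update the cached state in place; with copy-on-modify objects each update would return a new object that has to be reassigned.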

Comments

Reference classes are still new

Code has now broken twice with R upgrades

roxygen2 and reference classes didn't play well together

Very few questions on Stack Overflow on reference classes

Structuring code and files

Care has to be taken when using them with parallel computing

References

Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review 51.4 (2009): 661–703.

poweRlaw package


