Reference classes: a case study with the poweRlaw package
DESCRIPTION
Power-law distributions have been used extensively to characterise many disparate scenarios, inter alia, the sizes of moon craters and annual incomes. Recently, power laws have even been used to characterise terrorist attacks and interstate wars. However, for every correct characterisation that a particular process obeys a power law, there are many systems that have been incorrectly labelled as being scale-free. Part of the reason for incorrectly categorising systems as having power-law properties is the lack of easy-to-use software. The poweRlaw package aims to tackle this problem by allowing multiple heavy-tailed distributions to be fitted within a standard framework. Within this package, different distributions are represented using reference classes. This enables a consistent interface to be constructed for plotting and parameter inference. This talk will describe the advantages (and disadvantages) of using reference classes. In particular, it will show how reference classes can be leveraged to allow fast, efficient computation via parameter caching. The talk will also touch upon potential difficulties, such as combining reference classes with parallel computation.
TRANSCRIPT
Reference classes: a case study with the poweRlaw package
Colin Gillespie, Newcastle University, UK
http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/
The power law distribution
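Name         f(x)                                                              Notes
Power law    $x^{-\alpha}$                                                     Pareto distribution
Log-normal   $\frac{1}{x}\exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$
Exponential  $e^{-\lambda x}$
Power law    $x^{-\alpha}$                                                     Zeta distribution
Power law    $x^{-\alpha}$                                                     $x = 1, \ldots, n$; Zipf’s distribution
Yule         $\frac{\Gamma(x)}{\Gamma(x+\alpha)}$
Poisson      $\lambda^x / x!$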
Alleged power-law phenomena
The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville
The numbers of customers affected in electrical blackouts in the United States between 1984 and 2002
The number of links to web sites found in a 1997 web crawl of about 200 million web pages
The number of hits on web pages
The number of papers scientists write
The number of citations received by papers
Annual incomes
Sales of books, music; in fact anything that can be sold
Zipf plots
[Figure: six log-log panels (Blackouts, Fires, Flares, Moby Dick, Terrorism, Web links) plotting 1 − P(x) against x; x-axis from 10^0 to 10^6, y-axis from 10^-8 to 10^0.]
The power law distribution
The power-law distribution is
$p(x) \propto x^{-\alpha}$
where α, the scaling parameter, is constant
The scaling parameter typically lies in the range 2 < α < 3, although there are occasional exceptions
When α < 2, the mean and all higher moments are infinite
Typically, the entire process doesn’t obey a power law
Instead, the power law applies only for values greater than some minimum xmin
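The statement about infinite moments can be checked with a short derivation. The block below is a sketch that assumes the continuous case with the usual normalised density above xmin; the slide itself only gives the proportionality, so the normalising constant is an addition made here.

```latex
% Assumed normalised continuous power-law density (not stated on the slide)
\[
  p(x) = \frac{\alpha - 1}{x_{\min}}
         \left(\frac{x}{x_{\min}}\right)^{-\alpha},
  \qquad x \ge x_{\min},\ \alpha > 1 .
\]
% m-th moment: the integral converges only when m < \alpha - 1
\[
  \mathrm{E}[X^m]
  = \int_{x_{\min}}^{\infty} x^m \, p(x)\, dx
  = \frac{\alpha - 1}{\alpha - 1 - m}\, x_{\min}^{m},
  \qquad m < \alpha - 1 ,
\]
% and diverges for m \ge \alpha - 1.
```

Under this assumption the mean (m = 1) is finite only when α > 2 and the variance (m = 2) only when α > 3, so the typical range 2 < α < 3 corresponds to a finite mean with an infinite variance.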
Power law: PMF & CDF
For the discrete power law, the PMF is
$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$
where α > 1, xmin ≥ 1 and
$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$
is the generalised zeta function
When xmin = 1, ζ(α, 1) is the standard zeta function
[Figure: CDF plotted against x (0 to 50); legend: α from 1.50 to 2.50.]
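The discrete PMF above can be evaluated numerically in base R. The sketch below is not how the poweRlaw package computes it: `hurwitz_zeta` and `dpl_sketch` are hypothetical names, and the generalised zeta function is approximated by summing a finite number of terms and adding an integral estimate of the remainder.

```r
# Approximate generalised (Hurwitz) zeta function: sum the first n_terms
# exactly, then estimate the remaining tail with the leading
# Euler-Maclaurin terms (integral plus half of the first omitted term).
hurwitz_zeta <- function(alpha, xmin, n_terms = 1e5) {
  stopifnot(alpha > 1, xmin > 0)
  n <- 0:(n_terms - 1)
  partial <- sum((n + xmin)^(-alpha))
  a <- n_terms + xmin
  remainder <- a^(1 - alpha) / (alpha - 1) + 0.5 * a^(-alpha)
  partial + remainder
}

# Discrete power-law PMF: x^(-alpha) / zeta(alpha, xmin) for x >= xmin.
dpl_sketch <- function(x, alpha, xmin) {
  ifelse(x >= xmin, x^(-alpha) / hurwitz_zeta(alpha, xmin), 0)
}

# With xmin = 1 this is the standard zeta function, so zeta(2) = pi^2 / 6.
zeta_2 <- hurwitz_zeta(2, 1)

# Probabilities over a long stretch of the support should sum to almost 1.
probs <- dpl_sketch(7:1e5, alpha = 1.95, xmin = 7)
```

Comparing `zeta_2` with `pi^2 / 6` and checking `sum(probs)` gives a quick sanity check on the approximation.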
Fitting power laws
The main technique for fitting power laws comes from Clauset et al. (2009)
This paper gets around ten new citations a week
Estimating α given xmin is straightforward: just use the maximum likelihood estimator (MLE)
The lower cut-off, xmin, is estimated using a Kolmogorov-Smirnov approach
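The two fitting steps can be sketched in base R. This sketch uses the continuous power law, where the MLE for α has a closed form; the discrete case used for the Moby Dick data needs numerical optimisation instead. The function names (`fit_alpha`, `ks_distance`, `scan_xmin`) are hypothetical and are not the poweRlaw API, and the empirical CDF convention is one of several in use.

```r
# Closed-form MLE for alpha in the continuous case, given xmin:
# alpha = 1 + n / sum(log(x_i / xmin)) over the observations x_i >= xmin.
fit_alpha <- function(x, xmin) {
  x_tail <- x[x >= xmin]
  1 + length(x_tail) / sum(log(x_tail / xmin))
}

# Kolmogorov-Smirnov distance between the empirical CDF of the tail
# and the fitted power-law CDF, 1 - (x / xmin)^(1 - alpha).
ks_distance <- function(x, xmin) {
  x_tail <- sort(x[x >= xmin])
  n <- length(x_tail)
  alpha <- fit_alpha(x, xmin)
  fitted_cdf <- 1 - (x_tail / xmin)^(1 - alpha)
  empirical_cdf <- seq_len(n) / n
  max(abs(fitted_cdf - empirical_cdf))
}

# Try each candidate xmin and keep the one with the smallest KS distance.
scan_xmin <- function(x, candidates) {
  ks <- vapply(candidates, function(xm) ks_distance(x, xm), numeric(1))
  best <- which.min(ks)
  list(xmin = candidates[best],
       alpha = fit_alpha(x, candidates[best]),
       KS = ks[best])
}

# Simulated data: inverse-transform sample with alpha = 2.5 and xmin = 1.
set.seed(1)
x <- (1 - runif(5000))^(-1 / (2.5 - 1))
fit <- scan_xmin(x, candidates = c(1, 1.5, 2, 3))
```

On the simulated data, `fit_alpha(x, 1)` should land close to the true value of 2.5. Every candidate xmin re-scans the data here, which is the naive cost that caching is meant to avoid.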
The poweRlaw package
The package is available on CRAN and at
https://github.com/csgillespie/poweRlaw
Makes fitting power laws easy
Crucially, it makes fitting (to the tails of) the log-normal, exponential and Poisson distributions equally easy
Consistent interface between distributions
Estimate parameter uncertainty
Compare distributions (statistically and visually)
Case study: Moby Dick
R> m_pl = displ$new(moby)
R> plot(m_pl)
[Figure: log-log plot of the Moby Dick data; x-axis: Words (10^0 to 10^4), y-axis: CDF (10^-4 to 10^0).]
R> (est = estimate_xmin(m_pl))
$KS
[1] 0.009229
$xmin
[1] 7
$pars
[1] 1.95
attr(,"class")
[1] "estimate_xmin"
R> m_pl$setXmin(est)
R> lines(m_pl)
R> m_ln = dislnorm$new(moby)
R> est = estimate_xmin(m_ln)
R> m_ln$setXmin(est)
R> lines(m_ln)
[Figure: log-log plot of the Moby Dick data with the fitted lines added; x-axis: Words (10^0 to 10^4), y-axis: CDF (10^-4 to 10^0).]
Why use objects?
Each distribution is represented by an object:
  Parent class: distribution
  Power-law: displ, log-normal: dislnorm, . . .
Method dispatch on object class:
  dist_pdf(m) returns the probability density function based on the class of m
Consistent interface:
  Bootstrapping:
R> bootstrap(m)
  Model selection:
R> compare_distributions(m1, m2)
Simple interface that enables easy addition of new distributions (currently there are seven available distributions to fit)
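The parent-class and dispatch pattern described above can be sketched with the methods package alone. Every name here (`distribution_sketch`, `pl_sketch`, `exp_sketch`, `pdf_sketch`) is hypothetical and chosen to avoid suggesting it is the package's own code; the densities are the continuous forms shifted to start at xmin, used only as an illustration.

```r
library(methods)

# Hypothetical parent class holding what every distribution shares.
distribution_sketch <- setRefClass("distribution_sketch",
  fields = list(dat = "numeric", xmin = "numeric", pars = "numeric"))

# Two hypothetical child classes; they inherit all fields from the parent.
pl_sketch <- setRefClass("pl_sketch", contains = "distribution_sketch")
exp_sketch <- setRefClass("exp_sketch", contains = "distribution_sketch")

# One generic, with a method chosen by the class of the model object.
setGeneric("pdf_sketch", function(m, q) standardGeneric("pdf_sketch"))

setMethod("pdf_sketch", "pl_sketch", function(m, q) {
  (m$pars - 1) / m$xmin * (q / m$xmin)^(-m$pars)
})

setMethod("pdf_sketch", "exp_sketch", function(m, q) {
  m$pars * exp(-m$pars * (q - m$xmin))
})

m1 <- pl_sketch$new(dat = c(1, 2, 5), xmin = 1, pars = 2.5)
m2 <- exp_sketch$new(dat = c(1, 2, 5), xmin = 1, pars = 0.5)

# The same call works for both models.
d1 <- pdf_sketch(m1, 1)
d2 <- pdf_sketch(m2, 1)
```

Adding a new distribution in this sketch means defining one more child class and one more method; code that only calls the generic does not need to change.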
Reference classes
Reference classes behave like classes in C++, Python and many other languages, not like standard R classes
You can use these classes with ordinary R expressions and functions
An extension to core R (R 2.12.0, October 2010)
Big difference: mutable state
Mutable states
R> displ = setRefClass("displ", fields = "xmin")
R> d1 = displ$new(xmin = 1)
R> d1$xmin
[1] 1
R> d2 = d1
R> d2$xmin = 100
R> d2$xmin
[1] 100
R> d1$xmin
[1] 100
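For contrast, the sketch below puts the same assignment pattern next to an explicit `$copy()` and an ordinary R list. The class name `toy_dist` is hypothetical, chosen so that it does not clash with the package's `displ`.

```r
library(methods)

# Hypothetical toy class with a single numeric field.
toy_dist <- setRefClass("toy_dist", fields = list(xmin = "numeric"))

d1 <- toy_dist$new(xmin = 1)
d2 <- d1          # a second name for the same object
d3 <- d1$copy()   # an independent copy, made explicitly
d2$xmin <- 100

ref_shared <- d1$xmin   # changed through d2
ref_copied <- d3$xmin   # unaffected by the change

# An ordinary list follows R's usual copy-on-modify behaviour.
l1 <- list(xmin = 1)
l2 <- l1
l2$xmin <- 100
list_original <- l1$xmin   # still the original value
```

Plain assignment of a reference object never duplicates it, so `$copy()` is needed whenever an independent object is wanted.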
Mutable states
When estimating xmin, a naive implementation makes this calculation slow
Efficient caching speeds up calculations 100-fold
For example, the call
R> m_pl$setXmin(10)
updates internal variables that make future calculations quicker
On creation of a distribution object, we make "multiple copies" of the data
R> x
R> cumsum(log(x))
Using reference classes avoids constant copying and speeds up calculations
R> pl_ref$xmin = 10
R> pl_s4@xmin = 10
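The caching idea can be sketched as a small reference class. This is a hypothetical illustration, not the poweRlaw internals, which the slides do not show: the class `cached_pl`, its fields, and the methods `setData` and `alpha_mle` are all assumed names, and the estimator is the continuous closed form. The data are sorted and their cumulative log sums stored once, so that changing xmin only updates a few cached numbers.

```r
library(methods)

cached_pl <- setRefClass("cached_pl",
  fields = list(
    sorted_dat = "numeric",    # data in increasing order
    tail_log_sum = "numeric",  # tail_log_sum[i] = sum(log(sorted_dat[i:n]))
    xmin = "numeric",
    n_tail = "numeric",        # cached count of observations >= xmin
    log_sum = "numeric"        # cached sum of log(x) over x >= xmin
  ),
  methods = list(
    setData = function(x) {
      # Done once: sort the data and precompute the tail sums of logs.
      sorted_dat <<- sort(x)
      tail_log_sum <<- rev(cumsum(log(rev(sorted_dat))))
      invisible(.self)
    },
    setXmin = function(value) {
      stopifnot(value <= max(sorted_dat))
      # Index of the first observation >= value.
      first <- sum(sorted_dat < value) + 1
      # Update the object in place; no logs are recomputed.
      xmin <<- value
      n_tail <<- length(sorted_dat) - first + 1
      log_sum <<- tail_log_sum[first]
      invisible(.self)
    },
    alpha_mle = function() {
      # Continuous MLE written in terms of the cached quantities.
      1 + n_tail / (log_sum - n_tail * log(xmin))
    }
  ))

m <- cached_pl$new()
m$setData(c(1, 2, 3, 4, 5, 10))
m$setXmin(3)
alpha_cached <- m$alpha_mle()

# Direct calculation from the tail, for comparison.
alpha_direct <- 1 + 4 / sum(log(c(3, 4, 5, 10) / 3))
```

Because `setXmin` modifies the object in place, scanning many candidate values of xmin reuses the stored vectors rather than creating a modified copy of the whole object each time.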
Comments
Reference classes are still new
  Code has now broken twice with R upgrades
  roxygen2 and reference classes didn’t play well together
Very few questions on Stack Overflow about reference classes
Structuring code and files
Care has to be taken when using them with parallel computing
References
Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review 51.4 (2009): 661–703.
poweRlaw package
https://github.com/csgillespie/poweRlaw