effects of differing statistical methodologies on inferences about earth-like exoplanet populations


Upload: john-owens

Post on 17-Jan-2017


Effects of differing statistical methodologies on inferences about

Earth-like exoplanet populations

John Owens

Abstract

Sophisticated statistical methodologies have become vital to the study of exoplanet formation and detection.

As this field continues to grow and a wider range of methodologies is used, the question of the

appropriateness of different techniques becomes more pressing. The following comparisons between differing

methods of searching planet-transit catalogs to determine the planet occurrence rate, and their

corresponding results, show when different statistical methods are and are not appropriate.

I give explanations of each method, the assumptions each author made in their studies, and the differences

in the resulting values for the occurrence rate. It appears that using the inverse-detection-efficiency method

results in consistently higher occurrence rates than using the likelihood method.

1 Introduction and background

Exoplanets have been the subject of great study in the twentieth and twenty-first centuries. Several

studies have published various data about different features of exoplanetary systems, the most prevalent

being the occurrence rate of planets in these systems. The Kepler mission, launched in 2009 by the National

Aeronautics and Space Administration, provided astronomers with an extensive catalog of planets that

transit (cross in front of) their host stars (Petigura et al. 2013).

When planets transit, they dim the brightness of their star (the signal of the star is reduced when

observed). The magnitude of the change in visual brightness depends on the relative sizes of the star and

the planet. Since the size of the star should already be known, it is straightforward to determine the size

of the planet from this relation. Planets are most easily detectable when they are relatively large and have

small orbital periods (they orbit close to their host stars). In order to detect a planet using this method, the


orbit of the planet must be “edge-on” to an observer on Earth. In other words, the orbit must be positioned

such that an observer could view the planet crossing in front of the star. The probability of an edge-on orbit

is the ratio of the diameter of the star to the diameter of the orbit. This value is equal to about 0.5% for

an Earth-size planet orbiting a Sun-size star. When creating pipelines to detect planet transits from the

Kepler data, astronomers inject synthetic (false) planet signals into the light curves and count how many

the pipeline recovers. Combining this recovery fraction with the geometric probability of transit, one can

determine the efficiency with which a pipeline detects planets.
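
The two geometric relations described above (the dimming set by the relative sizes of planet and star, and the edge-on probability set by the ratio of stellar to orbital size) can be checked with a short calculation. This is a minimal sketch assuming circular orbits and rounded nominal values for the solar radius, Earth radius, and astronomical unit; the function names are illustrative and do not come from any of the studies discussed.

```python
# Rounded nominal values in kilometres (assumed for illustration).
R_SUN_KM = 695_700.0
R_EARTH_KM = 6_371.0
AU_KM = 149_597_870.7


def transit_depth(planet_radius_km, star_radius_km):
    """Fractional dimming during transit: ratio of the planet and star disk areas."""
    return (planet_radius_km / star_radius_km) ** 2


def transit_probability(star_radius_km, orbit_radius_km):
    """Geometric probability that a randomly oriented circular orbit is seen to transit."""
    return star_radius_km / orbit_radius_km


depth = transit_depth(R_EARTH_KM, R_SUN_KM)
prob = transit_probability(R_SUN_KM, AU_KM)
print(f"transit depth: {depth * 1e6:.0f} parts per million")
print(f"transit probability: {prob * 100:.2f}%")
```

The probability comes out just under half a percent, in line with the value of about 0.5% quoted above, and the depth of under one part in ten thousand shows why Earth-size planets are hard to detect.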

The occurrence rate is defined as the fraction of stars having a planet within a specified range of

parameters (Petigura et al. 2013). This value is significant because it allows us to predict the probability

of a system having a planet given certain parameters. One of the most intriguing sets of parameters that

can be used with occurrence rate data is those of our Solar System. When applied to these calculations, we

can predict the occurrence rate of Earth-like planets orbiting Sun-like stars which allows for further research

into potential habitability.

The goal of the Kepler mission is to detect transiting planets in a large sample of continuously

observed stars. The Kepler telescope maintains focus on a very large field of view (105 square degrees).

Kepler views the same field for at least 3.5 years, a period that can be extended to allow for the detection

of smaller and more distant planets. The transit data gathered by Kepler are used to determine planet

occurrence rates and other statistics. This information was found on NASA’s Kepler homepage.

Previous studies that use the Kepler data claim that the most common small planets are those

approaching Earth-size but that orbit close to their host stars. The studies I examine here extend the planet

survey to those that are Earth-size and orbit at a distance such that they receive a similar intensity of light

energy as Earth.

Many studies have been done to calculate the occurrence rate of planets both Earth-like and

otherwise, some of which will be the focus of this paper. I will compare several different methods used to

calculate the occurrence rate and the effects of different methods on resulting values. My primary points

of concern will be on key assumptions made during calculations that differ between authors. The term

statistical methodologies refers to the methods used to analyze or represent data. The usage of multiple

statistical methodologies to determine the planet occurrence rate poses a unique issue. In theory, many

methods may be used to arrive at the same result. The challenge, however, comes from the fact that no

two methodologies will produce the exact same result in practice for distributions of this complexity. In the


presence of noisy data produced by a complicated instrument and analysis, statistical inference offers more

choices of methods to determine the same statistics. The purpose of this paper is not to determine the “best”

methodology for determining the occurrence rate, but rather comment on the effects of the differences in

existing methodologies.

(Petigura et al. 2013) use their own TERRA software package to search for transiting planets in

the National Aeronautics and Space Administration’s (NASA’s) Kepler mission data. TERRA first accounts

for systematic error common to many stars, outliers in the data, and variability caused by long timescales.

The software then searches for planet transit signals by evaluating the signal-to-noise ratio at locations of

prospective transits in time. The signal-to-noise ratio compares the power of a desired signal to the power

of the background noise. It is significant to note that they correct for candidates missed by TERRA by

including the probability that the orbital plane of a planet would not be conducive to transit detection, and

by injecting “transit-like synthetic dimmings” into real Kepler photometry. Petigura et al. calculate their

occurrence rate by first creating a grid of planets sorted by orbital period, P, and planet radius, Rp. Both of

these values can be measured from the Kepler photometry. They then count the number of detected planets

in each cell and compute the planet occurrence as a function of P and Rp, as well as the distribution of planet sizes and orbital periods.

(Foreman-Mackey et al. 2014) use hierarchical Bayesian probabilistic inference to calculate

the occurrence rate density (see subsection 3.1 on Bayesian inference). They apply this method to the catalog

of Earth-like planet candidates determined by the TERRA pipeline. The steps they take begin with creating

a likelihood function and then calculating the detection efficiency in the same bins used by Petigura et al.

The likelihood function uses a set of model parameters to describe the probability of observing a specific

data set. Finally, they constrain the occurrence rate density of Earth-like planets by evaluating occurrence

rate at the location of Earth on the (P,Rp) grid.

(Dressing & Charbonneau 2015) perform similar calculations to Petigura et al. and Foreman-Mackey

et al. but limit their scope to small planets orbiting small stars. It could be argued that focusing on smaller

systems results in more accurate constraints for solar-like systems. However, computing the occurrence rate

in this setting is complicated by the fact that it is more difficult to measure the parameters of low-mass stars

than the parameters of Sun-like stars. Dressing et al.’s 2015 paper is written to combine improved methods

of calculating the occurrence rate that were published after their 2013 paper (Dressing & Charbonneau 2013).

Thus, they make different assumptions than Petigura et al., Foreman-Mackey et al., and their own from their

previous paper. Their calculations of the occurrence rate generally resemble those used by (Foreman-Mackey


et al. 2014).

I will examine the calculations of the occurrence rate performed by Petigura et al., Foreman-Mackey

et al., and Dressing et al. through several lenses:

1. The dependence on orbital period, planet radius, and habitable zone.

2. The dependence of the results on the inverse-detection efficiency method versus the likelihood function

method.

3. Differences in survey completeness.

4. Effects of differing assumptions made by the authors.

5. Possible bias from differing intents of the papers.

6. Effects on other statistics in these studies.

7. Possible ramifications for future studies.

It is first necessary to explain the differences in the calculated occurrence rates and their depen-

dencies on certain physical parameters. These provide context for the hypothetical explanations for their

differences to come later.

2 Calculated occurrence rates and dependencies

Each author provides their calculated occurrence rate in a different form. Petigura et al. give several

occurrence rates for different domains of Rp and P, focusing primarily on Earth-size planets.

Foreman-Mackey et al. give a single value for “Earth analogs”. The data they gather comes from directly applying

their inference method to the catalog of small exoplanet candidates orbiting Sun-like stars published by

Petigura et al. They define Earth analogs as planet candidates that have very similar orbital periods and radii

to that of Earth (Foreman-Mackey et al. 2014). The value that they give is for what they term “occurrence

rate density”, not simply the occurrence rate. Dressing et al. present detailed tables of occurrence rates for

various ranges of orbital period, planet radius, and insolation, as well as values in the habitable zone (HZ).

Overall, Petigura et al. find that 26±3% of Sun-like stars are orbited by an Earth-size planet (with

a planet radius 1 − 2 R⊕) with P = 5 − 100 days (Petigura et al. 2013). Figure 1 shows the distribution

of planets in bins of log period and log radius. The respective occurrence rates for each bin are indicated


as well. Graphically, it appears that the binned occurrence rate depends significantly on log radius. Data

about multiple-planet systems are not represented in their study. In their TERRA pipeline, no Earth-size

planets were detected that had Earth-like periods (P = 200−400 days). However, the radii of three planets

did exhibit 1σ confidence intervals that extend into the domain (P = 200−400 d, Rp = 1−2 R⊕). This fact

leads to the assumption that the existence of an Earth-size planet with an Earth-like period is still plausible,

although none has yet been detected. Petigura et al. estimate the occurrence rate of these 1−2 R⊕ planets

with periods of 200−400 d by extrapolating the overall planet occurrence with P. This yields a $5.7^{+1.7}_{-2.2}\%$

occurrence of Earth-size (1−2 R⊕) planets with periods of 200−400 d.

The term “rate density” is used by Foreman-Mackey et al. to indicate the integrand over a finite

bin in period and radius that results in a rate. A rate density attempts to correct for differences in surveys

that go to different depths. Foreman-Mackey et al. find that the occurrence rate density for Earth analogs

is $0.019^{+0.019}_{-0.010}\ \mathrm{nat}^{-2}$, per natural logarithmic period per natural logarithmic radius (Foreman-Mackey et al.

2014). For reference, the authors also converted Petigura et al.’s results to these units, yielding a value of

$0.119^{+0.046}_{-0.035}\ \mathrm{nat}^{-2}$. They note several features of the data. The period distribution is not consistent between

large (R > 8 R⊕) planets and small planets. Foreman-Mackey et al. do not graphically show their data

in binned form, but Figure 2 does show the log radius-log period relationship in their data. As is the case

with the data presented by Petigura et al., the occurrence rate appears to depend more on planet size than

orbital period. The radius distribution, however, is qualitatively consistent between large and small planets.

Potential features near R ∼ 3 R⊕ and R ∼ 10 R⊕ are noted. Foreman-Mackey et al. claim that their results

are “completely inconsistent” with the results of Petigura et al., despite being based on the same data set.
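
The conversion from an occurrence rate to a rate density mentioned above can be illustrated with a small calculation. This sketch assumes that the rate is simply divided by the area of the bin in natural-log period and natural-log radius (that is, that the density is constant across the bin); the exact procedure used by Foreman-Mackey et al. is not described here, and the function name is illustrative.

```python
import math


def rate_to_density(rate, p_min, p_max, r_min, r_max):
    """Mean rate density over a bin, per ln(period) per ln(radius), in nat^-2."""
    bin_area = math.log(p_max / p_min) * math.log(r_max / r_min)
    return rate / bin_area


# Central value reported by Petigura et al.: 5.7% for P = 200-400 d, Rp = 1-2 Earth radii.
density = rate_to_density(0.057, 200.0, 400.0, 1.0, 2.0)
print(f"{density:.3f} nat^-2")
```

Under this assumption the central value comes out at about 0.119 nat^-2, which agrees with the converted figure quoted above. The sketch does not attempt to propagate the uncertainties.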

Comparing the results calculated by Foreman-Mackey et al. and those of Petigura et al. (converted

to values in occurrence rate density), it is clear that the value determined by Foreman-Mackey et al.

is considerably lower than the value determined by Petigura et al. It is also interesting to note that

the error bars of each value do not overlap. The value published by Foreman-Mackey et al. exhibits much

larger margins of error than that by Petigura et al. (almost twice as large in a log scale). This may be

attributed to the fact that Foreman-Mackey et al. consider the observational uncertainties on the physical

parameters non-negligible, thus increasing the propagation of error for the final value.

Dressing et al. find a cumulative occurrence rate of 2.5 ± 0.2 planets (R = 1−4 R⊕, P < 200 d)

per M dwarf (Dressing & Charbonneau 2015). They also give occurrence rates for various ranges of planet size

and period. For planets with periods such that P < 50 days, the planet occurrence rate decreases as planet


radius increases from 1 R⊕ to 4 R⊕. Overall, the occurrence rate increases with orbital period between

0 and 200 days. Dressing et al. also perform an in-depth analysis of how the occurrence rate varies for planets in the

habitable zone. Under conservative assumptions for the range of the habitable zone,

the occurrence rate is estimated as $0.16^{+0.17}_{-0.07}$ (potentially) habitable planets (1−1.5 R⊕) per M dwarf.

According to their estimates, it could be suggested that habitable zone planets are more common around

lower-mass stars. Figure 3 shows Dressing et al.’s distribution of planets in binned form. It is prudent to

note that the binned occurrence values given by Dressing et al. appear to be consistently greater than the

same values given by Petigura et al.

Table 1. Conservative occurrence rates from different studies

Study                        Occurrence rate                              Occurrence rate density
Petigura et al. 2013         $5.7^{+1.7}_{-2.2}\%$                        $0.119^{+0.046}_{-0.035}$ nat$^{-2}$
Foreman-Mackey et al. 2014   -                                            $0.019^{+0.019}_{-0.010}$ nat$^{-2}$
Dressing et al. 2015         $0.16^{+0.17}_{-0.07}$ planets per M dwarf   -

Figure 1: Planet occurrence as a function of planet radius and orbital period. The distribution is organized into bins of log radius and log period. Red dots represent detected planets. Each bin displays the occurrence rate in that bin and is colored according to that occurrence rate. The bulk of the planets are distributed between 1−4 R⊕ and 10−200 days. Figure taken from (Petigura et al. 2013).


Figure 2: Center: The points represent detected planets. The contours represent the completeness function. The grayscale represents the occurrence rate density. The points are organized in bins of log radius and log period. The density of the planet distribution is high between 0.0−1.5 ln R/R⊕ and 2.0−4.0 ln P/day. Top and right: Histograms of the inferred rate density using the likelihood method. The points with error bars are the results of the inverse-detection-efficiency method. Figure taken from (Foreman-Mackey et al. 2014).

3 Inverse-detection efficiency vs. likelihood methods

Petigura et al. calculate the distribution of exoplanets in their survey by a process termed “inverse-

detection efficiency” (Petigura et al. 2013). In this process, they inject fake planet transit signals into the

light curves of Sun-like stars and recover them using TERRA. The fraction of injected signals recovered in each bin

of radius and period gives the detection efficiency, and they weight the number of detected planets in each bin by the inverse of that efficiency.
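
The weighting just described can be sketched for a single bin as follows. This is a simplified illustration rather than the TERRA implementation: it assumes one representative transit probability per bin (the published analysis may instead weight planet by planet), and every number in the example is hypothetical.

```python
def idem_occurrence(n_detected, n_injected, n_recovered, transit_prob, n_stars):
    """Occurrence rate in one (period, radius) bin via inverse-detection-efficiency weighting."""
    # Pipeline detection efficiency: fraction of injected signals that were recovered.
    efficiency = n_recovered / n_injected
    # Each detected planet stands in for 1 / (efficiency * transit_prob) real planets.
    n_corrected = n_detected / (efficiency * transit_prob)
    return n_corrected / n_stars


# Hypothetical bin: 8 detections, 400 of 1000 injections recovered,
# 2% geometric transit probability, 40000 stars searched.
rate = idem_occurrence(8, 1000, 400, 0.02, 40_000)
print(f"occurrence rate in this bin: {rate:.3f}")
```

Because the correction is a division by the efficiency, bins with very low efficiency and few detections give large and noisy estimates, which is part of the motivation for the likelihood approach discussed next.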

Foreman-Mackey et al. argue that the “likelihood method” is superior to the inverse-detection

efficiency method used by Petigura et al. They use the same catalog of measurements published by Petigura

et al. but treat it as a draw from an inhomogeneous Poisson process set by the observable rate density. A

Poisson process is a model for distributions of points that are randomly located in space. An inhomogeneous


Figure 3: Left: Planet occurrence rate binned by orbital period and planet radius. The occurrence rate is listed as a percentage at the top of each bin. The percentage of injected planets recovered by the pipeline is listed at the bottom of each bin. The occurrence rate is highest between 0.5−2.5 R⊕ and 4−100 days. Right: Planet occurrence rate binned by planet insolation and radius. The occurrence rate and injection recovery rate are listed the same as left. The occurrence rate is highest between 0.5−2.5 R⊕ and 12−1 F⊕. Figure taken from (Dressing & Charbonneau 2015).

Poisson process has a rate that is itself a function of location. The resulting likelihood

contains an exponential of the integral of the observable rate density over the set of physical

parameters contained in the catalog data. Since the observable rate density is the product of the detection

efficiency and the true occurrence rate density (both dependent on the same physical parameters), they

can infer the true occurrence rate density. They then model the rate density as a piecewise constant step function

and derive the analytic maximum likelihood solution for the step heights. According to (Foreman-Mackey

et al. 2014), this method is “guaranteed to provide a lower variance estimate of the rate density than the

standard procedure”.
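
The analytic maximum likelihood solution mentioned above can be written out in a few lines. The derivation below is a sketch in notation chosen here for illustration (w for the pair of log period and log radius, Q for the detection efficiency, Γ for the true rate density, θ_j for the step height in bin Δ_j, and N_j for the number of catalog planets in that bin), and it ignores observational uncertainties on the planet parameters.

```latex
% Illustrative notation (assumed, not quoted from the paper):
%   w = (\ln P, \ln R_p), Q(w) = detection efficiency,
%   \Gamma_\theta(w) = true rate density, equal to \theta_j inside bin \Delta_j,
%   N_j = number of catalog planets in bin \Delta_j.
\begin{align}
  \ln p(\{w_k\} \mid \theta)
    &= -\int Q(w)\,\Gamma_\theta(w)\,\mathrm{d}w
       + \sum_{k=1}^{K} \ln\!\left[ Q(w_k)\,\Gamma_\theta(w_k) \right] \\
    &= \sum_{j} \left[ N_j \ln \theta_j
       - \theta_j \int_{\Delta_j} Q(w)\,\mathrm{d}w \right] + \text{const}, \\
  \frac{\partial \ln p}{\partial \theta_j} = 0
    &\;\Longrightarrow\;
  \hat{\theta}_j = \frac{N_j}{\int_{\Delta_j} Q(w)\,\mathrm{d}w}.
\end{align}
```

In this form each step height is the number of planets in the bin divided by the detection efficiency integrated over the bin, rather than a sum of inverse efficiencies evaluated planet by planet.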

Dressing et al. use a method that is very similar to the inverse-detection-efficiency method used by Petigura

et al., but involves using their own custom pipeline. They use the Kepler catalog as their source and search

the light curves of the stars in that catalog for exoplanets (Dressing & Charbonneau 2015). Their pipeline

searches each light curve sequentially so that it can pick up multiple planets per star, if warranted by the

data. This is an improvement over the methods used by Petigura et al. and Foreman-Mackey et al. because

their pipelines can only detect whether or not any planet transits a given star, not if multiple do. If a star in

the catalog is detected to have a transiting planet, the signal is sent through a pipeline that vets it for being

a true exoplanet or a false positive on several criteria. These include whether or not the signal is associated

with a known spacecraft or stellar activity and if the signal is the result of harmonics from a different signal.

Dressing et al. then conduct a Bayesian Markov Chain Monte Carlo analysis to fit a curve to the exoplanet

candidates determined by their pipeline.


3.1 Bayesian inference

Bayesian inference is the derivation of a posterior probability resulting from a prior probability and

a likelihood function. The prior probability (or prior) is the probability distribution of an uncertain quantity

that expresses one’s beliefs about a quantity before relevant evidence is taken into account. The prior can be

created using past information, subjective assessment of data, or relevant principles. The likelihood function

represents the probability of an observed outcome given a certain set of parameter values. Thus, the posterior

probability is the conditional probability of a random event after relevant evidence is taken into account.

Bayesian inference follows Bayes’ theorem to compute the posterior probability:

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}, \tag{1}$$

where H is the hypothesis, E is new data (evidence) that were not used to compute the prior probability,

P (H) is the prior probability, P (H|E) is the posterior probability, and P (E|H) is the likelihood function. In

practice, Bayes’ theorem can be applied iteratively: after applying the theorem to some observed evidence,

the resulting posterior probability can be treated as a prior used in the computation of a new posterior

probability from new evidence.
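
The iterative use of Bayes' theorem described above can be demonstrated for a single binary hypothesis. This is a minimal sketch with hypothetical probabilities; it assumes the two pieces of evidence are independent given the hypothesis, and it expands P(E) as the sum over H and not-H.

```python
def bayes_update(prior, like_h, like_not_h):
    """Posterior P(H|E) from the prior P(H) and the likelihoods P(E|H), P(E|not H)."""
    evidence = like_h * prior + like_not_h * (1.0 - prior)  # P(E)
    return like_h * prior / evidence


# Hypothetical values: prior 0.5, and two observations that each have
# P(E|H) = 0.8 and P(E|not H) = 0.3.
sequential = 0.5
for _ in range(2):
    # The posterior from one observation becomes the prior for the next.
    sequential = bayes_update(sequential, 0.8, 0.3)

# Treating both observations as one joint piece of evidence gives the same answer.
batch = bayes_update(0.5, 0.8 * 0.8, 0.3 * 0.3)
print(f"sequential: {sequential:.4f}, batch: {batch:.4f}")
```

The agreement between the two routes is what makes it legitimate to treat an earlier posterior as a new prior.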

Monte Carlo methods are computational algorithms that use repeated random sampling to obtain

numerical results. In principle, they can be used to solve any problem that has a probabilistic interpretation.

Markov Chain Monte Carlo is particularly useful when the probability distribution of a variable is

parameterized. The method takes a random walk through the parameter space, and the value at each step is

counted towards an estimate of an integral. The Markov chain is constructed so that the target distribution is its equilibrium distribution.
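
The random walk described above can be illustrated with a basic Metropolis sampler, one of the simplest Markov Chain Monte Carlo algorithms. This is a generic sketch and not the sampler used by Dressing et al.; the target (a standard normal distribution), the step size, and the number of steps are assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_target, x0, n_steps, step_size, seed=0):
    """Draw samples whose equilibrium distribution is exp(log_target)."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # symmetric random-walk proposal
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p_new / p_old); otherwise stay put.
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


def log_std_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


samples = metropolis(log_std_normal, 0.0, 50_000, 1.0)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean: {mean:.3f}, sample variance: {var:.3f}")
```

Averages over the samples approximate integrals over the target distribution, so the sample mean and variance should approach 0 and 1 as the chain grows.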

4 Survey completeness

Survey completeness is, in essence, a measure of how accurately collected data represents the actual

distribution of sources. For instance, if one were to create a pipeline to detect planets transiting stars over

a given area of the sky, the completeness would be a measure of how many transiting planets the pipeline

detected versus how many transiting planets actually exist in that area. This value is significant because it

represents the reliability of a pipeline to recover signals from a catalog. It is also used in the calculation of

final statistics of the data: in this case, the occurrence rate. In the surveys covered in this paper, the survey


completeness is measured in small bins of orbital period P and planet radius R. Several trends were noted

by each author.

(Petigura et al. 2013) note that survey completeness is a complicated function of period and radius.

In general, as P increases and R decreases, the completeness decreases. This makes logical

sense, as planets with small radii may be rejected as noise in the pipeline (and possibly go undetected by

the Kepler survey), and planets with sufficiently long periods may not transit often enough during the course

of the survey to be detected. Petigura et al. also state that using noise models to determine the completeness function

instead of the injection and recovery method would not be recommended because these models would only

be able to determine relative completeness, when absolute completeness is the significant value.

(Foreman-Mackey et al. 2014) use the same catalog generated by Petigura et al. to recalculate

the completeness function. They use the injected signal samples from this catalog to calculate the detection

efficiency in bins of log period and log radius. This detection efficiency is just the fraction of recovered

injection signals per bin. The authors state that, in a more certain survey, the best way to calculate the

survey completeness would be in terms of radius ratio or signal-to-noise. Since the radius uncertainties

are dominated by uncertainties in the stellar parameters, however, it is impossible to compute constraints

on radius ratios, and the best method available is to determine the completeness using period and radius. The next step

is to determine the geometric transit probability in the period-radius plane. This distribution scales only

with the period. Foreman-Mackey et al. assume that all planets in the catalog are on circular orbits. For

simplicity, this assumption is necessary in the calculations of both Petigura et al. and Foreman-Mackey et

al. However, a study by (Kipping 2014) showed that, when eccentric orbits are included, Foreman-Mackey

et al. underestimate the transit probability by 10%. This effect propagates to the occurrence rate density

as well.
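
The per-bin detection efficiency described above (the fraction of injected signals recovered in each bin) can be sketched as follows. The bin edges and the injection records are hypothetical, and the function names are illustrative; the sketch uses only the Python standard library.

```python
from collections import defaultdict


def bin_index(value, edges):
    """Index of the bin containing value, or None if it falls outside the grid."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None


def completeness_grid(injections, period_edges, radius_edges):
    """Fraction of injected signals recovered in each (period, radius) bin.

    injections: iterable of (period_days, radius_earth, recovered) tuples.
    """
    injected = defaultdict(int)
    recovered = defaultdict(int)
    for period, radius, was_recovered in injections:
        key = (bin_index(period, period_edges), bin_index(radius, radius_edges))
        if None in key:
            continue  # injection lies outside the grid
        injected[key] += 1
        recovered[key] += int(was_recovered)
    return {key: recovered[key] / injected[key] for key in injected}


# Hypothetical grid (equal steps in log period and log radius) and injections.
period_edges = [50.0, 100.0, 200.0, 400.0]
radius_edges = [1.0, 2.0, 4.0]
injections = [
    (60.0, 1.5, True),
    (80.0, 1.2, False),
    (150.0, 3.0, True),
    (300.0, 1.1, False),
    (300.0, 1.8, False),
    (500.0, 1.5, True),  # outside the period range, ignored
]
grid = completeness_grid(injections, period_edges, radius_edges)
for key in sorted(grid):
    print(key, grid[key])
```

In a real analysis each bin would contain many injections; with only a handful per bin, as here, the recovered fraction is itself very uncertain.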

(Dressing & Charbonneau 2015) present a much less detailed explanation of their methods for determining

survey completeness. Their method requires using “the detectability of a particular transiting planet

and the likelihood that a particular planet will be observed to transit”. They calculate the geometric prob-

ability of planet transit to determine the dependence on the second factor above. They do this for planets

at particular periods or insolation levels. Examining the dependence of the geometric transit probability on

insolation is beyond the scope of this paper. It is unclear why Dressing et al. do not include planet radius in

their calculations here, but I speculate that it is for similar reasons as to why (Foreman-Mackey et al. 2014)

find that the distribution of the geometric transit probability in the period-radius plane scales only with the


period. They then divide the stellar radii by semimajor axes to determine the transit probability for a planet

on a circular orbit. Dressing et al. take one additional step that is not taken by Foreman-Mackey et al. or

Petigura et al. They incorporate the eccentricity correction factor presented by (Kipping 2014) to present a

more realistic distribution of planets.
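
The calculation described above (stellar radius divided by semimajor axis, followed by a correction for eccentricity) can be sketched as follows. The sketch assumes Kepler's third law in solar units to obtain the semimajor axis from the period, and it uses the standard result that averaging over orbit orientations raises the transit probability by a factor of 1/(1 − e²) for a single eccentricity e. The correction of (Kipping 2014) comes from averaging over a distribution of eccentricities, which is not reproduced here; the value e = 0.3 is purely illustrative.

```python
R_SUN_AU = 0.00465047  # nominal solar radius in astronomical units


def semimajor_axis_au(period_days, star_mass_msun):
    """Kepler's third law in solar units: a^3 = M * P^2 (a in AU, P in years)."""
    period_years = period_days / 365.25
    return (star_mass_msun * period_years ** 2) ** (1.0 / 3.0)


def transit_probability(period_days, star_radius_rsun, star_mass_msun, eccentricity=0.0):
    """Geometric transit probability, averaged over orbit orientation."""
    a_au = semimajor_axis_au(period_days, star_mass_msun)
    circular = star_radius_rsun * R_SUN_AU / a_au
    return circular / (1.0 - eccentricity ** 2)


p_circ = transit_probability(365.25, 1.0, 1.0)
p_ecc = transit_probability(365.25, 1.0, 1.0, eccentricity=0.3)
print(f"circular: {p_circ:.5f}, e = 0.3: {p_ecc:.5f}, ratio: {p_ecc / p_circ:.3f}")
```

For this illustrative eccentricity the probability rises by roughly ten percent relative to the circular case, which shows the direction of the effect: neglecting eccentricity underestimates the transit probability.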

5 Effects of differing assumptions

Each author selects a slightly different set of assumptions when deriving the planet occurrence

rate. The differences between these can have both drastic and insignificant effects on the results of their

calculations. I will present the assumptions used by each author, and then compare their effects.

(Petigura et al. 2013) make the following assumptions. They assume that the planet occurrence

rate is flat in log period, and that the occurrence is uniform per log period interval. The assumptions

regarding the flatness and uniformity of the occurrence rate with log period are necessary to be able to

fit a reasonable relation between the occurrence rate and bins of period and planet radius. One significant

assumption made by the authors is that the orbits of transiting planets are circular. Planets on eccentric orbits

have a somewhat higher probability of being observed to transit than those on circular orbits (Kipping 2014). Thus, the

assumed transit probability is too low, and the inferred occurrence rate is higher than it would be if orbital eccentricities were taken into account.

Additionally, it should be noted that Petigura et al.’s final results are based on extrapolations from the data

(in other words, they use a fitted model to derive some of the data that go into their final calculations).

This would likely be less accurate than performing a direct calculation for the result on data that are already

in the domain of concern. However, since further observation would be needed to obtain data in that

domain, their approach makes the best use of the data available.

Petigura et al. state that alternative definitions of the properties of Earth analogs and the domain

of the HZ may be adopted. They provide several estimates of the occurrence rate they calculated based on

different published definitions of the HZ.

(Foreman-Mackey et al. 2014) make several “strong” assumptions throughout their paper, but

they argue that they are weaker than the implicit assumptions made in previous studies. The assumptions

as stated by Foreman-Mackey et al. are:

1. “Candidates in the catalog are independent draws from an inhomogeneous Poisson process set by the

censored occurrence rate density.”


2. “Every candidate is a real exoplanet (there are no false positives).”

3. “The observational uncertainties on the physical parameters are non-negligible but known (the catalog

provides probabilistic constraints on the parameters).”

4. “The detection efficiency of the pipeline is known.”

5. “The true occurrence rate density is smooth.”

Foreman-Mackey et al. give a summary of the effects of their assumptions, as well. Their first

assumption states that each planet drawn from the catalog is independent of all the other planets in the

catalog. The reasoning behind this assumption is that since the data set they consider only includes systems

with single planets, no planet in the system will affect the parameters of another. The second assumption

states that all candidates in the catalog are real. However, other studies (Morton & Johnson 2011; Fressin

et al. 2013) demonstrate that the false positive rate in the Kepler catalog is non-negligible. Running

their calculations again while taking into account possible false positives would likely decrease the planet

occurrence rate found by Foreman-Mackey et al. Similar to Petigura et al., the authors here neglect orbital

eccentricities (although they do comment on why this skews their results).

(Dressing & Charbonneau 2015) make very few explicit assumptions in their calculations. Like

Petigura et al., they assume that the planet occurrence rate is flat in log period. Like Foreman-Mackey et

al., they assume that there are no false positives in the catalog in their original 2013 paper. In their 2015

paper, however, Dressing & Charbonneau apply a correction for false positives. It appears that of the studies I am

examining, this is the only one to make this correction. Unlike Petigura et al. and Foreman-Mackey et al.,

Dressing & Charbonneau factor in orbital eccentricities, using the correction presented by Kipping (2014).

Petigura et al. and Dressing & Charbonneau assume that the distribution of planets (and thus the planet

occurrence rate) is flat within each bin of log radius. Foreman-Mackey et al. (2014) relax this assumption and only

assume that the occurrence rate density is a smooth function of period and radius. This allows them more

freedom to extrapolate their data.

6 Bias from paper intent

Each previous study has a slightly different overall intent, which could cause some bias in their

treatment of their calculations. For instance, if the primary objective of a study is to publish a catalog of

data, the author(s) may make stronger assumptions or make much more general claims about the trends in

their data than a study that was focused on analyzing such a catalog.

The intent of the study by Petigura et al. is two-fold, but is primarily focused on one goal. The

bulk of the work done by the authors goes into creating their TERRA software package to identify Earth-size

planets in the Kepler data and create a catalog of these planets. Additionally, they present an analysis of

their data (they present occurrence rates for different sets of parameters). Since the primary focus of the

study is on creating the catalog of planets, the authors may make some of their more substantial assumptions

when calculating the occurrence rate.

Foreman-Mackey et al. focus on calculating the occurrence rate density using the catalog of planet

candidates published by Petigura et al. (2013). Because their calculations only depend on the catalog data

and not the inferred statistics published by Petigura et al., it is likely that there is little bias present in

the calculations of Foreman-Mackey et al. The intent of the study by Foreman-Mackey et al. is to find the

occurrence rate density of planets in the catalog published by Petigura et al. using methods that require

fewer assumptions than previous studies.

The intent of the paper by Dressing & Charbonneau is more complicated than that of the papers by Petigura

et al. and Foreman-Mackey et al. because it is, in essence, a revisitation of their 2013 study. Their intent,

explicitly stated (Dressing & Charbonneau 2015), is to “implement the following improvements to refine our

2013 estimate of the frequency of small planets around small stars.

1. We use the full Q0-Q17 Kepler data set.

2. We utilize archival spectroscopic and photometric observations to refine the stellar sample.

3. We explicitly measure the pipeline completeness.

4. We inspect follow-up observations of planet host stars to properly account for transit depth dilution

due to light from nearby stars.

5. We apply a correction for false positives in the planet candidate sample.

6. We incorporate a more sophisticated treatment of the HZ.”

Since the intent of this paper is to refine the methods utilized in their 2013 paper, it seems prudent to

examine the intent of the original paper. It appears that the goal of the 2013 study is to create a catalog of

transiting exoplanets from the Kepler data set using their own pipeline, as well as estimate the occurrence

rate of planets with short orbits (P < 50 days) around cool stars (Dressing & Charbonneau 2013). This intent

is similar to that of Petigura et al., so it is likely that any bias present in that paper would also be present in

Dressing & Charbonneau’s 2013 paper.

7 Effects on other statistics

While the most significant statistic treated in these studies is the occurrence rate of small planets

around small stars, several others are presented as products of the varying methodologies used by the authors.

In some cases, these other statistics are used in the calculation of the value of the occurrence rate. In others,

they are used as parameters to constrain the domain of the occurrence rate. Petigura et al. use their

TERRA software to determine the intensity of stellar light energy received by a planet from its parent star.

To determine this value, they use the standard formula for stellar light flux, $F_p = L_\star / (4\pi a^2)$, where $F_p$ is the

flux received by the planet, $L_\star$ is the luminosity of the star, and $a$ is the planet-star separation (Petigura

et al. 2013). The luminosity of the star is computed as $L_\star = 4\pi R_\star^2 \sigma T_{\rm eff}^4$, where $\sigma$ is the Stefan-Boltzmann

constant. In this case, Petigura et al. use the fluxes determined for each star to constrain the domain of

their planet occurrence rate. They determine the occurrence rate for planets in the domain 1 − 2 R⊕ and

1 − 4 F⊕, where F⊕ is the flux received by Earth from the Sun. The flux and luminosity data obtained

by Petigura et al. are likely not tainted by strong assumptions because they could come from the Kepler light

curves before TERRA was run to extract the exoplanet signals present in them.
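The flux and luminosity relations used by Petigura et al. can be checked numerically. The following is a minimal sketch, not the TERRA implementation: the function names and the nominal solar values are my own assumptions, chosen only to express the flux received by a planet in units of F⊕.

```python
import math

SIGMA_SB = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

# Nominal solar values, assumed here for illustration only
R_SUN = 6.957e8          # solar radius, m
T_SUN = 5772.0           # solar effective temperature, K
AU = 1.495978707e11      # Earth-Sun separation, m


def stellar_luminosity(radius_m, t_eff_k):
    """Luminosity of a star, L = 4 pi R^2 sigma T_eff^4, in watts."""
    return 4.0 * math.pi * radius_m ** 2 * SIGMA_SB * t_eff_k ** 4


def flux_at_planet(luminosity_w, separation_m):
    """Flux received by a planet, F_p = L / (4 pi a^2), in W m^-2."""
    return luminosity_w / (4.0 * math.pi * separation_m ** 2)


# Flux received by Earth from the Sun, used as the unit F_earth
F_EARTH = flux_at_planet(stellar_luminosity(R_SUN, T_SUN), AU)


def flux_in_earth_units(radius_m, t_eff_k, separation_m):
    """Flux received by a planet, expressed as a multiple of F_earth."""
    flux = flux_at_planet(stellar_luminosity(radius_m, t_eff_k), separation_m)
    return flux / F_EARTH
```

With these helpers, a planet falls in the 1 − 4 F⊕ domain when `flux_in_earth_units` returns a value between 1 and 4; for a Sun-like star this corresponds to separations between 0.5 and 1 AU.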

Since the study by Foreman-Mackey et al. applies the likelihood method, the nature of all of

the statistics presented by those authors is probabilistic. In other words, statistics that would have been

considered certain in the work of Petigura et al. are treated as uncertain by Foreman-Mackey et al. For

instance, Foreman-Mackey et al. use the probability of a transiting planet from the Kepler data set and

the inferred rate density to find the number of Earth-like planets transiting Sun-like stars in the catalog

published by Petigura et al. The number that they found was $10.6^{+5.9}_{-4.5}$ (Foreman-Mackey et al. 2014).

The uncertainties on this value are only on the expectation value and do not include the Poisson sampling

variance, and thus leave much room for improving the precision of this value. This can be accomplished by

extending the search pipelines to small planets with long periods.

The most significant statistic presented by Dressing & Charbonneau that I have not discussed yet is the

insolation of planets. The insolation is, essentially, a measurement of the flux received by a planet from its

host star. Dressing & Charbonneau present insolation as a constraint on the domain of the occurrence rate. In other

words, they use insolation as a definition of the habitable zone (the range around a star in which a planetary

surface can support liquid water under sufficient atmospheric pressure). According to the authors, the errors

on the insolation are large enough to produce a smooth distribution. While this is helpful in creating a

distribution which can be used to more precisely constrain the occurrence rate, large errors generally mean

that any given measurement inferred from the distribution would not be accurate.

8 Ramifications for future studies

Astronomy is a field built on collaboration. The nature of these types of studies is to build on

previous work and either develop new methods or sharpen methods already proposed. This is apparent in

the studies discussed in this paper. Petigura et al. lay much of the groundwork for the material discussed,

while Foreman-Mackey et al. propose an alternate method to discern the same statistics. Dressing & Charbonneau,

in turn, propose improvements on the technique published in their 2013 paper.

Future studies will almost certainly follow this same pattern. It is likely that after much further

study, a single superior method of determining occurrence rates will be decided upon. In the meantime,

the best course of action is for astronomers to further develop the likelihood and inverse-detection-efficiency

methods until one produces statistically superior results. Ideally, observational techniques would eliminate

the need for assumptions to be made when discerning trends in data.

According to Foreman-Mackey et al. (2014), for a full detailed analysis of planet occurrence rates,

the likelihood method should be applied instead of the inverse-detection-efficiency method if uncertainties are

significant. In a case where catalog data are known to a much greater degree of precision and the

completeness function of a model varies much more smoothly, uncertainties become less significant and

the inverse-detection-efficiency method would be prudent to utilize.

9 Conclusion

In this paper, I presented several comparisons between the statistical methodologies implemented

and assumptions made by different studies and their effects on the results of each study. The values in

question are the occurrence rates of exoplanets orbiting stars in the Kepler mission catalog. An occurrence rate is, essentially,

the average number of planets per star within a given range of parameters. The primary motivation for the studies was to

determine occurrence rates for Earth-like planets orbiting Sun-like stars. I focused on the methods of three

studies:

1. Prevalence of Earth-size planets orbiting Sun-like stars, (Petigura et al. 2013)

2. Exoplanet population inference and the abundance of Earth analogs from noisy, incomplete catalogs,

(Foreman-Mackey et al. 2014)

3. The occurrence of potentially habitable planets orbiting M dwarfs estimated from the full Kepler

dataset and an empirical measurement of the detection sensitivity, (Dressing & Charbonneau 2015)

Each study uses a different method for determining the occurrence rate. Petigura et al. (2013)

use the inverse-detection-efficiency method, wherein they inject fake planet-transit signals into Kepler light

curves, recover the signals, and sort them into bins of log planet radius and log orbital period to determine the

planet occurrence rate. Dressing & Charbonneau (2015) apply an inverse-detection-efficiency method similar

to that of Petigura et al. Their pipeline, however, can also detect multiple planets transiting a single star, whereas

the pipeline used by Petigura et al. can only detect whether any planet transits the star. Foreman-Mackey et al.

(2014) substitute the likelihood method for the inverse-detection-efficiency method. In this method,

each planet candidate is treated as an independent draw from an inhomogeneous Poisson process. The pipeline

yields an integral of the detection efficiency and true occurrence rate density functions, which is used to

calculate the occurrence rate density.
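To make the likelihood method concrete, the sketch below evaluates the log-likelihood of an inhomogeneous Poisson process for the simplest possible model: a rate density that is constant across one bin in ln P and ln R. It is an illustration under stated assumptions, not the code of Foreman-Mackey et al. The completeness function, the candidate list, and all names are hypothetical, and measurement uncertainties are ignored.

```python
import math

LN2 = math.log(2.0)

# One bin in (ln period, ln radius): 200-400 d and 1-2 Earth radii
BOUNDS = ((math.log(200.0), math.log(400.0)), (0.0, LN2))


def toy_completeness(ln_period, ln_radius):
    """Hypothetical detection efficiency, rising linearly with ln radius."""
    return 0.2 + 0.6 * ln_radius / LN2


def expected_detections(gamma, completeness, bounds, n_grid=100):
    """Midpoint-rule integral of completeness * gamma over the bin."""
    (p_lo, p_hi), (r_lo, r_hi) = bounds
    dp = (p_hi - p_lo) / n_grid
    dr = (r_hi - r_lo) / n_grid
    total = 0.0
    for i in range(n_grid):
        for j in range(n_grid):
            total += completeness(p_lo + (i + 0.5) * dp, r_lo + (j + 0.5) * dr)
    return gamma * total * dp * dr


def log_likelihood(gamma, candidates, completeness, bounds):
    """Poisson-process log-likelihood for a constant rate density gamma.

    ln L = sum_k ln[Q(w_k) * gamma] - integral of Q(w) * gamma dw,
    where Q is the detection efficiency and w = (ln P, ln R).
    """
    log_sum = sum(math.log(gamma * completeness(p, r)) for p, r in candidates)
    return log_sum - expected_detections(gamma, completeness, bounds)


# Hypothetical detected candidates as (ln period, ln radius) pairs
CANDIDATES = [
    (math.log(250.0), math.log(1.3)),
    (math.log(300.0), math.log(1.6)),
    (math.log(380.0), math.log(1.9)),
]

# For a constant rate density the maximum-likelihood value is analytic:
# the number of detections divided by the integrated completeness.
GAMMA_HAT = len(CANDIDATES) / expected_detections(1.0, toy_completeness, BOUNDS)
```

A smooth, non-constant rate density would replace the single parameter `gamma` with a function of period and radius, at which point the maximum must be found numerically.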

Each method yields a different value for the occurrence rate or occurrence rate density. The

inverse-detection-efficiency method used by Petigura et al. yields an occurrence rate of $5.7^{+1.7}_{-2.2}\%$ for Earth-size

(1 − 2 R⊕) planets with periods of 200 − 400 d. Converted to an occurrence rate density, this value

is $0.119^{+0.046}_{-0.035}$ nat$^{-2}$. Foreman-Mackey et al.’s likelihood method yields an occurrence rate density value

of $0.019^{+0.019}_{-0.010}$ nat$^{-2}$ for Earth-analogs, which they define along the same lines as Petigura et al. Dressing

& Charbonneau find an occurrence rate of 2.5 ± 0.2 planets per M dwarf for planets where R = 1 − 4 R⊕ and

P < 200 days.
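The conversion from an occurrence rate to an occurrence rate density can be reproduced in a few lines. The sketch below assumes that the rate is spread uniformly over the bin in natural-log period and natural-log radius, so that the density is the rate divided by the bin area in nat². The function name is my own, and only the central value is converted.

```python
import math


def rate_to_density(rate, period_range, radius_range):
    """Convert a binned occurrence rate to a density per nat^2.

    Assumes the rate is flat in ln(period) and ln(radius) across the bin.
    """
    d_ln_period = math.log(period_range[1] / period_range[0])
    d_ln_radius = math.log(radius_range[1] / radius_range[0])
    return rate / (d_ln_period * d_ln_radius)


# Petigura et al.: 5.7% for 1-2 Earth radii and periods of 200-400 d
density = rate_to_density(0.057, (200.0, 400.0), (1.0, 2.0))
```

Both dimensions of this bin span a factor of two, so the bin area is (ln 2)² ≈ 0.48 nat², and the central value of 5.7% maps onto the quoted density of about 0.119 nat⁻².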

The methods of Petigura et al. and Dressing & Charbonneau seem to yield numerically similar

results. While their margins of error do not intersect, both produce relatively low occurrence rates. This

should be expected as they use similar statistical methodologies. Both authors assume that the distribution

of planets in the survey is flat in bins of log orbital period and log planet radius. This is a necessary

assumption in order to correctly apply the inverse-detection-efficiency method and group into bins of log

period and log radius.
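The role of the binning assumption is easier to see in a sketch of the estimator itself. The version below assumes that each detected planet in a bin is weighted by the inverse of its geometric transit probability and of the pipeline completeness at its period and radius, and that the weights are summed and divided by the number of stars searched. The function name and the input numbers are hypothetical, and this is not the code of any of the studies discussed.

```python
def inverse_detection_efficiency_rate(detections, n_stars):
    """Occurrence rate in one bin from inverse-detection-efficiency weights.

    detections: list of (transit_probability, completeness) pairs, one per
                detected planet in the bin (both in the range (0, 1]).
    n_stars:    number of stars searched by the pipeline.
    """
    weighted_count = sum(1.0 / (p_transit * completeness)
                         for p_transit, completeness in detections)
    return weighted_count / n_stars


# Hypothetical bin: two detections among 40,000 searched stars
example_rate = inverse_detection_efficiency_rate(
    [(0.005, 0.5), (0.005, 0.5)], n_stars=40_000
)
```

Each detection stands in for all the similar planets that were missed, which is only meaningful if the planets in a bin share roughly the same detection efficiency. This is why the method leans on the assumption that the distribution is flat within each bin.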

Because Dressing & Charbonneau do not present their data in terms of occurrence rate density,

I can only compare the values given by Foreman-Mackey et al. to those of Petigura et al. Similar to the

results of Dressing & Charbonneau and Petigura et al., the error bars of the occurrence rate density values

of Foreman-Mackey et al. and Petigura et al. do not overlap. The occurrence rate density calculated by

Foreman-Mackey et al. is, in fact, roughly six times smaller than that of Petigura et al. This leads to the

suggestion that the likelihood method and the inverse-detection-efficiency method cannot produce similar

results if applied to the same set of data under similar conditions. This is significant because it means that

further refinement of each method should be done to determine if they are reconcilable. If not, a clear set of

domains should be determined for circumstances to use each method.
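The comparison of the two occurrence rate densities can be verified directly from the quoted values. The short sketch below, with helper names of my own, builds the interval implied by each set of error bars, tests whether the intervals overlap, and computes the ratio of the central values.

```python
def interval(value, plus, minus):
    """Return the (lower, upper) bounds implied by asymmetric error bars."""
    return (value - minus, value + plus)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Occurrence rate densities in nat^-2, as quoted in the text
petigura = interval(0.119, plus=0.046, minus=0.035)
foreman_mackey = interval(0.019, plus=0.019, minus=0.010)

intervals_overlap = overlaps(petigura, foreman_mackey)
ratio = 0.119 / 0.019
```

The lower bound for Petigura et al. (0.084) lies well above the upper bound for Foreman-Mackey et al. (0.038), and the ratio of the central values is a little over six.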

One of the most significant differences between the studies is the assumption that planet candidates

picked up by the pipelines have circular orbits. Petigura et al. and Foreman-Mackey et al. hold this

assumption in their calculations because their methodologies are not robust enough to handle eccentricity

variation. Kipping (2014) published a study that found a sizeable difference between survey completeness

calculations including elliptical orbits and including only circular orbits. In their study, Dressing &

Charbonneau incorporate the factor presented by Kipping to include eccentric orbits in their distribution.

Future studies should more extensively examine the effects of orbital eccentricity on the occurrence rate.
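The size of an eccentricity correction can be illustrated with a short Monte Carlo calculation. The sketch assumes the standard geometric result that, averaged over the orientation of the orbit, the transit probability of an eccentric orbit exceeds that of a circular orbit with the same semi-major axis by a factor of 1/(1 − e²). The Beta-distribution shape parameters are illustrative assumptions rather than values taken from the studies discussed here, and the function name is hypothetical.

```python
import random


def mean_transit_enhancement(alpha, beta, n_samples=100_000, seed=0):
    """Average of 1 / (1 - e^2) over eccentricities drawn from Beta(alpha, beta).

    A value of 1.0 corresponds to purely circular orbits.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Clamp just below 1 so the factor stays finite for extreme draws
        e = min(rng.betavariate(alpha, beta), 0.999999)
        total += 1.0 / (1.0 - e * e)
    return total / n_samples


# Illustrative shape parameters (an assumption, not a fitted result)
enhancement = mean_transit_enhancement(0.867, 3.03)
```

Because the factor is never smaller than one, ignoring eccentricity underestimates the transit probability and therefore overestimates the occurrence rate inferred from a fixed number of detections.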

It is worth comparing the citation metrics of each of these studies. The study by Petigura et

al. (2013) has been cited 274 times since its publication. Those of Foreman-Mackey et al. (2014) and Dressing

& Charbonneau (2015) have been cited 78 and 107 times, respectively. It is likely that Petigura et

al.’s paper is having a greater impact because they published a reliable and extensively documented catalog.

Many papers that cite Petigura et al. attempt to improve on the statistical methodology of the original

study, so any chain of improvements from study to study will always lead back to Petigura et al. This trend

will almost assuredly continue in the future.

References

Petigura, E. A., Howard, A. W., & Marcy, G. W. 2013, Proceedings of the National Academy of Sciences, 110 (Proceedings of the National Academy of Sciences), 19273, http://dx.doi.org/10.1073/pnas.1319909110

Foreman-Mackey, D., Hogg, D. W., & Morton, T. D. 2014, The Astrophysical Journal, 795 (IOP Publishing), 64, http://dx.doi.org/10.1088/0004-637x/795/1/64

Dressing, C. D., & Charbonneau, D. 2015, The Astrophysical Journal, 807 (IOP Publishing), 45, http://dx.doi.org/10.1088/0004-637x/807/1/45

Dressing, C. D., & Charbonneau, D. 2013, The Astrophysical Journal, 767 (IOP Publishing), 95, http://dx.doi.org/10.1088/0004-637x/767/1/95

Kipping, D. M. 2014, Monthly Notices of the Royal Astronomical Society, 444 (Oxford University Press (OUP)), 2263, http://dx.doi.org/10.1093/mnras/stu1561

Morton, T. D., & Johnson, J. A. 2011, The Astrophysical Journal, 738 (IOP Publishing), 170, http://dx.doi.org/10.1088/0004-637x/738/2/170

Fressin, F., Torres, G., Charbonneau, D., et al. 2013, The Astrophysical Journal, 766 (IOP Publishing), 81, http://dx.doi.org/10.1088/0004-637x/766/2/81
