Black-box Confidence Intervals: Excel and Perl Implementation

Posted by Vincent Granville on August 8, 2014, on datasciencecentral.com


Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation in both Excel and Perl, and discuss our popular model-free confidence interval technique introduced in our original Analyticbridge article, as part of our (open source) intellectual property sharing. This technique has the following advantages:

- Very easy to understand by non-statisticians (business analysts, software engineers, programmers, data architects)
- Simple (if not basic) to code; no need to use tables of the Gaussian, Student, or other statistical distributions
- Robust, not sensitive to outliers
- Compatible with traditional techniques when data arises from a Gaussian (normal) distribution
- Model-independent, data-driven: no assumptions required about the data set; it works with non-normal data, and produces asymmetrical confidence intervals
- Therefore, suitable for black-box implementation or automated data science

This is part of our series on data science techniques suitable for automation, usable by non-experts. The next one to be detailed (with source code) will be our Hidden Decision Trees.

Figure 1: Confidence bands based on our CI (bold red and blue curves), compared with the traditional normal model (light red and blue curves)

Figure 1 is based on simulated data that does not follow a normal distribution: see section 2 and Figure 2 in this article. Classical CI's are based on just two parameters, mean and variance: with the classical model, all data sets with the same mean and the same variance have the same CI's. By contrast, our CI's are based on k parameters (the average values computed in k different bins; see the next section for details). In short, they are better predictive indicators when your data is not normal. Yet they are so easy to understand and compute that you don't even need Probability 101 to get started. The attached spreadsheet and Perl scripts have all the computations done for you.

1. General Framework

We assume that we have n observations from a continuous or discrete variable. We randomly assign a bin number to each observation: we create k bins (1 ≤ k ≤ n) that have similar or identical sizes. We compute the average value in each bin, then we sort these averages. Let p(m) be the m-th lowest average (1 ≤ m ≤ k/2, with p(1) being the minimum average). Then our CI is defined as follows:

- Lower bound: p(m)
- Upper bound: p(k-m+1)
- Confidence level, also called level or CI level: equal to 1 - 2m/(k+1)

The confidence level represents the probability that a new observation (from the same data set) will fall between the lower and upper bounds of the CI. Note that this method produces asymmetrical CI's. It is equivalent to designing percentile-based confidence intervals on aggregated data. In practice, k is chosen much smaller than n, say k = SQRT(n). Also, m is chosen so that 1 - 2m/(k+1) is as close as possible to a pre-specified confidence level, for instance 0.95. Note that the higher m is, the more robust (outlier-insensitive) your CI.
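For example, with n = 6,241 observations and k = SQRT(n) = 79 bins, choosing m = 2 yields a confidence level of exactly 1 - 4/80 = 0.95: the CI then runs from the second-lowest bin average, p(2), to the 78th-lowest, p(78).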

If you can't find m and k to satisfy level = 0.95 (say), then compute a few CI's (with different values of m) with confidence levels close to 0.95, then interpolate or extrapolate the lower and upper bounds to get a CI with 0.95 confidence level. The concept is easy to visualize if you look at Figure 1. Also, do proper cross-validation: split your data in two, compute CI's using the first half, and test them on the other half, to see whether they still make sense (same confidence level, etc.).
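To illustrate the interpolation: with k = 99 bins, m = 2 gives level 1 - 4/100 = 0.96 while m = 3 gives 0.94; averaging the two lower bounds and the two upper bounds (linear interpolation in the level) yields an approximate 0.95-level CI.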

CI's are extensively used in quality control, to check whether a batch of new products (say, batteries) has failure rates, lifetimes, or other performance metrics that are acceptable, or whether wine advertised with 12.5% alcohol content has an actual alcohol content reasonably close to 12.5% in each batch, year after year. By "acceptable" or "reasonable", we mean between the upper and lower bounds of a CI with a pre-specified confidence level. CI's are also used in scoring algorithms, to attach a CI to each score; the CI provides an indication of how accurate the score is. Very narrow CI's correspond to data that is well understood, with all sources of variance explained. Conversely, wide CI's mean lots of noise and high individual variance in the data. Finally, if your data is stratified into multiple heterogeneous segments, compute separate CI's for each stratum.

That's it: no need to know even rudimentary statistical science to understand this CI concept, or the concept of hypothesis testing (derived from CI's) explained below in section 3.

When Big Data is Useful

If you look closely at Figure 1, it's clear that you can't compute accurate CI's with a high (above 0.99) level from just a small sample and (say) k = 100 bins. The higher the level, the more volatile the CI. Typically, a 0.999 level requires 10,000 or more observations to get something stable. These high-level CI's are needed especially in the context of assessing failure rates, food quality, fraud detection, or sound statistical litigation. There are ways to work with much smaller samples by combining two tests; see section 3.

An advantage of big data is that you can create many different combinations of k bins (that is, test many values of m and k) to look at how the confidence bands in Figure 1 change depending on the bin selection, even allowing you to create CI's for these confidence bands, just like you could do with Bayesian models.

2. Computations: Excel, Source Code

The first step is to reshuffle your data to make sure that your observations are in perfectly random order: read the A New Big Data Theorem section in this article for an explanation of why reshuffling is necessary (look at the second theorem). In short, you want to create bins that have the same mix of values: if the first half of your data set consisted of negative values and the second half of positive values, you might end up with bins filled exclusively with positive or with negative values. You don't want that; you want each bin to be well balanced.

Reshuffling Step

Unless you know that your data is in an arbitrary order (this is most frequently the case), reshuffling is recommended. Reshuffling can easily be performed as follows:

- Add a column or variable called RAN, made up of simulated random numbers, using a function such as 100,000 + INT(10,000*RAND()), where RAND() returns a random number between 0 and 1.
- Sort your data by column RAN.
- Delete column RAN.

Note that we use 100,000 + INT(10,000*RAND()) rather than simply RAND() to make sure that all the random numbers are integers with the same number of digits. This way, whether you sort alphabetically or numerically, the result will be identical, and correct. Sorting numbers of variable length alphabetically (without knowing it) is a source of many bugs in software engineering; this little trick helps you avoid the problem.
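For those working outside Excel, here is a minimal Perl sketch of the reshuffling and binning steps, using List::Util's shuffle (a Fisher-Yates shuffle) in place of the sort-by-RAN trick; the data and bin count are placeholders:

    # Minimal sketch: put the observations in random order, then assign
    # them round-robin (MOD-style) to k bins of near-identical sizes.
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my @data = map { rand() } 1 .. 1_000;   # placeholder observations
    my $k = 25;                             # placeholder number of bins

    my @shuffled = shuffle(@data);          # same effect as sorting by column RAN

    my @bins;
    for my $i (0 .. $#shuffled) {
        push @{ $bins[$i % $k] }, $shuffled[$i];
    }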

If the order in your data set is important, just add a column that attaches the original rank to each observation (in your initial data set), and keep it through the reshuffling process (after each observation has been assigned to a bin), so that you can always recover the original order if necessary, by sorting back on this extra column.

The Spreadsheet

Download the Excel spreadsheet. Figures 1 and 2 are in the spreadsheet, as well as all the CI computations, and more. The spreadsheet illustrates many not-so-well-known but useful analytic Excel functions, such as FREQUENCY, PERCENTILE, CONFIDENCE.NORM, RAND, AVERAGEIF, MOD (for bin creation), and RANK. The CI computations are in cells O2:Q27 in the Confidence Intervals tab. You can modify the data in column B, and all the CI's will automatically be re-computed. Beware if you change the number of bins (cell F2): this can screw up the RANK function in column J (some ranks will be missing) and then screw up the CI's.

For other examples of great spreadsheets (from a tutorial point of view), check the Excel section in our data science cheat sheet.

Simulated Data

The simulated data in our Excel spreadsheet (see the data simulation tab) represents a mixture of two uniform distributions, driven by the parameters in the orange cells F2, F3, and H2. The 1,000 original simulated values (see Figure 2) were stored in column D, and were subsequently hard-copied into column B in the Confidence Intervals (results) tab (they still reside there), because otherwise, each time you modify the spreadsheet, new deviates produced by the RAND Excel function are automatically generated, changing everything and making our experiment non-reproducible. This is a drawback of Excel, though I've heard that it is possible to freeze the numbers produced by RAND. The simulated data is remarkably non-Gaussian: see Figure 2. It provides a great example of data that causes big problems for traditional statistical science, as described in the following subsection.

In any case, this is an interesting tutorial on how to generate simulated data in Excel. Other examples can be found in our Data Science Cheat Sheet (see the Excel section).
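For readers who prefer code to spreadsheets, here is a minimal Perl sketch of a two-component uniform mixture of this kind; the mixture weight and ranges are illustrative, not the values in cells F2, F3, and H2:

    use strict;
    use warnings;

    my @sim;
    for (1 .. 1_000) {
        # with probability 0.7, draw from U[0, 1]; otherwise from U[4, 9]
        push @sim, (rand() < 0.7) ? rand() : 4 + 5 * rand();
    }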

Comparison with Traditional Confidence Intervals

We provide a comparison with standard CI's (available in all statistical packages) in Figure 1, and in our spreadsheet. There are a few ways to compute traditional CI's:

- Simulate Gaussian deviates with a pre-specified variance matching your data variance, by (1) generating (say) 10 million uniform deviates on [-1, +1] using a good random generator, (2) randomly grouping these values into 10,000 buckets of 1,000 deviates each, and (3) computing the average in each bucket. These 10,000 averages will approximate a Gaussian distribution very well; all you need to do is scale them so that their variance matches the variance in your data set. Then compute intervals that contain 99%, 95%, or 90% of all the scaled averages: these are your standard Gaussian CI's. A sketch of this approach appears after the list.
- Use libraries to simulate Gaussian deviates, rather than the caveman approach mentioned above. Source code and simulators can be found in books such as Numerical Recipes.
- In our Excel spreadsheet, we used the CONFIDENCE.NORM function.
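Here is a minimal Perl sketch of the caveman approach from the first bullet; the target standard deviation is a placeholder for your data's:

    # Average uniform deviates on [-1, +1] and rescale (central limit theorem).
    use strict;
    use warnings;

    my $buckets    = 10_000;
    my $per_bucket = 1_000;
    my $target_sd  = 2.5;       # placeholder: your data's standard deviation

    my @averages;
    for (1 .. $buckets) {
        my $sum = 0;
        $sum += 2 * rand() - 1 for 1 .. $per_bucket;   # uniform on [-1, +1]
        push @averages, $sum / $per_bucket;
    }

    # A uniform on [-1, +1] has variance 1/3, so each bucket average has
    # variance 1/(3 * per_bucket); rescale to match the target variance.
    my $scale  = $target_sd * sqrt(3 * $per_bucket);
    my @sorted = sort { $a <=> $b } map { $_ * $scale } @averages;

    # The middle 95% of the scaled averages is the standard 95% Gaussian CI.
    printf "95%% Gaussian CI: [%.4f, %.4f]\n",
           $sorted[int(0.025 * @sorted)], $sorted[int(0.975 * @sorted)];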

As you can see in Figure 1, traditional CI's are very narrow. Note that inflating the traditional CI's by a factor SQRT(k), that is, replacing $F$6+R3 by $F$6+SQRT($F$2)*R3 in cell S3 in our spreadsheet (with similar adjustments in all cells in columns S and T), leads to CI's similar to ours. Indeed, traditional CI's have been designed for the mean, while ours are designed for bin averages (that is, batch averages in quality control), or even individual values (when n = k). This explains most of the discrepancy. Finally, our methodology is better when n (the number of observations) is small (n < 100), for high confidence levels (> 0.98), or when your data has outliers.


Perl Code

Here's some simple source code to compute the CI for given m and k:
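(The original script is not reproduced in this transcript; the following is a minimal sketch of the computation described in section 1. The subroutine name blackbox_ci and the toy data set are illustrative assumptions.)

    # Shuffle the data into k bins of near-identical sizes, average each bin,
    # sort the averages, and return [p(m), p(k-m+1)] with level 1 - 2m/(k+1).
    use strict;
    use warnings;
    use List::Util qw(shuffle sum);

    sub blackbox_ci {
        my ($data, $k, $m) = @_;
        my @obs = shuffle(@$data);                  # reshuffling step
        my @bins;
        push @{ $bins[$_ % $k] }, $obs[$_] for 0 .. $#obs;
        my @avg = sort { $a <=> $b }
                  map { sum(@$_) / scalar(@$_) } @bins;   # sorted bin averages
        my $level = 1 - 2 * $m / ($k + 1);
        return ($avg[$m - 1], $avg[$k - $m], $level);     # p(m), p(k-m+1)
    }

    my @sample = map { rand() ** 2 } 1 .. 6_241;          # skewed toy data
    my ($lower, $upper, $level) = blackbox_ci(\@sample, 79, 2);
    printf "%.2f-level CI: [%.4f, %.4f]\n", $level, $lower, $upper;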

Exercise: write the code in R or Python.

3. Application to Statistical Testing

Rather than using p-values and other dangerous concepts (about to become extinct) that nobody but statisticians understand, here is an easy way to perform statistical tests. The method below is part of what we call rebel statistical science.

Let's say that you want to test, with 99.5% confidence (level = 0.995), whether or not a wine manufacturer consistently produces a specific wine with a 12.5% alcohol content. Maybe you are a lawyer, and the wine manufacturer is accused of lying on the bottle labels (claiming that the alcohol content is 12.5% when it is really 13%), perhaps to save some money. The test to perform is as follows: check 100 bottles from various batches, and compute a 0.995-level CI for alcohol content. Is 12.5% between the lower and upper bounds? Note that you might not be able to get an exact 0.995-level CI if your sample size n is too small (say n = 100); you will have to extrapolate from lower-level CI's. But the reason to use a high confidence level here is to give the defendant the benefit of the doubt, rather than wrongly accusing him based on too small a confidence level. If 12.5% is found inside even a small 0.50-level CI (which will be the case if the wine is truly 12.5% alcohol), then a fortiori it will be inside a 0.995-level CI, because these CI's are nested (see Figure 1 to understand these ideas). Likewise, if the wine truly has a 13% alcohol content, a tiny 0.03-level CI containing the value 13% will be enough to prove it.

One way to better answer these statistical tests (when your high-level CI's don't provide an answer) is to produce 2 or 3 tests (but no more, otherwise your results will be biased). Test whether the alcohol rate is:

- As declared by the defendant (test #1)
- Equal to a pre-specified value, namely the median computed on a decent sample (test #2)

A sketch of test #1 is shown below.
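This minimal sketch reuses the hypothetical blackbox_ci subroutine above; the alcohol readings are simulated placeholders, and since n = 100 bottles is small, it uses a coarse 0.90-level CI (k = 19, m = 1) rather than the 0.995 level discussed above:

    # Assumes blackbox_ci from the Perl Code section is already defined.
    # Simulate 100 bottle measurements centered on the label's 12.5% claim.
    my @alcohol = map { 12.5 + 0.4 * (rand() - 0.5) } 1 .. 100;

    my ($lo, $hi, $lvl) = blackbox_ci(\@alcohol, 19, 1);
    printf "Claimed 12.5%% is %s the %.2f-level CI [%.3f, %.3f]\n",
           (12.5 >= $lo && 12.5 <= $hi) ? "inside" : "outside",
           $lvl, $lo, $hi;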

4. Miscellaneous

We include two figures in this section. The first one shows the data used in our tests and Excel spreadsheet to produce our confidence intervals. The other shows the theorem that justifies the construction of our confidence intervals.

Figure 2: Simulated data used to compute CI's: asymmetric mixture of non-normal distributions

Figure 3: Theorem used to justify our confidence intervals
