data driven theory: how to derive mathematical models ... fusion … · the data and the estimates...

Data Driven Theory: how to derive mathematical models directly from data

Authors: A. Murari, E. Peluso, M. Lungaroni, P.Gaudio and M. GelfusaMany thanks to J. Vega

• The “scientific process” relies on the formulation of testable

predictions, which implies a dialectic relation between two

domains: conceptual and empirical.

Scientific cycle

• The deduction step is very well

formalised

• The induction step is more an

art than a science and would

benefit from: 1) more flexible

tools for knowledge discovery 2)

a more solid mathematization of

the procedures. Data Driven

Theory

1. A. Murari et al, Entropy 2017, 19, 569; doi:10.3390/e19100569

2. A. Murari et al Nuclear Fusion, Volume 57, Number 1 November 2016

3. A. Murari et al 2016 Nucl. Fusion 56 076008

4. A. Murari et al 2017 Nucl. Fusion 57 126057

5. A.Murari et at Nuclear Fusion, Volume 56, Number 2 (2015)

6. A. Murari et al Plasma Physics and Controlled Fusion (2015),57 (1),

1. A. Murari et al 2016, Nuclear Fusion 56

2. A. Murari et al. Nuclear. Fusion 57 (2017) 016024 2017,

3. A. Murari et al. Nuclear Fusion , 57, Number 12,

September 2017

4. A. Murari et al Nuclear Fusion, Volume 58, Number 5,

March 2018

http://iopscience.iop.org/journal/0029-5515

http://iopscience.iop.org/volume/0029-5515/57

http://iopscience.iop.org/issue/0029-5515/57/1

http://iopscience.iop.org/0029-5515

http://iopscience.iop.org/0029-5515/56

http://iopscience.iop.org/0029-5515/56/2






Data Deluge and Measurements

•The amount of data produced by modern societies is enormous

•JET can produce more than 55 Gbytes of data per shot (potentially about 1 Terabyte per day). Total Warehouse: almost 0.5 Petabytes

•ATLAS can produce up to about 10 Petabytes of data per year

•Hubble Space Telescope in its prime sent to earth up to 5 Gbytesof data per day

•Commercial DVD 4.7 Gbytes (Blue Ray 50 Gbytes).

These amounts of data cannot be analysed manually in a reliable way. Given the complexity of the

phenomena to be studied, there is scope for the development of new data analysis tools particularly in

support to theory formulation!!

Given the complexity of the problems and the amount of data, the inference process needs to be properly structured.

Data Analysis: an overview

Raw

Data

Selection

Target Data

Preprocessing

Model

Transformed

Data

Patterns

Data Mining

Model formulation

Bayesian statistics for error

bars and machine learning

for feature extraction

Various machine learning tools

Outline

I. Symbolic regression/Genetic programming to

extract models directly from the data for better

“physics fidelity” and interpretability

II. Scaling laws (energy confinement time tE ):

exploratory

III. Conclusions

Model Formulation

Logical positioning of the technique

Available Validated

Pre-processed Data

(between about 1 and 10 Gbytes

Model

...Model formulation....

1. Traditional

fitting

2. Symbolic

Regression

Traditional Fitting

A theoretical model of the independent physical quantity as a function of the regressors must be available.

)cos()sin( 211 xxxy ltheoretica +=

)cos()sin( 763

2154121

xxxy fittedbeto +=

Traditional Fitting

NN xxxx ,,...,, 121 −

Symbolic Regression via Genetic Programming

• On the basis of the data available (selection of the dependent quantity and the regressors) the best mathematical model is provided by SR via GP

Symbolic Regression

)cos()sin( 763

2154121mod

xxxy SRby +=

datayNN xxxx ,,...,, 121 −

Genetic Algorithms and Programmes

• GAs and GPs are computational methods (evolutionary

computation) able to solve complex optimization

problems.

• Inspired by the genetic processes of living organisms.

• In nature, individuals of a population compete for

basic resources.

• Those individuals that achieve better surviving capabilities have

higher probabilities to generate descendants.

• GAs and GPs emulate this behavior, in our

application for symbolic manipulation.

• Standard procedure of SR via GP:

1- Generate a random population of individuals

(formulas).

2- Evaluate each individual of the population (formula)

with a fitness function (FF).

3- Select the best fitting individuals (parents) to create a

new population of trees (formulas).

4- Combine the genes (“crossover”) of the chosen

parents and implement mutations, obtaining “children”.

5- Repeat the steps 2 to 4 till an ending condition is

fulfilled.

Genetic Algorithms for Symbolic Regression

• Genes or Knowledge representation

(how to represent formulas)

• Fitness Function (FF)

• Criteria to validate the results

Essential Aspects of Genetic Programming

Three aspects are fundamental in Genetic

Programming:

A. Murari et al Nuclear Fusion, Volume 52, Number 6 May 2012 doi.org/10.1088/0029-

5515/52/6/063016

A. Murari et al 2013 Nucl. Fusion 53 043001 doi:10.1088/0029-5515/53/4/043001

A.Murari et at Nuclear Fusion, Volume 56, Number 2 (2015) doi:10.1088/0029-

5515/56/2/026005

A. Murari et al Plasma Physics and Controlled Fusion (2015),57 (1), doi: 10.1088/0741-

3335/57/1/014008




https://doi.org/10.1088/0029-5515/52/6/063016

http://dx.doi.org/10.1088/0029-5515/53/4/043001



http://iopscience.iop.org/0029-5515/56/2



http://dx.doi.org/10.1088/0741-3335/57/1/014008

Formulas

• Formulas are represented as trees

• Trees have a very high representational capability

• This representation allows an easy implementation

of genetic steps: copy, mutation, cross over

Basis functions

Function class List

Arithmetic constants,+,-,*,/

Trigonometric sin(xi), cos(xi),tan(xi),asin(xi),atan(xi)

Exponential exp(xi),log(xi),power(xi, xj), power(xi,c)

Squashing logistic(xi),step(xi),sign(xi),gauss(xi),tanh(xi), erf(xi),erfc(xi)

Boolean equal(xi, xj),less(xi, xj), less_or_equal(xi, xj), greater(xi, xj),

greater_or_equal(xi, xj), if(xi, xj, xk), and(xi, xj),or(xi, xj),xor(xi,

xj), not(xi)

Other min(xi, xj),max(xi, xj),mod(xi, xj),floor(xi),ceil(xi),

round(xi),abs(xi)

Fitness Function: AIC & BIC

• Akaike Information Criterion (AIC):

• Bayesian Information Criterion (BIC):

L MSE (Mean Square Error of the residuals, the differences between the data and the estimates of the model)

k number of parameters

n number of observations

• The preferred model for AIC (BIC) criterion is the one with the

minimum value of AIC (BIC)

kLAIC 2log2 +=

nkLBIC loglog2 +=

Penalty for

models with

a higher

number of

parameters

Identification of dimensionless quantities

Numerical exercises have been performed: synthetic data have

been generated using dimensionless equations but only the

dimensional quantities have been provided to SR, which has

always been able to identify the original dimensionless quantities.

A well-known law connecting dimensionless quantities in fluid

dynamics is

𝑃𝑒 = 𝑃𝑟 ⋅ 𝑅𝑒

The Peclet number Pe is quantifies the ratio between transferred heat by advection

and diffusion in a fluid. The Prandtl number Pr is defined as the ration between

kinematic and thermal diffusivity; the Reynold number Re takes into account the

relative importance of viscosity for internal layers of a fluid.


The above dimensionless quantities can be written as:

𝑅𝑒 =𝜌⋅𝑢⋅𝑑

𝜇Pr =

𝑐𝑝𝜇

𝑘

Where:

1. 𝜇 is the dynamic viscosity

2. 𝑘 is the thermal conductivity

3. 𝑐𝑝 is the specific heat

4. 𝜌 is the density

5. 𝑢 is the velocity of the fluid

6. 𝑑 is a characteristic linear dimension of the object moving in the fluid


Original equation: 𝑃𝑒 = 𝑃𝑟𝑅𝑒

A noise level up to 30% of the data has been added ot the

variables (with Gaussian distribution)

Equation identified by SR via GP: 𝑃𝑒 = 0.99 ⋅ 𝑃𝑟𝑅𝑒

SR via GP has good capability to extract the dimensionless

quantities relevant to the phenomenon under study

Outline


extract models directly from the data with a

minimum of “a priori” assumptions about their

mathematical form and of “human intervention”


exploratory

III. Conclusions

Energy Confinement Time

ipb98y2

[6]

NPL

AIC BIC MSE [10-3 s2] KLD

ipb98y2 -19416.86 -19362.86 1.866 0.0337

NPL -19660.03 -19599.04 1.724 0.0254

69.078.058.097.119.041.015.093.021062.5 −−= PkRMnBI aE t

),(1067.372.075.0

49.141.1

75.171.1

02.199.0 74.045.173.101.1269.3

66.3 BnhPkRI aE

−−−−=t

( )1

/40.945.032.141.1

37.111.969.9

46.044.0 1),(

−−

+=−−

−−− Bn

enBnh

[4]A. Murari, E. Peluso et al, Plasma Phys. Control. Fusion, 57(1), 2015, doi::/10.1088/0741-3335/57/1/014008[5] E. Peluso, A. Murari,,et al, 41st EPS Conference on Plasma Physics, 2014, P 2.029[6] McDonald D.C et al , Nucl. Fusion (2007), 47:147–174

Dimensional scaling law of confinement time (characteristic time measuring the rate at which the plasma loses energy)

ITPA database in Carbon. Comparison with traditional IPB98y2

The scaling law obtained with SR via GP is not a power law and has better statistical indicators

0.301.090.070.2

3.373.096.08

1 1021.7q

kM aAdPL

t −=

+−= −

−

−40.194.0

23.003.0

42.039.0

22.216.2

67.046.0

41.035.0

12.270.1

18.1060.0

085.016.040.019.2

57.037.093.1

615.1

11.1 )072.0(10)13.1( aa

AdNPL kq

Mk

t

19.005.0

21.194.0 07.017.0

13.0

08.1006.0

011.0 15.0009.0 −−

−

−

− +− Mq

AIC BIC MSE KLD

ipb98y2-> AdPL1 -1650.59 -2533.00 0.55 0.33

AdNPL -13833.00 -13758.91 0.0072 0.056

• A similar analysis has been performed studying the non dimensional product between the confinement time τE and the ion Larmor gyro-frequency

• Actual independent scaling (no possible with log regression)

Energy Conf. Time: dimensionless quantities

tE: extrapolation to JET

To substantiate the extrapolability of the non power law scalings, the various scalings have been obtained for the

small devices and the histograms of the residuals have been calculated for JET

k AIC BIC MSE KLD AdPL1 9 AdPL2 9 . AdNPL 14

Scalings with dimensional and dimensionless regressors

Equation τ [s]

AdNPL

NPL

Power laws 3.66

3.16

2.782.973.31

2.422.83

Excellent independent match and ~20% reduction

Indipendent Scalings with dimensional and dimensionless

regressors: very good agreement

Extrapolation to ITER

Hundreds of thousands of models have been tested. agreement

Outline


extract models directly from the data with a

minimum of “a priori” assumptions about their

mathematical form and of “human intervention”


exploratory

III. Conclusions

Conclusions

Data driven methods1. They try to mathematize also the

phase of hypothesis formulation from observations and data (in analogy to hypothesis formulation from first principles)

2. They try to overcome the dichotomy between model testing and theory from first principles (and the division of labour)

3. They can hopefully lead to new discoveries (total more than the sum of the parts)

First Principles Data

The developed tools are meant to complement traditional theory formulation and computer simulations not to replace them.

Thank You for

Your Attention!

QUESTIONS?

Additional topics and future developments

– SR via GP presents various advantages compared to log regression: no constrained to produce power laws, no assumptions about the noise distribution, less vulnerable to collinearity etc.

– A priori information can be integrated at various levels: election of the basis functions, tree structure, correlation between branches etc.

– More advanced versions of SR via GP are available (Pareto Frontier, better treatment of the errors with Geodesic Distance on Gaussian Manifolds etc.)

– The same techniques can be applied to the results of complex simulations performed using supercomputers.

– Another interesting application is the support to experimental design

• Identification of the functional form of the models with

symbolic regression via genetic programming

• Tuning of the parameters with non linear

fitting(confidence intervals)

• Validation of the results with KL divergence (AIC, BIC),

residuals, two current tests (extrapolability)

Summary of the method

Computational complexity: on a simple 2-core machine (2,4GHz and

8GB RAM Linux machine) the search time takes 8 hours to

process the data of 1000 discharges of the global ITPA database.

Hundreds of thousands of models generated. Some calculations

performed also “in the cloud”.


Original equation: 𝑃𝑒 = 𝑃𝑟𝑃𝑒 =⋅ 𝑐𝑝 ⋅ 𝜌 ⋅ 𝑢 ⋅ 𝑑/𝑘 𝑅𝑒

A noise level up to 30% of the data has been added ot the

variables (with Gaussian distribution)

Equation identified by SR via GP: 𝑃𝑒 = 0.99 ⋅ 𝑐𝑝 ⋅ 𝜌 ⋅ 𝑢 ⋅𝑑

𝑘 𝑅𝑒

SR via GP has good capability to extract the dimensionless

quantities relevant to the phenomenon under study

data driven theory: how to derive mathematical models ... fusion … · the data and the estimates...

Documents