data driven theory: how to derive mathematical models ... fusion … · the data and the estimates...
TRANSCRIPT
![Page 1: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/1.jpg)
Data Driven Theory: how to derive mathematical models directly from data
Authors: A. Murari, E. Peluso, M. Lungaroni, P.Gaudio and M. GelfusaMany thanks to J. Vega
![Page 2: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/2.jpg)
• The “scientific process” relies on the formulation of testable
predictions, which implies a dialectic relation between two
domains: conceptual and empirical.
Scientific cycle
• The deduction step is very well
formalised
• The induction step is more an
art than a science and would
benefit from: 1) more flexible
tools for knowledge discovery 2)
a more solid mathematization of
the procedures. Data Driven
Theory
1. A. Murari et al, Entropy 2017, 19, 569; doi:10.3390/e19100569
2. A. Murari et al Nuclear Fusion, Volume 57, Number 1 November 2016
3. A. Murari et al 2016 Nucl. Fusion 56 076008
4. A. Murari et al 2017 Nucl. Fusion 57 126057
5. A.Murari et at Nuclear Fusion, Volume 56, Number 2 (2015)
6. A. Murari et al Plasma Physics and Controlled Fusion (2015),57 (1),
1. A. Murari et al 2016, Nuclear Fusion 56
2. A. Murari et al. Nuclear. Fusion 57 (2017) 016024 2017,
3. A. Murari et al. Nuclear Fusion , 57, Number 12,
September 2017
4. A. Murari et al Nuclear Fusion, Volume 58, Number 5,
March 2018
![Page 3: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/3.jpg)
Data Deluge and Measurements
•The amount of data produced by modern societies is enormous
•JET can produce more than 55 Gbytes of data per shot (potentially about 1 Terabyte per day). Total Warehouse: almost 0.5 Petabytes
•ATLAS can produce up to about 10 Petabytes of data per year
•Hubble Space Telescope in its prime sent to earth up to 5 Gbytesof data per day
•Commercial DVD 4.7 Gbytes (Blue Ray 50 Gbytes).
These amounts of data cannot be analysed manually in a reliable way. Given the complexity of the
phenomena to be studied, there is scope for the development of new data analysis tools particularly in
support to theory formulation!!
![Page 4: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/4.jpg)
Given the complexity of the problems and the amount of data, the inference process needs to be properly structured.
Data Analysis: an overview
Raw
Data
Selection
Target Data
Preprocessing
Model
Transformed
Data
Patterns
Data Mining
Model formulation
Bayesian statistics for error
bars and machine learning
for feature extraction
Various machine learning tools
![Page 5: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/5.jpg)
Outline
I. Symbolic regression/Genetic programming to
extract models directly from the data for better
“physics fidelity” and interpretability
II. Scaling laws (energy confinement time tE ):
exploratory
III. Conclusions
![Page 6: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/6.jpg)
Model Formulation
Logical positioning of the technique
Available Validated
Pre-processed Data
(between about 1 and 10 Gbytes
Model
...Model formulation....
1. Traditional
fitting
2. Symbolic
Regression
![Page 7: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/7.jpg)
Traditional Fitting
A theoretical model of the independent physical quantity as a function of the regressors must be available.
)cos()sin( 211 xxxy ltheoretica +=
)cos()sin( 763
2154121
xxxy fittedbeto +=
Traditional Fitting
NN xxxx ,,...,, 121 −
![Page 8: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/8.jpg)
Symbolic Regression via Genetic Programming
• On the basis of the data available (selection of the dependent quantity and the regressors) the best mathematical model is provided by SR via GP
Symbolic Regression
)cos()sin( 763
2154121mod
xxxy SRby +=
datayNN xxxx ,,...,, 121 −
![Page 9: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/9.jpg)
Genetic Algorithms and Programmes
• GAs and GPs are computational methods (evolutionary
computation) able to solve complex optimization
problems.
• Inspired by the genetic processes of living organisms.
• In nature, individuals of a population compete for
basic resources.
• Those individuals that achieve better surviving capabilities have
higher probabilities to generate descendants.
• GAs and GPs emulate this behavior, in our
application for symbolic manipulation.
![Page 10: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/10.jpg)
• Standard procedure of SR via GP:
1- Generate a random population of individuals
(formulas).
2- Evaluate each individual of the population (formula)
with a fitness function (FF).
3- Select the best fitting individuals (parents) to create a
new population of trees (formulas).
4- Combine the genes (“crossover”) of the chosen
parents and implement mutations, obtaining “children”.
5- Repeat the steps 2 to 4 till an ending condition is
fulfilled.
Genetic Algorithms for Symbolic Regression
![Page 11: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/11.jpg)
• Genes or Knowledge representation
(how to represent formulas)
• Fitness Function (FF)
• Criteria to validate the results
Essential Aspects of Genetic Programming
Three aspects are fundamental in Genetic
Programming:
A. Murari et al Nuclear Fusion, Volume 52, Number 6 May 2012 doi.org/10.1088/0029-
5515/52/6/063016
A. Murari et al 2013 Nucl. Fusion 53 043001 doi:10.1088/0029-5515/53/4/043001
A.Murari et at Nuclear Fusion, Volume 56, Number 2 (2015) doi:10.1088/0029-
5515/56/2/026005
A. Murari et al Plasma Physics and Controlled Fusion (2015),57 (1), doi: 10.1088/0741-
3335/57/1/014008
![Page 12: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/12.jpg)
Formulas
• Formulas are represented as trees
• Trees have a very high representational capability
• This representation allows an easy implementation
of genetic steps: copy, mutation, cross over
![Page 13: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/13.jpg)
Basis functions
Function class List
Arithmetic constants,+,-,*,/
Trigonometric sin(xi), cos(xi),tan(xi),asin(xi),atan(xi)
Exponential exp(xi),log(xi),power(xi, xj), power(xi,c)
Squashing logistic(xi),step(xi),sign(xi),gauss(xi),tanh(xi), erf(xi),erfc(xi)
Boolean equal(xi, xj),less(xi, xj), less_or_equal(xi, xj), greater(xi, xj),
greater_or_equal(xi, xj), if(xi, xj, xk), and(xi, xj),or(xi, xj),xor(xi,
xj), not(xi)
Other min(xi, xj),max(xi, xj),mod(xi, xj),floor(xi),ceil(xi),
round(xi),abs(xi)
![Page 14: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/14.jpg)
Fitness Function: AIC & BIC
• Akaike Information Criterion (AIC):
• Bayesian Information Criterion (BIC):
L MSE (Mean Square Error of the residuals, the differences between the data and the estimates of the model)
k number of parameters
n number of observations
• The preferred model for AIC (BIC) criterion is the one with the
minimum value of AIC (BIC)
kLAIC 2log2 +=
nkLBIC loglog2 +=
Penalty for
models with
a higher
number of
parameters
![Page 15: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/15.jpg)
Identification of dimensionless quantities
Numerical exercises have been performed: synthetic data have
been generated using dimensionless equations but only the
dimensional quantities have been provided to SR, which has
always been able to identify the original dimensionless quantities.
A well-known law connecting dimensionless quantities in fluid
dynamics is
𝑃𝑒 = 𝑃𝑟 ⋅ 𝑅𝑒
The Peclet number Pe is quantifies the ratio between transferred heat by advection
and diffusion in a fluid. The Prandtl number Pr is defined as the ration between
kinematic and thermal diffusivity; the Reynold number Re takes into account the
relative importance of viscosity for internal layers of a fluid.
![Page 16: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/16.jpg)
Identification of dimensionless quantities
The above dimensionless quantities can be written as:
𝑅𝑒 =𝜌⋅𝑢⋅𝑑
𝜇Pr =
𝑐𝑝𝜇
𝑘
Where:
1. 𝜇 is the dynamic viscosity
2. 𝑘 is the thermal conductivity
3. 𝑐𝑝 is the specific heat
4. 𝜌 is the density
5. 𝑢 is the velocity of the fluid
6. 𝑑 is a characteristic linear dimension of the object moving in the fluid
![Page 17: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/17.jpg)
Identification of dimensionless quantities
Original equation: 𝑃𝑒 = 𝑃𝑟𝑅𝑒
A noise level up to 30% of the data has been added ot the
variables (with Gaussian distribution)
Equation identified by SR via GP: 𝑃𝑒 = 0.99 ⋅ 𝑃𝑟𝑅𝑒
SR via GP has good capability to extract the dimensionless
quantities relevant to the phenomenon under study
![Page 18: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/18.jpg)
Outline
I. Symbolic regression/Genetic programming to
extract models directly from the data with a
minimum of “a priori” assumptions about their
mathematical form and of “human intervention”
II. Scaling laws (energy confinement time tE ):
exploratory
III. Conclusions
![Page 19: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/19.jpg)
Energy Confinement Time
ipb98y2
[6]
NPL
AIC BIC MSE [10-3 s2] KLD
ipb98y2 -19416.86 -19362.86 1.866 0.0337
NPL -19660.03 -19599.04 1.724 0.0254
69.078.058.097.119.041.015.093.021062.5 −−= PkRMnBI aE t
),(1067.372.075.0
49.141.1
75.171.1
02.199.0 74.045.173.101.1269.3
66.3 BnhPkRI aE
−−−−=t
( )1
/40.945.032.141.1
37.111.969.9
46.044.0 1),(
−−
+=−−
−−− Bn
enBnh
[4]A. Murari, E. Peluso et al, Plasma Phys. Control. Fusion, 57(1), 2015, doi::/10.1088/0741-3335/57/1/014008[5] E. Peluso, A. Murari,,et al, 41st EPS Conference on Plasma Physics, 2014, P 2.029[6] McDonald D.C et al , Nucl. Fusion (2007), 47:147–174
Dimensional scaling law of confinement time (characteristic time measuring the rate at which the plasma loses energy)
ITPA database in Carbon. Comparison with traditional IPB98y2
The scaling law obtained with SR via GP is not a power law and has better statistical indicators
![Page 20: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/20.jpg)
0.301.090.070.2
3.373.096.08
1 1021.7q
kM aAdPL
t −=
+−= −
−
−40.194.0
23.003.0
42.039.0
22.216.2
67.046.0
41.035.0
12.270.1
18.1060.0
085.016.040.019.2
57.037.093.1
615.1
11.1 )072.0(10)13.1( aa
AdNPL kq
Mk
t
19.005.0
21.194.0 07.017.0
13.0
08.1006.0
011.0 15.0009.0 −−
−
−
− +− Mq
AIC BIC MSE KLD
ipb98y2-> AdPL1 -1650.59 -2533.00 0.55 0.33
AdNPL -13833.00 -13758.91 0.0072 0.056
• A similar analysis has been performed studying the non dimensional product between the confinement time τE and the ion Larmor gyro-frequency
• Actual independent scaling (no possible with log regression)
Energy Conf. Time: dimensionless quantities
![Page 21: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/21.jpg)
tE: extrapolation to JET
To substantiate the extrapolability of the non power law scalings, the various scalings have been obtained for the
small devices and the histograms of the residuals have been calculated for JET
k AIC BIC MSE KLD AdPL1 9 AdPL2 9 . AdNPL 14
![Page 22: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/22.jpg)
Scalings with dimensional and dimensionless regressors
Equation τ [s]
AdNPL
NPL
Power laws 3.66
3.16
2.782.973.31
2.422.83
Excellent independent match and ~20% reduction
Indipendent Scalings with dimensional and dimensionless
regressors: very good agreement
Extrapolation to ITER
Hundreds of thousands of models have been tested. agreement
![Page 23: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/23.jpg)
Outline
I. Symbolic regression/Genetic programming to
extract models directly from the data with a
minimum of “a priori” assumptions about their
mathematical form and of “human intervention”
II. Scaling laws (energy confinement time tE ):
exploratory
III. Conclusions
![Page 24: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/24.jpg)
Conclusions
Data driven methods1. They try to mathematize also the
phase of hypothesis formulation from observations and data (in analogy to hypothesis formulation from first principles)
2. They try to overcome the dichotomy between model testing and theory from first principles (and the division of labour)
3. They can hopefully lead to new discoveries (total more than the sum of the parts)
First Principles Data
The developed tools are meant to complement traditional theory formulation and computer simulations not to replace them.
![Page 25: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/25.jpg)
Thank You for
Your Attention!
QUESTIONS?
![Page 26: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/26.jpg)
Additional topics and future developments
– SR via GP presents various advantages compared to log regression: no constrained to produce power laws, no assumptions about the noise distribution, less vulnerable to collinearity etc.
– A priori information can be integrated at various levels: election of the basis functions, tree structure, correlation between branches etc.
– More advanced versions of SR via GP are available (Pareto Frontier, better treatment of the errors with Geodesic Distance on Gaussian Manifolds etc.)
– The same techniques can be applied to the results of complex simulations performed using supercomputers.
– Another interesting application is the support to experimental design
![Page 27: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/27.jpg)
• Identification of the functional form of the models with
symbolic regression via genetic programming
• Tuning of the parameters with non linear
fitting(confidence intervals)
• Validation of the results with KL divergence (AIC, BIC),
residuals, two current tests (extrapolability)
Summary of the method
Computational complexity: on a simple 2-core machine (2,4GHz and
8GB RAM Linux machine) the search time takes 8 hours to
process the data of 1000 discharges of the global ITPA database.
Hundreds of thousands of models generated. Some calculations
performed also “in the cloud”.
![Page 28: Data Driven Theory: how to derive mathematical models ... Fusion … · the data and the estimates of the model) k number of parameters n number of observations • The preferred](https://reader035.vdocuments.mx/reader035/viewer/2022062507/5fdb082cb370264d623f6126/html5/thumbnails/28.jpg)
Identification of dimensionless quantities
Original equation: 𝑃𝑒 = 𝑃𝑟𝑃𝑒 =⋅ 𝑐𝑝 ⋅ 𝜌 ⋅ 𝑢 ⋅ 𝑑/𝑘 𝑅𝑒
A noise level up to 30% of the data has been added ot the
variables (with Gaussian distribution)
Equation identified by SR via GP: 𝑃𝑒 = 0.99 ⋅ 𝑐𝑝 ⋅ 𝜌 ⋅ 𝑢 ⋅𝑑
𝑘 𝑅𝑒
SR via GP has good capability to extract the dimensionless
quantities relevant to the phenomenon under study