symbolic regression for the derivation of mathematical ... meeting... · symbolic regression for...

33
Symbolic regression for the derivation of mathematical models directly from data A.Murari 1 , M.Gelfusa 2 , P.Gaudio 2 , M.Lungaroni 2 , E.Peluso 2 1 Consorzio RFX Associazione EURATOM/ENEA per la Fusione. Padua, Italy 2 Associazione EURATOM-ENEA - University of Rome “Tor Vergata”, Roma, Italy

Upload: others

Post on 11-Nov-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Symbolic regression for the derivation of mathematical models directly from data

A.Murari1, M.Gelfusa2, P.Gaudio2, M.Lungaroni2, E.Peluso2

1Consorzio RFX Associazione EURATOM/ENEA per la Fusione. Padua, Italy 2 Associazione EURATOM-ENEA - University of Rome “Tor Vergata”, Roma, Italy

Page 2: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Complexity of Magnetised Plasmas

The complexity of magnetised plasmas is due to:

• A different matter state far from equilibrium

• Many variables, long range interactions

• Nonlinear phenomena, non separability

This leads to a hierarchy of descriptions of thermonuclear

plasmas (particle, kinetic, fluid) and to a plethora of ad hoc

models of limited applicability (This plurality of models is not a

fault per se but more a specific characteristic of complex

systems) Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 2

Page 3: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Diagnostics and Data

•JET can produce up to 55 Gbytes of data per shot (potentially about 1 Terabyte per day). Total Database: more than 350 Tbytes

•ATLAS can produce up to about 10 Petabytes of data per year

•Hubble Space Telescope sent to earth up to 5 Gbytes of data per day

•Commercial DVD 4.7 Gbytes.

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 3

Page 4: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

From Data to Models

When handling complex systems, theory formulation from data

can require several steps

Raw

Data

Selection

Target Data

Preprocessing

Model

Transformed

Data

Patterns

Data Mining

Model formulation

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 4

Page 5: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Outline

I. Symbolic regression/Genetic programming to

extract models from the data without “a priori”

assumptions about their mathematical form

and with a minimum of “human intervention”

II. Scaling laws (Pth to reach the H mode, tE)

III. Conclusions

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 5

Page 6: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Genetic Algorithms and Programs

• GAs and GPs are computational methods able to

solve complex optimization problems.

• Inspired by the genetic processes of living organisms.

• In nature, individuals of a population compete for

basic resources.

• Those individuals that achieve better surviving capabilities

have higher probabilities to generate descendants.

• GAs and GPs emulate this behavior.

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 6

Page 7: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

• GAs and GPs work with a population of individuals.

• Each individual represents a possible solution to

a problem.

• The quality of each individual (possible solution)

is evaluated in terms of a fitness function.

• A higher probability to have descendants is

assigned to individuals with better fitness

functions.

Genetic Algorithms and Programs I

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 7

Page 8: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Genetic Algorithms and Programs II

• Genes or Knowledge representation

(how to represent formulas)

• Fitness Function (FF)

• Criteria to validate the results

Three aspects are fundamental in Genetic

Programming:

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 8

Page 9: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

• Formulas are represented as

trees

• Btne20/sin(R)+(Bt-R)

This representation allows an easy implementation

of genetic steps: copy, mutation, cross over

Formulas

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 9

Page 10: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Cross-over

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 10

Page 11: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Function class List

Arithmetic constants,+,-,*,/

Trigonometric sin(xi), cos(xi),tan(xi),asin(xi),atan(xi)

Exponential exp(xi),log(xi),power(xi, xj), power(xi,c)

Squashing logistic(xi),step(xi),sign(xi),gauss(xi),tanh(xi), erf(xi),erfc(xi)

Boolean equal(xi, xj),less(xi, xj), less_or_equal(xi, xj), greater(xi, xj),

greater_or_equal(xi, xj), if(xi, xj, xk), and(xi, xj),or(xi, xj),xor(xi,

xj), not(xi)

Other min(xi, xj),max(xi, xj),mod(xi, xj),floor(xi),ceil(xi),

round(xi),abs(xi)

Genes: function nodes

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 11

Page 12: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Fitness Function: AIC & BIC

• Akaike Information Criterion (AIC):

• Bayesian Information Criterion (BIC):

L MSE (Mean Square Error) k number of parameters n number of observations

• The preferred model for AIC (BIC) criterion is the

one with the minimum value of AIC (BIC)

kLAIC 2log2

nkLBIC loglog2

Penalty for models with a higher number of parameters

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 12

Page 13: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Overview of the Method

• Identification of the functional form of the models with

symbolic regression via genetic programming

• Tuning of the parameters with non linear

fitting(confidence intervals)

• Validation of the results with KL divergence (AIC, BIC),

residuals, two current tests (extrapolability)

Murari A. et al Nucl. Fusion 53 (2013) 043001 (13pp)

Murari A. et al Nucl. Fusion 52 (2012) 063016 (13pp)

Computational complexity: on a simple 2-core machine (2,4GHz and

8GB RAM Linux machine) the search time takes 8 hours to

process the data of 1000 discharges of the global ITPA database.

Hundreds of thousands of models generated. Some calculations

performed also “in the cloud”. Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 13

Page 14: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

• In MCNF practically all scalings are power law

monomials

• Power laws do not present any form of saturation

and therefore tend to diverge

• Power laws are monotonic

• Power laws force the interaction between terms to

be multiplicative (strange exponents)

• Many different mechanisms can give rise to power

laws that are therefore not very informative about

the dynamics of the phenomena

Scalings and Power Laws

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 14

Page 15: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

The method has been applied to the scaling law of the energy confinement time in terms of dimensional quantities

PL1 5.62 ⋅ 10−2𝐼0.93𝐵0.15𝑛0.41𝑀0.19𝑅1.97𝜖0.58𝜅𝑎0.78𝑃−069

PL2 5.55 ⋅ 10−2𝐼0.75𝐵0.32𝑛0.35𝑀0.06𝑅2.0𝜖0.76𝜅𝑎1.14𝑃−062

NPL 0.03670.03660.0369 ⋅ 𝐼1.0060.996

1.016

𝑅1.7311.7121.751

𝜅𝑎1.4501.411

1.489

𝑃−0.735−0.744−0.727

ℎ 𝑛,𝐵

Table IV. Power laws (PLs) and Non Power Law model (NPL). PL1 is the IPB98(y,2) scaling while PL2 is

(EIV)[15]. The term h(n,B) is:

ℎ 𝑛,𝐵 = 𝑛0.4480.4360.460

⋅ 1 + 𝑒−9.403−9.694−9.112 ⋅

𝑛𝐵 −1.365−1.412

−1.318

−1

k AIC BIC KLD

PL1 10 -19416.86 -19362.86 0.0337

PL2 10 -19084.36 -19203.68. 0.0802

NPL 10 -19660.03 -19599.04 0.0254

ITPA Database of tE: dimensional quantities

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 115

Page 16: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Difference significant at high density

k AIC BIC KLD

PLs 10 -5842.20 -6720.79 5.895

NPL 10 -6005.91 -6975.85 3.829

Extrapolation to JET

ITPA Database of tE

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 16

Page 17: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

If the main equations governing the plasma behaviour

are invariant under a certain class of transformations,

also the scaling laws for these plasmas must show the

same invariances.

Dimensionless quantities

The traditional dimensionless quantities are therefore

derived by the symmetries of the Vlasov equations

The Symbolic regression via genetic programming

can also be used to identify dimensionless

quantities in a data driven approach instead of an

hypothesis driven approach.

Symbolic regression via genetic programming allows

finding scalings with dimensionless quantities

(traditional log regression does not)

Page 18: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

To verify the trends, the same approach has been applied to the confinement time in terms of dimensionless quantities

k AIC BIC MSE KLD AdPL1 9 0.3316

AdPL2 9 . 0.1875

AdNPL 14 0.0567

AdPL1

AdPL2

AdNPL

wt

wt

tE: scalings with dimensionless quantities

𝜏 ⋅ 𝜔𝑐𝑖 = 1.13 1.111.15 ⋅ 10−6 ⋅

𝜅𝑎1.931.70

2.12

𝛽0.370.330.41

𝑀0.570.460.67

𝜌2.192.162.22

𝜈0.400.390.42

𝑞0.160.080.23 − 0.072−0.085

−0.060 ⋅ 𝜅𝑎1.180.94

1.40

+

−0.009−0.011−0.006 ⋅ 𝑞1.080.94

1.21+ 0.150.13

0.17 ⋅ 𝑀0.07−0.050.19

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 18

Page 19: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

t: extrapolation to JET

To substantiate the extrapolability of the non power law scalings, the various scalings have been obtained for the small devices and

the residuals have been calculated for JET

Murari A et al, Plasma Phys. Control. Fusion, 57,014008, doi:10.1088/0741-3335/57/1/014008, (2015)

k AIC BIC MSE KLD AdPL1 9 −1650.59 −2533.00 0.5518 0.3316

AdPL2 9 −6034.11 −6744.95. 0.1157 0.1875

AdNPL 14 −13833.82 −13758.91 0.715 ⋅ 10−2 0.0567

Page 20: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Dimensionless quant. identification : Numerical tests

Original equation (used to generate the data)

Symbolic regression via genetic programming is provided

with the dimensional quantities necessary to identify

the traditional dimensionless quantities.

It is then applied to a database generated with the

dimensionless quantities. Noise up to 50%.

Equation identified by Symbolic regression with 30% of noise and using the dimensionless quantities

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 20

Page 21: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Database Quality: Pareto frontier

AdPl1 Eq(8) 𝟕.𝟐𝟏𝟒 ⋅ 𝟏𝟎−𝟖𝐌𝟎.𝟗𝟔𝛜𝟎.𝟕𝟑𝐤𝟑.𝟑𝛒 ∗−𝟐.𝟕𝟎 𝛃−𝟎.𝟗𝟎𝛎 ∗−𝟎.𝟎𝟏 𝐪−𝟑.𝟎

AdPl2 Eq.(9) 𝟏.𝟖𝟑𝟔 ⋅ 𝟏𝟎−𝟖 ⋅ 𝐌𝟎.𝟒𝟗𝐤𝟑.𝟑𝟒𝛒 ∗−𝟐.𝟔𝟕 𝛃−𝟎.𝟓𝟕𝛎−𝟎.𝟏𝟒𝐪−𝟏.𝟖𝟒𝛜−𝟎.𝟏𝟗

D4AdPL1 Eq.(10) 𝟎.𝟏𝟗𝟖𝟎.𝟏𝟗𝟓𝟎.𝟐𝟎𝟎 ⋅ 𝐁𝟎.𝟕𝟕𝟐𝟎.𝟔𝟗𝟖

𝟎.𝟖𝟒𝟓𝐚𝟏.𝟐𝟗𝟖𝟏.𝟏𝟒𝟓

𝟏.𝟒𝟓𝟏𝐈𝟎.𝟖𝟓𝟏𝟎.𝟕𝟖𝟖

𝟎.𝟗𝟏𝟓𝐌−𝟎.𝟕𝟐𝟗−𝟎.𝟖𝟐𝟗

−𝟎.𝟔𝟐𝟖

D4AdPL2 Eq.(11) 𝟎.𝟏𝟎𝟔𝟎.𝟏𝟎𝟒𝟎.𝟏𝟎𝟕 ⋅ 𝐁𝟏.𝟎𝟎𝟗𝟎.𝟗𝟑𝟎

𝟏.𝟎𝟖𝟖𝐚𝟏.𝟓𝟒𝟖𝟏.𝟑𝟗𝟎

𝟏.𝟕𝟎𝟔𝐤𝟏.𝟏𝟏𝟓𝟎.𝟗𝟕𝟐

𝟏.𝟐𝟓𝟕𝐈𝟎.𝟔𝟐𝟐𝟎.𝟓𝟒𝟑

𝟎.𝟔𝟗𝟗𝐌−𝟎.𝟕𝟏𝟓−𝟎.𝟖𝟑𝟖

−𝟎.𝟓𝟗𝟏

D4AdNPL1 Eq.(12) 𝟎.𝟐𝟗𝟗𝟎.𝟐𝟕𝟑𝟎.𝟑𝟐𝟔 ⋅ 𝐁𝟎.𝟕𝟒𝟕𝟎.𝟔𝟕𝟎

𝟎.𝟖𝟐𝟑𝐚𝟎.𝟖𝟏𝟖𝟎.𝟔𝟒𝟓

𝟎.𝟗𝟗𝟏𝐈𝟎.𝟗𝟓𝟗𝟎.𝟖𝟗𝟑

𝟏.𝟎𝟐𝟔𝐟(𝐌,𝐧)

D4AdNPL2 Eq.(13) 𝟎.𝟏𝟐𝟗𝟎.𝟎𝟖𝟗𝟎.𝟏𝟔𝟗 ⋅ 𝐁𝟎.𝟗𝟔𝟖𝟎.𝟖𝟗𝟎

𝟏.𝟎𝟒𝟓𝐚𝟏.𝟎𝟕𝟕𝟎.𝟗𝟎𝟏

𝟏.𝟐𝟓𝟑𝐈𝟎.𝟕𝟑𝟖𝟎.𝟔𝟓𝟖

𝟎.𝟖𝟏𝟕𝐤𝟎.𝟗𝟖𝟗𝟎.𝟖𝟒𝟖

𝟏.𝟏𝟑𝟏𝐌−𝟎.𝟔𝟒𝟕−𝟎.𝟕𝟕𝟕

−𝟎.𝟓𝟐𝟑𝐠(𝐧)

D4AdNPL3 Eq.(14) 𝟎.𝟏𝟖𝟐𝟎.𝟏𝟕𝟓𝟎.𝟏𝟖𝟖 ⋅ 𝐁𝟎.𝟗𝟎𝟖𝟎.𝟖𝟑𝟏

𝟎.𝟗𝟖𝟓𝐚𝟎.𝟗𝟏𝟏𝟎.𝟕𝟔𝟒

𝟏.𝟎𝟓𝟖𝐈𝟎.𝟒𝟑𝟓𝟎.𝟑𝟒𝟐

𝟎.𝟓𝟐𝟖𝐤𝟎.𝟖𝟓𝟗𝟎.𝟕𝟏𝟕

𝟏.𝟎𝟎𝟎𝐌−𝟎.𝟔𝟐𝟐−𝟎.𝟕𝟒𝟗

−𝟎.𝟒𝟗𝟓𝐡(𝐖,𝐧)

D4AdNPL4 Eq.(15) 𝟎.𝟏𝟕𝟒𝟎.𝟏𝟓𝟖𝟎.𝟏𝟖𝟓 ⋅ 𝐁𝟎.𝟗𝟕𝟓𝟎.𝟖𝟖𝟖

𝟏.𝟎𝟔𝟑𝐚𝟏.𝟏𝟎𝟖𝟎.𝟗𝟐𝟒

𝟏.𝟐𝟗𝟒𝐈𝟎.𝟕𝟑𝟎𝟎.𝟔𝟒𝟓

𝟎.𝟖𝟐𝟑𝐤𝟎.𝟗𝟗𝟑𝟎.𝟖𝟐𝟏

𝟎.𝟖𝟐𝟑𝐢(𝐌,𝐧)

Table II: Best scaling laws derived from the survey of the Pareto Frontier obtained with

symbolic regression via genetic programming;the above functions f,g,h,i(x) follow:

AdNPL1 Eq.(12) 𝐟 𝐌,𝐧 = 𝐌−𝟎.𝟖𝟗𝟒−𝟏.𝟐𝟔𝟑−𝟎.𝟓𝟐𝟒

𝟏 + 𝐞−𝟎.𝟖𝟗𝟗−𝟏.𝟐𝟒𝟓−𝟎.𝟓𝟓𝟑𝐌𝟏.𝟏𝟐𝟔𝟎.𝟏𝟑𝟐

𝟐.𝟏𝟐𝟎𝐧−𝟎.𝟕𝟒𝟑−𝟏.𝟏𝟓𝟔

−𝟎.𝟑𝟑𝟎

−𝟏

AdNPL2 Eq.(13) 𝐠 𝐧 = 𝟏 + 𝐞−𝟐.𝟔𝟏𝟎—𝟓.𝟔𝟏𝟓𝟎.𝟑𝟗𝟓 𝐧−𝟎.𝟓𝟔𝟗−𝟎.𝟕𝟐𝟔

−𝟎.𝟒𝟏𝟒

−𝟏

AdNPL3 Eq.(14) 𝐡 𝐖,𝐧 = 𝟏 + 𝐞−𝟏.𝟕𝟑𝟕−𝟐.𝟎𝟑𝟓−𝟏.𝟒𝟑𝟗𝐖𝟏.𝟑𝟕𝟕𝟏.𝟎𝟓𝟏

𝟏.𝟕𝟎𝟒𝐧−𝟏.𝟒𝟒𝟑−𝟏.𝟕𝟏𝟔

−𝟏.𝟏𝟕𝟏

−𝟏

AdNPL4 Eq.(15) 𝐢 𝐌,𝐧 = 𝐌−𝟎.𝟖𝟏𝟑−𝟏.𝟏𝟔𝟗−𝟎.𝟒𝟓𝟓

𝟏 + 𝐞−𝟎.𝟗𝟑𝟒−𝟏.𝟐𝟖𝟐−𝟎.𝟓𝟖𝟔𝐌𝟎.𝟕𝟕𝟏−𝟎.𝟑𝟎𝟗

𝟏.𝟖𝟓𝟏𝐧−𝟎.𝟕𝟎𝟒−𝟏.𝟐𝟓𝟖

−𝟎.𝟏𝟓𝟏

−𝟏

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 21

Page 22: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Models: statistical quality

Model Eq. τ @ITER s

PL1 (8) 3.66

Pl2 (9) 3.29

D4AdPL1 (10)

D4AdPL2 (11)

D4AdNPL1 (12)

D4AdNPL2 (13)

D4AdNPL3 (14)

D4AdNPL4 (15)

Model Eq. k AIC BIC MSE KLD

PL1 (8) 8 -1652.59 -2540.93 𝟎.𝟓𝟓𝟐 0.3312

Pl2 (9) 8 -6036.11 -6752.89 𝟎.𝟏𝟏𝟔 0.1875

D4AdPL1 (10) 5 -13542.346 -13540.276 0.00799 0.09598

D4AdPL2 (11) 6 -13590.778 -13597.021 0.00785 0.06858

D4AdNPL1 (12) 8 -13662.357 -13638.661 0.00764 0.06332

D4AdNPL2 (13) 8 -13690.555 -13678.387 0.00756 0.06854

D4AdNPL3 (14) 9 -13723.038 -13686.454 0.00747 0.05006

D4AdNPL4 (15) 9 -13692.845 -13675.780 0.00755 0.06945

The best models are not power laws and

they do not use the usual dimensionless

quantities.

They estimate a confinement time on ITER

between 2 and 3 s instead of above 3 s

but the best is again around 3 s.

Multiple solutions: lack of severity of the

database Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 22

Page 23: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Conclusions I

I. A methodology has been devised to extract mathematical

models directly from databases without assumptions on

their mathematical form and with minimal human

intervention.

II. The approach is based on symbolic regression via

genetic programming

III. The proposed method covers the entire process, from

the identification of the mathematical form of the models

to the assessment of their quality. Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 23

Page 24: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Conclusions I I

I. Symbolic regression via genetic programming allows

answering questions which cannot be even asked with

traditional tools: the best form of the equations, quality of

databases, selection of dimensionless quantities, etc.

II. Symbolic regression via genetic programming has been

used to derive scaling laws for the Pthres to access the H

mode and tE

III. The ITPA data base has been analysed and the non PL

expressions are better of about one order of magnitude (in

the Kullback-Leibler sense) compared to the power laws.

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 24

Page 25: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

I. The methodology proposed is fully general and can be

easily adjusted to the specific needs of the application

(KDE, errors, time series etc.)

Editors A.Murari and J.Vega.

Many Thanks to PPCF!!!!!

II. Since 2014, a toolbox for symbolic regression via genetic

programming is available in Matlab statistics package

Conclusions III

Andrea Murari | IAEA Fusion Data Processing | NIce | 2015 | Page 25

Page 26: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

ITPA Database Quality: Pareto frontier

FF

Model complexity

Only the best models

(diamonds) for each

level of complexity

are retained.

The database is not

selective enough

(not enough severity)

Only the dimensional quantities required to obtain the

dimensionless ones are given as input to Symbolic

Regression.

Page 27: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

• Power laws are among the most popular statistical

distributions (together with Gaussian and Poisson)

• They are typically written in form of power law monomials:

• Power laws have been applied in a variety of fields ranging

from physics to economics, to geology and social sciences

• Practically all scalings in Fusion are power laws: L-H

transition, confinement, SOL etc

Scaling laws as Power Law Monomials

Page 28: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

• Identification of the functional form of the models and

the most relevant independent variables with symbolic

regression via genetic programming without a priori

assumptions

• Proper treatment of the errors and confidence intervals

(assuming additive Gaussian noise but not necessarily)

• Checking of the parameters with non linear fitting robust

against collinearities (sophisticated gradient descent

methods)

• Validation of the results with KL divergence (AIC, BIC),

residuals

Advantages of the Method

Page 29: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Scalings

Behaviour with density no

reproducible with a power law.

C.Maggi EPS 2012

Scalings of the power threshold Pth to reach the H mode

• Experimental

distribution not really

a power law.

Page 30: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

t dimensionless: extrapolation to ITER

Trend versus elongation

Equation

AdNPL

AdPL1

AdPL2

Again the effect is potentially significant

The estimates of NPL and AdNPL for the entries

of the ITPA database. The general agreement is

evident systematic since they do not present any

discrepancy even if they are two independent

regressions.

Page 31: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

Validation criteria: Kullback-Leibler divergence

The Kullback–Leibler divergence is always non-

negative and zero if and only if p = q

p is interpreted as the reference pdf

(data), q the model

For distributions P and Q of a continuous random variable,

KL-divergence is defined to be the integral:

where p and q denote the densities of P and Q.

Page 32: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

• To study dynamical systems, the FF can be based

on the comparison of the partial derivatives

between pairs of variables in the model (d) and the

data (D).

Dynamical systems: fitness Function

• A model structure is preferred to another if the

derivatives between the dependent and independent

variables are more similar to the corresponding discrete

differences in the database.

FF

Page 33: Symbolic regression for the derivation of mathematical ... Meeting... · Symbolic regression for the derivation of mathematical models directly from data A.Murari1, M.Gelfusa 2, P.Gaudio

•Summarizing, Symbolic Regression (SR) is a technique

that uses GP (a genetic algorithm) to find out a pool of

mathematical individuals, so “functional forms” made up by:

•functions, e.g. sin(), cos() exp(), +, - * / ,…

•constants and variables.

•The results in this presentation have been obtained using

the following function:

Function class List

Arithmetic c (real and integer constants),+,-,*,/

Exponential exp(xi),log(xi), power(xi,c)

Squashing tanh(xi),

Trigonometric sin(xi), cos(xi), sinh(xi), cosh(xi)

List of basic elements used in the SR via GP

Types of function nodes included in the symbolic regression used to derive the

results presented in this paper, xi and xj are the generic independent variables.

Furthermore the general logistic function is defined as