DRepl
Optimizing Access to Application Data for Analysis and Visualization
Latchesar Ionkov, Michael Lang (LANL)
Carlos Maltzahn (UCSC)
HPC Cluster
[Figure: HPC cluster diagram with a head node, compute nodes CN1 ... CNn, I/O nodes IO1 ... IOp, FS nodes, and desktops.]
Data Storage
Data stored in files
Many applications use legacy formats
Data is stored in a format convenient for the producer
In-situ and in-transit data analysis is slow
Objective
Decouple storage data layout from application data layout(s)
Make replicas with different data layouts
Each application working with the data can use a layout that is optimized for it
Allow both materialized (on-storage) and on-the-fly data layouts
Design
Definitions
Dataset -- abstract data model
Views -- how applications see the data
Replicas -- how the data is stored
Provide an easy way to express how applications use the data
[Figure: a Dataset related to two Views and three Replicas.]
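The three definitions above can be sketched as plain data structures. This is a minimal illustration in Go, not DRepl's actual implementation: the type names, the `RecordSize` helper, and the 4-byte field sizes are all assumptions made for the example.

```go
package main

import "fmt"

// Field is one named, fixed-size member of the abstract dataset.
// (Hypothetical type; sizes are in bytes.)
type Field struct {
	Name string
	Size int
}

// Dataset is the abstract data model: it names the data but fixes no layout.
type Dataset struct{ Fields []Field }

// View describes how one application sees the data: which fields, in which order.
type View struct {
	Name   string
	Fields []string
}

// Replica describes how the data is stored: it materializes one or more views.
type Replica struct {
	Name  string
	Views []View
}

// RecordSize returns the number of bytes one record occupies in this view.
func (v View) RecordSize(ds Dataset) (int, error) {
	sizes := make(map[string]int)
	for _, f := range ds.Fields {
		sizes[f.Name] = f.Size
	}
	total := 0
	for _, name := range v.Fields {
		sz, ok := sizes[name]
		if !ok {
			return 0, fmt.Errorf("view %q: unknown field %q", v.Name, name)
		}
		total += sz
	}
	return total, nil
}

func main() {
	// Assumed 4-byte fields; the slides do not give the types for this example.
	ds := Dataset{Fields: []Field{{"pressure", 4}, {"temp", 4}, {"density", 4}}}
	full := View{Name: "full", Fields: []string{"pressure", "temp", "density"}}
	partial := View{Name: "partial", Fields: []string{"pressure", "density"}}
	replicas := []Replica{
		{Name: "r1", Views: []View{full}},
		{Name: "r2", Views: []View{partial}},
	}
	for _, r := range replicas {
		for _, v := range r.Views {
			n, err := v.RecordSize(ds)
			if err != nil {
				fmt.Println(err)
				continue
			}
			fmt.Printf("replica %s, view %s: %d bytes per record\n", r.Name, v.Name, n)
		}
	}
}
```

Keeping views separate from replicas is what lets one stored layout serve several application-facing layouts, or one view be backed by several stored copies.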
Example
[Figure: example dataset psim, an N x M array whose elements hold pressure, temp, and density (e.g. pressure=5.1, temp=33.1, density=0.4), shown alongside View 1, View 2, Replica 1, and Replica 2. The layouts pictured include the full psim array, separate N x M arrays for pressure, temp, and density, and a psim array holding only pressure and density.]
DRepl
Dataset Language
Parser
Replication Engine
File Server

    dataset {
        var p struct { a, b, c float32 }
    }
    view default {
        var p = p
    }
    view viz {
        var pa { a } = p
        var pba { b, a } = p
    }
    replica default { view default }
    replica viz { view viz }

[Figure: DRepl architecture. Labels: Dataset Description, Parser, Conversion Map, Replication Engine, DReplFS, Replicas, OS, Simulation, Visualization, and the views viz and default. The conversion map is drawn as linked blocks labeled S viz:000000, S viz:000004, S default:000000, S default:000004, S default:000008, T viz:000000, T viz:000004, and T default:000000, with field and dest pointers.]
Configurations
[Figure: three deployment configurations, labeled Embedded, Separate, and Burst Buffer. Each shows Applications on Node 1, Node 2, ... Node N, DRepl, and a Parallel FS holding Replica 1, Replica 2, ...; the Burst Buffer configuration adds a Burst Buffer Node.]
Dataset Language
Syntax Similar to C, C++, Java
Dataset
define data types (structs, arrays)
define named data of the types
View(s)
define substructs and subarrays
define named data based on the dataset data
Replica(s)
Dataset Language
Primary types - int8, int16, int32, int64, float32, float64, stringN
Structs
    struct { a, b, c float64 }
Multidimensional arrays
    [50,40,21] Point
Custom types
    type Point int64
Arithmetic expressions in the subarray definitions
    a[i*3, j + 2] = aa[j, i - 1]
Support for different array orders -- row-major, row-minor, in future Hilbert and z-order
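To make the effect of array order concrete, the sketch below computes the linear element offset of one index in a [50,40,21] array under two orders. It is illustrative Go, not DRepl code. The slides do not define "row-minor"; the sketch assumes it means column-major order (first index varies fastest), and the function names are hypothetical.

```go
package main

import "fmt"

// rowMajorOffset returns the linear element index of idx in an array with the
// given dimensions when the LAST index varies fastest.
func rowMajorOffset(dims, idx []int) int {
	off := 0
	for k := range dims {
		off = off*dims[k] + idx[k]
	}
	return off
}

// colMajorOffset returns the linear element index when the FIRST index varies
// fastest. Assumption: this is what the slides call "row-minor".
func colMajorOffset(dims, idx []int) int {
	off := 0
	for k := len(dims) - 1; k >= 0; k-- {
		off = off*dims[k] + idx[k]
	}
	return off
}

func main() {
	dims := []int{50, 40, 21}
	idx := []int{1, 2, 3}
	fmt.Println("row-major offset:   ", rowMajorOffset(dims, idx)) // 885
	fmt.Println("column-major offset:", colMajorOffset(dims, idx)) // 6101
}
```

The same logical element lands 885 elements into the array in one order and 6101 in the other, which is why a reader that walks the array along one axis can be much faster against a replica stored in the matching order.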
Language Example

    dataset {
        const N = 500
        type Data struct {
            a, b, c float32
        }
        var data [N]Data
    }
    view array-of-structs {
        var ds = data
    }
    view struct-of-arrays {
        var a[i]{a} = data[i]
        var b[i]{b} = data[i]
        var c[i]{c} = data[i]
    }
    view ab rowmajor {
        var ab[i]{a,b} = data[i]
    }
    replica array-of-structs {
        view array-of-structs
    }
    replica struct-of-arrays {
        view struct-of-arrays
    }
    replica other {
        view array-of-structs
        view ab
    }
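For readers less familiar with the layouts named in this example, the following Go sketch shows what the struct-of-arrays and ab views would contain for a given array of structs. It is an in-memory illustration only; the type and function names are hypothetical, and it says nothing about how DRepl performs the conversion.

```go
package main

import "fmt"

// Data mirrors the dataset's element type: three float32 fields.
type Data struct{ A, B, C float32 }

// SoA is the struct-of-arrays layout: one contiguous array per field.
type SoA struct{ A, B, C []float32 }

// AB is the element type of the partial "ab" view: only fields a and b.
type AB struct{ A, B float32 }

// toSoA regroups an array of structs into one array per field.
func toSoA(data []Data) SoA {
	s := SoA{
		A: make([]float32, len(data)),
		B: make([]float32, len(data)),
		C: make([]float32, len(data)),
	}
	for i, d := range data {
		s.A[i], s.B[i], s.C[i] = d.A, d.B, d.C
	}
	return s
}

// toAB keeps fields a and b of every element and drops c.
func toAB(data []Data) []AB {
	out := make([]AB, len(data))
	for i, d := range data {
		out[i] = AB{A: d.A, B: d.B}
	}
	return out
}

func main() {
	data := []Data{{1, 2, 3}, {4, 5, 6}}
	fmt.Printf("array of structs: %v\n", data)
	fmt.Printf("struct of arrays: %+v\n", toSoA(data))
	fmt.Printf("ab view:          %v\n", toAB(data))
}
```

An analysis that reads only field b touches one contiguous array in the struct-of-arrays layout, instead of skipping over a and c in every element.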
Subarray Examples

    dataset {
        const N = 500
        const M = 200
        var data [N, M]float32
    }
    view v {
        // flip dimensions
        var flip[i,j] = data[j,i]
        // middle row
        var mr[i] = data[N/2, i]
        // each third element
        var te[i, j] = data[i*3, j*3]
    }
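The three subarray definitions can be read as index mappings from the view back into the dataset. The Go sketch below spells them out on an in-memory array. It is an illustration under assumptions: the slides do not state the extents of the resulting views, so the sketch takes the largest extent for which the source index stays in range, and all function names are hypothetical.

```go
package main

import "fmt"

const (
	N = 500
	M = 200
)

// makeData builds an N x M array where each element encodes its own position.
func makeData() [][]float32 {
	data := make([][]float32, N)
	for i := range data {
		data[i] = make([]float32, M)
		for j := range data[i] {
			data[i][j] = float32(i*M + j)
		}
	}
	return data
}

// flip implements flip[i,j] = data[j,i]; the result is M x N.
func flip(data [][]float32) [][]float32 {
	rows, cols := len(data[0]), len(data)
	out := make([][]float32, rows)
	for i := range out {
		out[i] = make([]float32, cols)
		for j := range out[i] {
			out[i][j] = data[j][i]
		}
	}
	return out
}

// middleRow implements mr[i] = data[N/2, i].
func middleRow(data [][]float32) []float32 {
	src := data[len(data)/2]
	out := make([]float32, len(src))
	copy(out, src)
	return out
}

// everyThird implements te[i,j] = data[i*3, j*3].
// Assumed extent: ceil(N/3) x ceil(M/3), so that i*3 and j*3 stay in range.
func everyThird(data [][]float32) [][]float32 {
	rows, cols := (len(data)+2)/3, (len(data[0])+2)/3
	out := make([][]float32, rows)
	for i := range out {
		out[i] = make([]float32, cols)
		for j := range out[i] {
			out[i][j] = data[i*3][j*3]
		}
	}
	return out
}

func main() {
	data := makeData()
	f, mr, te := flip(data), middleRow(data), everyThird(data)
	fmt.Printf("flip: %d x %d\n", len(f), len(f[0]))
	fmt.Printf("mr:   %d elements\n", len(mr))
	fmt.Printf("te:   %d x %d\n", len(te), len(te[0]))
}
```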
DReplFS
Represent the application data formats (views) as virtual files
Stored data formats (replicas) -- collection of replicas
[Figure: DReplFS between applications (Sim1, Sim2, Viz1, A1) and data layouts labeled ABC, BAC, B, CD, ABCD, AB, and CD.]
Transformation Rules

    dataset {
        var p struct { a, b, c float32 }
    }
    view default {
        var p = p
    }
    view viz {
        var pa { a } = p
        var pba { b, a } = p
    }

[Figure: transformation rules for the two views. View default has block T 0000 (p) with entries S 0000, S 0004, and S 0008 for fields a, b, and c. View viz has blocks T 0000 (pa) and T 0004 (pba) with entries S 0000, S 0004, and S 0008 for fields a, b, and a. Entries are linked by dest pointers.]
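One way to read the diagram is as an offset table: each field occurrence in one view knows where the same field lives in the other view. The Go sketch below builds such a table for the default and viz views. This is an interpretation, not DRepl's data structure: the slides do not define the S and T blocks, and the `Var`, `Slot`, `Entry`, `layout`, and `conversionMap` names are hypothetical. It assumes packed 4-byte float32 fields laid out in declaration order.

```go
package main

import "fmt"

const fieldSize = 4 // bytes per float32 field (assumed packed, no padding)

// Var is one named variable in a view and the dataset fields it holds, in order.
type Var struct {
	Name   string
	Fields []string
}

// Slot is one field occurrence at a byte offset within a view.
type Slot struct {
	Var, Field string
	Off        int
}

// Entry says: the field at source offset Src must be copied to every offset in Dst.
type Entry struct {
	Field string
	Src   int
	Dst   []int
}

// layout assigns consecutive byte offsets to every field occurrence in a view.
func layout(view []Var) []Slot {
	var slots []Slot
	off := 0
	for _, v := range view {
		for _, f := range v.Fields {
			slots = append(slots, Slot{Var: v.Name, Field: f, Off: off})
			off += fieldSize
		}
	}
	return slots
}

// conversionMap lists, for each field occurrence in src, its offsets in dst.
func conversionMap(src, dst []Var) []Entry {
	dstSlots := layout(dst)
	var entries []Entry
	for _, s := range layout(src) {
		e := Entry{Field: s.Field, Src: s.Off}
		for _, d := range dstSlots {
			if d.Field == s.Field {
				e.Dst = append(e.Dst, d.Off)
			}
		}
		entries = append(entries, e)
	}
	return entries
}

func main() {
	def := []Var{{Name: "p", Fields: []string{"a", "b", "c"}}}
	viz := []Var{
		{Name: "pa", Fields: []string{"a"}},
		{Name: "pba", Fields: []string{"b", "a"}},
	}
	for _, e := range conversionMap(def, viz) {
		fmt.Printf("field %s: default:%04d -> viz:%v\n", e.Field, e.Src, e.Dst)
	}
}
```

Under these assumptions, field a at default offset 0 maps to viz offsets 0 and 8, field b at offset 4 maps to viz offset 4, and field c has no destination because the viz view does not include it.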
Implementation
DReplFS -- Parser, Replication Engine, File Server in Go
KDreplFS -- Parser in Go, Replication Engine and File Server in the Linux kernel
Experiments
Dataset

    const N = 176160768
    type Data struct {
        a, b, c float32
    }
    var data [N]Data
Views
array of structs (AOS)
struct of arrays (SOA)
partial (only b)
Replicas
three replicas (AOS, SOA, b)
two replicas (AOS, b)
one replica (AOS)
Each replica on separate SSD
File Servers
pass-through (POSIX)
kdreplfs
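For a sense of scale, the short Go sketch below works out how many bytes each view of the experimental dataset would occupy. It assumes 4-byte float32 fields stored without padding, which the slides do not state explicitly; `viewBytes` is a hypothetical helper.

```go
package main

import "fmt"

const (
	N           = 176160768 // number of elements, from the dataset definition
	float32Size = 4         // bytes per float32 (assumed packed, no padding)
	MiB         = 1 << 20
)

// viewBytes returns the size in bytes of a view that keeps the given number
// of float32 fields from every element.
func viewBytes(fields int) int64 {
	return int64(N) * int64(fields) * float32Size
}

func main() {
	// AOS and SOA both hold all three fields; the partial view holds only b.
	fmt.Printf("AOS / SOA: %d MiB\n", viewBytes(3)/MiB)
	fmt.Printf("partial:   %d MiB\n", viewBytes(1)/MiB)
}
```

Since N is 168 x 2^20, the full views come to 2016 MiB and the partial view to 672 MiB under these assumptions.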
Results: Read
[Figure: bar chart of read bandwidth (MB/s), axis from 0 to 1000, comparing kdreplfs and POSIX for the Array of Structs, Struct of Arrays, and Partial views with 3, 2, and 1 replicas.]
Results: Write
[Figure: bar chart of write bandwidth (MB/s), axis from 0 to 1000, comparing kdreplfs-sync, kdreplfs-async, and POSIX for the Array of Structs, Struct of Arrays, and Partial views with 3, 2, and 1 replicas.]
Results: Combined
[Figure: bar chart of read and write bandwidth (MB/s), axis from 0 to 1400, for kdreplfs with 1 replica, kdreplfs with 2 replicas, and POSIX.]
Future Work
Variable-sized arrays
More array element orders (z-order, Hilbert)
Optimizations
Endianness for primary types
Support for HDF5 replicas
Implementation that doesn’t use file servers
Automatic generation of dataset definition from standard data formats (HDF5, NetCDF)