Toledo, 2006-02-25 Databases @ MPA
Databases@MPA, access methods and plans
With contributions from
• JHU: Alex Szalay, Jan Vandenberg
• MPA: Jeremy Blaizot, Jarle Brinchmann, Guinevere Kauffmann, Anja von der Linden, Ben Panter, Guo Qi, Volker Springel, Vivienne Wild
Last year, Budapest
• Presented milli-Millennium halo merger tree database
• Requests:
  – More properties (lambda, ...) X
  – Galaxies V
  – Correlation with environment (galaxies in voids) V
  – Millennium
• Why use databases ? Ask Alex.
Current status
• milli-Millennium
  – Galaxies added: merger trees, links to their parent halos
  – Density field at various smoothings
  – Updated web site (demo)
• Millennium subset
  – Subset (~2%, 10x milli-Mil) of halo and galaxy trees
  – Z=0 density field
• Millennium
  – Halo trees in database (proprietary)
  – SAM galaxies under way (settle on model etc)
  – Density fields at all Z will be added: 1056964608 rows
• Durham
  – milli_Millennium mirror (Postgres)
  – Durham halo tree and galaxy catalogues
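The row count quoted for the density fields can be checked with a line of arithmetic. The sketch assumes one database row per grid cell and uses the 63 grids of 256^3 cells listed on the "Plans: Millennium" slide; the variable names are illustrative.

```python
# Assumption: one row per density grid cell, 63 grids of 256^3 cells each.
cells_per_grid = 256 ** 3
n_grids = 63
total_rows = n_grids * cells_per_grid
print(cells_per_grid, total_rows)
```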
Other databases
• ROSAT: source catalogues and RASS photons (~100 million)
• SDSS Peripherals
  – SDSS_MPA (Brinchmann, Kauffmann, Tremonti et al)
  – MOPED (Ben Panter)
  – SDSS_PCA (Vivienne Wild et al)
• GalICS (Jeremy Blaizot)
• HEALPix all sky maps (Alex Szalay, Tony Banday)
  – WMAP (3 year data soon!)
  – extinction maps
  – radio maps (Bonn)
  – ROSAT background (hopefully)
Access
• Public: http://www.g-vo.org/mpasims
• Local web apps to Millennium, BESTDR3 and peripherals: http://www.g-vo.org/sdssdr3/
• Public web browser queries limited (1 min, 10000 rows)
• Local databases + web apps less limited
Streaming
• Query results are normally buffered temporarily on the server: limited by memory
• Streaming queries: faster, less limited (only timeout)
• Access:
  – IDL (with Ben Panter)
    • wget --http-user=*** --http-password=*** -O localfile.csv "http://www.g-vo.org/sdssdr3/DBQueryStream?SQL=select * from moped..agebin"
    • GUI asking for username/password
    • Interprets CSV stream, turned into IDL components
  – TOPCAT
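As a rough illustration of what a streaming client does with the CSV output, the following Python sketch builds the query URL and parses rows as they arrive. The endpoint and the SQL parameter come from the wget example on this slide. The function names, and the assumption that the first CSV line holds the column names, are illustrative guesses rather than a documented interface.

```python
import csv
from urllib.parse import urlencode

# Endpoint taken from the wget example on this slide.
STREAM_URL = "http://www.g-vo.org/sdssdr3/DBQueryStream"


def build_query_url(sql, base_url=STREAM_URL):
    """Return the streaming URL with the SQL text percent-encoded."""
    return base_url + "?" + urlencode({"SQL": sql})


def iter_rows(lines):
    """Yield one dict per CSV row from an iterable of text lines.

    Assumes (not confirmed by the slides) that the first line holds
    the column names. Rows are produced lazily, one at a time.
    """
    reader = csv.reader(lines)
    try:
        header = next(reader)
    except StopIteration:
        return
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(header, row))
```

In real use the lines would come from an authenticated HTTP response decoded to text; here any iterable of strings works, which keeps the parsing step easy to try out offline.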
Plans: Millennium
• Millennium:
  – Tune database
    • 750000000 halos
    • N x 1000000000 galaxies
    • 63 x 256^3 density field grid cells
  – More halo properties (shape, λ, ...)
  – More galaxy catalogues
    • different parameters
    • different algorithms (GalICS, Durham, ...)
  – Light cone mock catalogues
  – Galaxy spectra (+ PCA)
  – Links to SDSS mirror and peripherals
  – Proper metadata handling (à la SkyServer)
  – "SAM online"
  – Move webapps to MPA
  – Use JHU services, install CasJobs
Plans: SDSS mirror + peripherals
• Make mirror web site public
• Upgrade SDSS mirror to DR4 …
• Stabilize, document, publish SDSS peripherals
• Proper metadata handling
• Links to Millennium
• Personal databases: MyDB (à la SkyServer)
• Add logos
Theory VO: spectra
• Combine theory and observations
• Example: query-by-example on theory spectra
• Find similar spectra and, from these, the actual galaxy formation history
• Chi-squared on all stored spectra? Slow, requires storing all of them
• Idea (not original, see HVO/JHU talks): use PCA to compress data
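To make the cost argument concrete, here is a minimal sketch of the brute-force alternative: a chi-squared comparison of one observed spectrum against every stored theory spectrum. All names are hypothetical, and the sketch assumes the spectra already share a wavelength grid and a normalization.

```python
def chi_squared(observed, errors, model):
    """Sum of squared, error-weighted residuals between two spectra."""
    return sum(((o - m) / e) ** 2 for o, e, m in zip(observed, errors, model))


def best_match(observed, errors, stored_spectra):
    """Return (index, chi2) of the stored spectrum closest to `observed`.

    Every stored spectrum is visited in full, so the cost grows as
    (number of spectra) x (number of wavelength bins).
    """
    best_index, best_chi2 = None, float("inf")
    for index, model in enumerate(stored_spectra):
        chi2 = chi_squared(observed, errors, model)
        if chi2 < best_chi2:
            best_index, best_chi2 = index, chi2
    return best_index, best_chi2
```

Replacing each spectrum by a handful of PCA amplitudes shrinks both the stored data and the per-comparison cost, which is the motivation stated on the slide.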
PCA
• Need training sample of theory spectra to create eigenspectra
• Project all spectra
• Store PCA amplitudes in DB
• Provide web service:
  – Upload (observational) spectrum (IVOA SSA/SED)
  – Project onto theory eigenspectra
  – Use amplitudes as parameters in query for "nearby" amplitudes
  – Return corresponding theory spectra
  – Return corresponding galaxy formation histories, or their halos, or their environment …
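The steps on this slide can be sketched with NumPy as follows. This is a toy illustration under stated assumptions, not the pipeline used for the databases described here: eigenspectra come from an SVD of the mean-subtracted training set, the "nearby" query is a brute-force Euclidean search rather than a database query, the data are random numbers, and all function names are hypothetical.

```python
import numpy as np


def fit_eigenspectra(training, n_components):
    """Return (mean, eigenspectra) from a (n_spectra, n_bins) training array."""
    mean = training.mean(axis=0)
    # Rows of vt are orthonormal directions of decreasing variance.
    _, _, vt = np.linalg.svd(training - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(spectra, mean, eigenspectra):
    """Return PCA amplitudes: one row per spectrum, one column per eigenspectrum."""
    return (np.atleast_2d(spectra) - mean) @ eigenspectra.T


def nearest(amplitudes, query, k=1):
    """Indices of the k stored amplitude vectors closest to `query`."""
    distances = np.linalg.norm(amplitudes - query, axis=1)
    return np.argsort(distances)[:k]


# Toy data standing in for theory spectra (random, purely illustrative).
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 50))
theory = rng.normal(size=(200, 3)) @ basis + 5.0

mean, eig = fit_eigenspectra(theory, n_components=3)
stored = project(theory, mean, eig)                 # what would be stored in the DB
observed = theory[17] + 1e-3 * rng.normal(size=50)  # noisy copy of spectrum 17
match = nearest(stored, project(observed, mean, eig)[0], k=1)
```

Only the amplitudes (three numbers per spectrum in this toy case) need to be searched; the matching index is then used to fetch the full theory spectrum and its associated history.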
Issues
• Dealing with errors, gaps: "gappy PCA" (Connolly & Szalay)
• Normalization:
  – Incoming spectrum in general from very different dataset, needs common normalization
  – Incoming set will have gaps, errors
  – Ad hoc normalization possible (and works quite well)
• Indexing of complex multi-dimensional point set for quick k-nearest-neighbours search (Voronoi? See Laszlo's work)
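For the indexing question, one common alternative to a Voronoi-based index is a k-d tree. The sketch below is a small pure-Python version for k-nearest-neighbour search over amplitude vectors. It is a swapped-in illustration, not the approach referenced on the slide, and the function names are hypothetical.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a k-d tree from (index, vector) pairs; returns nested tuples."""
    if not points:
        return None
    axis = depth % len(points[0][1])
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def knn(tree, query, k):
    """Return the indices of the k points nearest to `query`, closest first."""
    heap = []  # max-heap via negated squared distance: (-d2, index)

    def visit(node):
        if node is None:
            return
        (index, vector), axis, left, right = node
        d2 = sum((a - b) ** 2 for a, b in zip(vector, query))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, index))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, index))
        diff = query[axis] - vector[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Search the far side only if the splitting plane is close enough.
        if len(heap) < k or diff ** 2 < -heap[0][0]:
            visit(far)

    visit(tree)
    return [index for _, index in sorted(heap, reverse=True)]
```

A k-d tree works well for a handful of dimensions, which fits the case where only the leading PCA amplitudes are indexed; its pruning becomes less effective as the number of dimensions grows.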
Normalized gappy PCA
• Fit normalization factor at same time as PCA amplitudes. Model:
• Minimize (over a_i and N):
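The model and objective are not reproduced in this text, so the sketch below assumes one plausible form rather than the authors' exact formulation: the observed flux is modelled as f(λ) ≈ N [μ(λ) + Σ_i a_i e_i(λ)], with μ the mean spectrum and e_i the eigenspectra, and the fit minimizes χ² = Σ_λ w(λ) [f(λ) − N μ(λ) − N Σ_i a_i e_i(λ)]², where w(λ) = 1/σ(λ)² and w(λ) = 0 in gaps. Substituting b_i = N a_i turns this into an ordinary weighted linear least-squares problem. The function and variable names are hypothetical.

```python
import numpy as np


def fit_normalized_gappy(flux, weights, mean, eigenspectra):
    """Fit N and amplitudes a_i for flux ~ N * (mean + sum_i a_i * e_i).

    `weights` is 1/sigma^2 per wavelength bin and 0 where data are missing.
    The model is an assumption, not taken from the slides. Returns (N, a).
    """
    # Unknowns are N and b_i = N * a_i, which enter the model linearly.
    design = np.vstack([mean, eigenspectra]).T  # shape (n_bins, 1 + n_eig)
    sqrt_w = np.sqrt(weights)
    solution, *_ = np.linalg.lstsq(design * sqrt_w[:, None],
                                   flux * sqrt_w, rcond=None)
    norm = solution[0]
    return norm, solution[1:] / norm
```

Bins with zero weight contribute nothing to the fit, so gaps and masked features are handled by the weights alone, without interpolating the incoming spectrum.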
So far
• Ran PCA on BC03 stochastic bursts (Vivienne)
• On first GalICS+milli-Millennium spectra (Jeremy)
• Projected SDSS spectra on both
• Defined a PCA data model/schema
• Stored PCAs in database
• TOPCAT
PCA data model (RDB schema available)
[UML class diagram, not reproduced. Classes: PCADecompositionAlgorithm, SpectrumCatalogue, Spectrum, PhotometryPoint, PCARun, PCASpectrum, PCAEigenSpectrum, PCAPreProcessing, PCAProjectionRun, PCAAmplitudes, PCAProjectionAlgorithm. Attribute and association labels appearing in the diagram: redshift, target, lambda, bin, flux, error, spectrum, spectra, restRedshift : double, algorithm, catalogue, assumedRedshift : double, featureMask, inputSpectra, pcaRank : int, eigenSpectra, mean, variance, wavelengthMask, preprocessing, normalization : double, redshiftShift : double, amplitudes : double, pcaDecomposition.]
Issues for query-by-example
• Overlap quite good, but good enough?
• GalICS spread less than SDSS.
• BC03 comparable with SDSS, but different slope.
• Systematics
  – Model:
    • physics very preliminary (see Blaizot & de Lucia?)
    • resolution effects
  – Preprocessing SDSS galaxies
    • Rebinning: different algorithms give comparable results
    • (slightly) wrong redshift? Can be easily simulated
  – Projection algorithm: normalization does not affect outcome
  – Observational systematics: use virtual telescope (+ virtual spectrograph) to test on the theory spectra. Easier to blow up simulation than to shrink observation cloud