Mining the “Virtual Sky” for QSOs and AGNs
University of Napoli Federico II Department of Physical Sciences
G. Longo & the VONeural team
M. Brescia, R. D’Abrusco, S. Cavuoti, C. Donalek (Caltech), G. D’Angelo, E. de Filippis, N. Deniskina, M. Garofalo, A. Nocella, M. Paolillo, B. Skordovski
Purpose of the talk: to convince you that Statistical Data Mining is intrinsically suited for the GRID
• In all methods, convergence is achieved through iterations (multiple instances) of codes
• Programs must run close to the data
• Training of DM algorithms is performed once (no real-time constraints), even for real-time applications (queueing is not a problem)
Any astronomical observation can be regarded as a point in ℝ^D
Almost all astronomical interpretative work falls into one of the following DM categories, performed in ℝ^D:
• Clustering (e.g. QSO/AGN search, FOF, cluster detection, S/G separation, etc.)
• Correlations (e.g. fundamental plane, ETRS, etc.)
• Likelihood, Bayesian
[Figure: astronomical D-dimensional parameter space. Axes: RA, Dec, polarization, wavelength, morphology, flux, non-E.M., time, proper motion]
DM algorithms scale very badly:
• Clustering ~ N log(N) → N^2, ~ D^2
• Correlations ~ N log(N) → N^2, ~ D^k (k ≥ 1)
• Likelihood, Bayesian ~ N^m (m ≥ 3), ~ D^k (k ≥ 1)
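To make the clustering scaling concrete, here is a minimal sketch. The function name and the example sizes are illustrative assumptions, and constant factors are ignored, so only ratios between two configurations carry meaning.

```python
import math

def clustering_cost(n, d, pairwise=True):
    """Relative operation count for clustering n objects in d dimensions.

    Assumes the scalings quoted on the slide: N^2 (pairwise regime) or
    N log N (best case) in the number of objects, times D^2 in the
    dimensionality. Constant factors are ignored.
    """
    n_term = n * n if pairwise else n * math.log(n)
    return n_term * d * d

# Hypothetical sizes: a 10^6-object, 10-parameter catalogue versus
# a 10^8-object, 100-parameter one.
base = clustering_cost(1e6, 10)
bigger = clustering_cost(1e8, 100)
print(f"cost ratio: {bigger / base:.1e}")
```

Under these assumptions, going from the smaller to the larger catalogue multiplies the cost by about 10^6, which is the motivation for distributing the work.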
FURTHERMORE ….
Typical DM algorithms are application independent
• Dimensionality reduction (PCA, non-linear MLPs, etc.)
• Clustering & pattern recognition in multiparametric spaces (PPS, SVM-C, SOM, etc.)
• Classification of patterns (Classification trees, MLP, SVM, etc.)
• Regression (linear, non linear, etc.)
[Diagram labels: DATA; clustering methods 1, 2, …, N; parameter sets n. 1, n. 2, …, n. n; clusterizations 1, 2, …, p]
Clustering is usually performed on single objects, but this approach may be too sensitive to single outliers to be used extensively in highly non-linear parameter spaces such as the astronomical ones. … hence … CORRECT APPROACH
1. Pre-clustering algorithm: this phase can be accomplished by reducing the dimensionality of the feature space; the reduction, via feature extraction/selection, can be supervised or unsupervised (our choice is unsupervised).
2. Agglomerative clustering: different distance definitions and linkage models (single, average, complete, Ward’s, etc.) need to be provided to perform clustering.
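The role of the linkage model in agglomerative clustering can be illustrated with a small pure-Python sketch. It is not the VONeural code: the function names, the Euclidean distance and the toy points are assumptions chosen for illustration, and Ward's linkage is left out for brevity.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linkage_distance(A, B, linkage="average", dist=euclidean):
    """Distance between two clusters (lists of points) under a linkage model."""
    d = [dist(p, q) for p in A for q in B]
    if linkage == "single":
        return min(d)
    if linkage == "complete":
        return max(d)
    if linkage == "average":
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Repeatedly merge the closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Toy data: two well-separated groups in a 2-dim space.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
result = agglomerate(points, n_clusters=2, linkage="complete")
print([len(c) for c in result])
```

In the two-step approach described here, the starting items would be the groups produced by the pre-clustering phase rather than single objects, which is what reduces the sensitivity to outliers.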
N > 10^8, D >> 100, K > 10
[Diagram labels: GRID; HPC; GRID + HPC; GRID (Rixon’s talk)]
All that follows has already been run on the SCOPE-GRID from the VOTech infrastructure (ASTROGRID workbench and Taverna) using our GRID-Launcher.
GRID-Launcher can be used to launch on the GRID any non-interactive application registered in the CEC, and we are working on a version which can also launch most users’ applications (cf. tomorrow’s talk by N. Deniskina).
[Network diagram labels: LAN INFN, Campus Grid, GARR, MAN UNINA, Campus Grid sites, Tier 2, PON UniNA; link speeds from 100/200 Mbps to 10 Gbps (1 Gbps, 2 x 1 Gbps, N x 1 Gbps, 2.5 Gbps, 10 Gbps)]
New HW: 250 four-processor CPUs, Xeon Quad-core EM64T Clovertown E5320, 8 GB RAM
Storage: t.d. 80 TB + 40 TB astronomy only, t.b.e. to 240 TB
Network: fully redundant InfiniBand, 12 PS 12 Gigabit, 2 Fibre Channel-InfiniBand gateways with 2 gates FC4 - 8 gigabit. 230 nodes on InfiniBand; 20 blades for Fibre Channel connectivity.
Pre-existing: ca. 1k CPUs
Campus GRID AstrogridSM LHC Tier 2 (INFN)
GRID-SCOPE is one of the nodes of the South Italy GRID infrastructure (see F. Pasian & U. Becciani’s talks)
http://scopegridice01.dsf.unina.it/site/site.php
Astrophysics in GRID-SCOPE
• Astroparticles & Cosmology
  • Simulations of CR and ν-induced showers – F. Guarino
  • Simulations of NEMO data – G. Barbarino
  • Simulations of Gravitational Waves from various astrophysical sources & VIRGO project Data Analysis – L. Milano
  • Simulations of cosmic string signatures on CMB – G. Longo
  • Primordial Nucleosynthesis – G. Miele
• VO related activity
  • Pipelines for survey data (2dphot) – de Carvalho, Grado, La Barbera, Longo
  • Image segmentation – O. Laurino
  • Supervised, unsupervised & hybrid data mining tools
  • Science cases within VO framework
Data Mining applications
• Astronomy
• Astroparticle physics (AUGER)
• Bioinformatics (gene sequencing)
• Finance (stock market trend analysis)
• Seismology & volcanic risk
We live here
Photometric redshifts: D’Abrusco et al. 2007, ApJ
QSO and AGN search & classification: D’Abrusco, Longo, Walton, 2008, MNRAS in press
AGN classification in photometric spaces: Cavuoti, D’Abrusco, D’Angelo, Longo, 2008, MNRAS, submitted
Science Cases
• Supervised classification
• Unsupervised clustering
• Supervised clustering
Full list of papers (technical, and in other disciplines) available at the VONeural project’s site http://people.na.infn.it/~astroneural/
1. Optical: sample derived from the SDSS database table “Target”, queried for QSO candidates, containing ∼1.11 ⋅ 10^5 records and ∼5.8 ⋅ 10^4 confirmed QSOs (‘specClass == 3 OR specClass == 4’).
2. Optical + NIR: sample derived from the positional matching (‘best’) between two sets: objects in the SDSS-DR3 database view “Star” with spectroscopic follow-up available, detection in all 5 bands (u,g,r,i,z), high reliability for redshift estimation and line-fitting classification (‘specClass’) and high S/N photometry; and UKIDSS-DR1 star-like (‘mergedClass == -1’) objects fully detected in each of the four lasSurvey bands (Y,J,H,K) with clean photometry. This sample consists of 2192 objects.
Application to selection of QSO candidates
Quasar Candidates – Photometric Selection
Traditional way to look for candidate QSOs in a 3-band survey
[Figure labels: candidate QSOs for spectroscopic follow-up; errors; cutoff line; confusion region]
In 4 bands, the degeneracy is partially removed
The more bands, the lower the degeneracy
Higher dimensionality spaces are more effective but much more difficult to handle….
‣ QSOs are assumed to lie more than 4σ away from a hyper-cylindrical region containing the “stellar locus” (S.L.), where σ depends on the photometric errors.
‣ QSOs are also assumed to lie inside the inclusion regions, even when they do not satisfy the previous requirement.
Completeness c = 95%, efficiency e = 65% (locally less)
SDSS QSOs targeting algorithm
SDSS QSO candidate selection algorithm (Richards et al. 2002); SDSS colour space (u-g, g-r, r-i, i-z)
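The 4σ criterion can be sketched geometrically as follows. This is a simplified illustration, not the Richards et al. algorithm: the stellar locus is approximated by a single straight segment with a fixed radius, all numerical values are hypothetical, and the inclusion regions are not modelled.

```python
import math

def distance_to_axis(point, a, b):
    """Euclidean distance from `point` to the segment a-b (the locus axis)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, point)]
    t = sum(x * y for x, y in zip(ap, ab)) / sum(x * x for x in ab)
    t = max(0.0, min(1.0, t))  # clamp the projection to the segment
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(point, closest)

def is_locus_outlier(colours, sigma, a, b, radius, n_sigma=4.0):
    """True if the object lies more than n_sigma * sigma outside the cylinder."""
    return distance_to_axis(colours, a, b) > radius + n_sigma * sigma

# Hypothetical locus axis end points in (u-g, g-r, r-i, i-z) and cylinder radius.
axis_start = (1.0, 0.3, 0.1, 0.0)
axis_end = (2.5, 1.4, 0.6, 0.3)
locus_radius = 0.15

blue_object = (0.2, 0.1, 0.2, 0.1)       # hypothetical QSO-like colours
star_like = (1.75, 0.85, 0.35, 0.15)     # midpoint of the axis
print(is_locus_outlier(blue_object, 0.05, axis_start, axis_end, locus_radius))
print(is_locus_outlier(star_like, 0.05, axis_start, axis_end, locus_radius))
```

Because σ enters the threshold, objects with larger photometric errors need to sit further from the locus before they are flagged, which is one way to read the "locally less" remark on efficiency.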
[Workflow diagram labels: VO DS 1 … VO DS n (SDSS, UKIDSS); parameter subspace; unsupervised clustering; agglomerative clustering; selection of best clustering; Base of Knowledge]
1. Unsupervised clustering (PPS) determines a large number of distinct groups of objects: nearby clusters in colour space are mapped onto the surface of a sphere.
2. An agglomerative algorithm (NEC) aggregates the clusters from PPS into an (a priori unknown) number of final clusters.
3. These clusters are examined and the “interesting” ones are selected through the Base of Knowledge.
VO approach (Taverna based)
My Space
The two free parameters to be set are the number of latent variables for PPS (the “resolution” of the initial clustering) and the critical value(s) of the dissimilarity threshold Dth for NEC (through plateau analysis or dendrogram analysis).
Once the partition is completed (as a function of Dth), clusters mainly populated by QSOs (according to the knowledge base at our disposal) are selected, and information about these clusters is exploited for the candidate QSO selection.
To determine the critical dissimilarity threshold Dth we rely not only on a stability requirement but also on the following definition: a cluster is “successful” if its fraction of confirmed QSOs is higher than a fixed value.
The algorithm aims at maximizing the Normalised Success Ratio (NSR).
The process is recursive: merged unsuccessful clusters are fed back into the clustering pipeline until no other successful clusters are found. The overall efficiency of the process, etot, is the sum of the weighted efficiencies ei of each generation.
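The notion of a "successful" cluster can be illustrated with a short sketch. The threshold value, the function names and the toy labels are assumptions; efficiency is taken here as the fraction of selected objects that are confirmed QSOs, and completeness as the fraction of confirmed QSOs that are selected, which may differ in detail from the definitions used in the talk.

```python
from collections import defaultdict

def successful_clusters(cluster_ids, is_qso, min_fraction=0.7):
    """Map each successful cluster id to its fraction of confirmed QSOs."""
    counts = defaultdict(lambda: [0, 0])  # cluster id -> [n_qso, n_total]
    for cid, q in zip(cluster_ids, is_qso):
        counts[cid][0] += int(q)
        counts[cid][1] += 1
    return {
        cid: nq / n
        for cid, (nq, n) in counts.items()
        if nq / n > min_fraction
    }

def efficiency_completeness(cluster_ids, is_qso, selected):
    """Efficiency and completeness of the objects in the selected clusters."""
    n_sel = sum(cid in selected for cid in cluster_ids)
    n_sel_qso = sum(bool(q) and cid in selected for cid, q in zip(cluster_ids, is_qso))
    n_qso = sum(bool(q) for q in is_qso)
    return n_sel_qso / n_sel, n_sel_qso / n_qso

# Toy knowledge base: cluster membership and spectroscopic confirmation.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
is_qso = [True, True, True, False, False, False, True, True, True, True]

good = successful_clusters(cluster_ids, is_qso, min_fraction=0.7)
e, c = efficiency_completeness(cluster_ids, is_qso, good)
print(sorted(good), round(e, 3), round(c, 3))
```

In the recursive scheme described above, the objects of the unsuccessful clusters (cluster 1 in this toy case) would be merged and passed through the clustering pipeline again.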
VONeural_NEC
Clustering method based on an approximation of “negative entropy”, a generalised measure of the non-Gaussianity of a distribution. For each pair of contiguous (after linear Fisher’s discriminant) clusters A and B in the sample, the relation
NegE(A ∪ B) < Dth (Dth constant)
is checked. When true, A and B are replaced by C = A ∪ B.
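The merging test can be sketched in one dimension. The slides do not say which approximation of negative entropy NEC uses, so the classic moment-based form J ≈ E[y^3]^2 / 12 + kurt(y)^2 / 48 is swapped in here as an assumption; the data are taken to be already projected onto the Fisher direction, and the threshold value is hypothetical.

```python
import math
import random

def negentropy(values):
    """Moment-based negentropy approximation for a 1-dim sample.

    The sample is standardised first; the result is close to 0 for
    Gaussian data and grows with non-Gaussianity.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    y = [(v - mean) / std for v in values]
    skew = sum(t ** 3 for t in y) / n
    kurt = sum(t ** 4 for t in y) / n - 3.0  # excess kurtosis
    return skew ** 2 / 12.0 + kurt ** 2 / 48.0

def should_merge(a, b, d_th):
    """Merge clusters A and B when NegE(A ∪ B) < Dth."""
    return negentropy(a + b) < d_th

rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(2000)]
near = [rng.gauss(0.3, 1.0) for _ in range(2000)]  # overlaps with a
far = [rng.gauss(8.0, 1.0) for _ in range(2000)]   # well separated from a

D_TH = 0.01  # hypothetical threshold
print(should_merge(a, near, D_TH), should_merge(a, far, D_TH))
```

The union of two overlapping groups is nearly Gaussian and passes the test, while the union of two well-separated groups is bimodal, has a large negentropy and is kept split.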
VONeural_PPS
Probabilistic Principal Surfaces are a non-linear extension of principal components which defines a non-linear parametric mapping from a Q-dimensional space, usually called “latent space”, to a D-dimensional space (usually Q << D) (Chang, 2001). Clusters in parameter space are produced from scratch (from the distribution of points).
Algorithms used for QSOs selection
NSR
Choice of the “best” clustering
Confusion matrix (axes: labels, classification)
| | QSOs | not QSOs |
| QSOs | 759 | 72 |
| not QSOs | 83 | 1327 |
c = 89.6 %, e = 83.4 %
Experiment 2: SDSS ∩ UKIDSS
[Panels: u - g vs g - r; r - J vs J - K]
Only a fraction (43%) of these objects was selected as candidate QSOs by the SDSS targeting algorithm in the first instance; the remaining sources were included in the spectroscopic programme because they were serendipitously selected in other spectroscopic programmes (mainly stars).
Experiment 2: efficiency
Experiment 2: completeness
RESULTS
| Sample | Parameters | Labels | etot | ctot | ngen | nsuc_clus |
| Optical QSO candidates (1) | SDSS colours | ‘specClass’ | 83.4 % (± 0.3 %) | 89.6 % (± 0.6 %) | 2 | (3,0) |
| Optical + NIR star-like objects (2) | SDSS colours + UKIDSS colours | ‘specClass’ | 91.3 % (± 0.5 %) | 90.8 % (± 0.5 %) | 3 | (3,1,0) |
| Optical + NIR star-like objects (3) | SDSS colours | ‘specClass’ | 92.6 % (± 0.4 %) | 91.4 % (± 0.6 %) | 3 | (3,0,1) |
II Case: Support Vector Machines and AGN classification
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression.
SVMs map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed. In the “C-Support Vector Classification” implementation (Boser et al. 1992), the SVM depends on two parameters, one in the model (C) and the other in the “kernel function” (γ).
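The slide does not spell out the kernel. Assuming the radial basis function kernel that is the usual choice when a C-SVC is tuned over (γ, C), the role of γ can be sketched as follows; the function name and the sample vectors are illustrative.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) (assumed kernel form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = (0.0, 0.0)
y = (1.0, 1.0)
# A larger gamma makes the similarity fall off faster with distance.
for gamma in (2.0 ** -3, 0.5, 2.0 ** 3):
    print(gamma, rbf_kernel(x, y, gamma))
```

C does not appear in the kernel: it is the penalty on misclassified training points in the optimisation, so γ controls how local the decision surface is and C how strictly the training set is fitted.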
SDSS data used for the BoK
I. AGN catalogue (Sorrentino et al. 2006)
• 0.05 < z < 0.095
• Mr < -20.0
• AGNs selected according to Kewley’s empirical method (Kewley et al. 2001)
II. Emission line ratio catalogue (Kauffmann et al. 2003)
The BoK is formed by objects residing in different regions of the BPT plot (Baldwin, Phillips and Terlevich 1981).
[Figure: BPT plot showing Kewley’s, Kauffmann’s and Heckman’s lines]
Seyfert I: galaxies for which these relations are satisfied: (FWHM(Hα) > 1.5 * FWHM([OIII] λ5007) OR FWHM(Hα) > 1200 km s^-1) AND FWHM([OIII] λ5007) < 800 km s^-1
Seyfert II: all remaining galaxies.
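The Seyfert I / Seyfert II rule above translates directly into a boolean test. The function and argument names are illustrative; both line widths are assumed to be in km/s.

```python
def seyfert_type(fwhm_halpha, fwhm_oiii):
    """Classify a galaxy from the FWHM of Hα and [OIII] λ5007 (both in km/s)."""
    broad_halpha = fwhm_halpha > 1.5 * fwhm_oiii or fwhm_halpha > 1200.0
    narrow_oiii = fwhm_oiii < 800.0
    return "Seyfert I" if broad_halpha and narrow_oiii else "Seyfert II"

# Hypothetical line widths.
print(seyfert_type(2000.0, 500.0))  # broad Hα, narrow [OIII]
print(seyfert_type(600.0, 500.0))   # Hα not broad enough
print(seyfert_type(2000.0, 900.0))  # [OIII] too broad
```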
The Base of Knowledge
[Figure: BPT diagram with axes log([NII]/Hα) and log([OIII]/Hβ), showing Kewley’s, Heckman’s and Kauffmann’s lines]
Photometric parameters used for training of the NNs and SVMs:
petroR50_u, petroR50_g, petroR50_r, petroR50_i, petroR50_z
concentration_index_r, fibermag_r
(u – g)dered, (g – r)dered, (r – i)dered, (i – z)dered
dered_r, photo_z_corr
Target values:
1st experiment: AGN -> 1, Mixed -> 0
2nd experiment: Type 1 -> 1, Type 2 -> 0
3rd experiment: Seyfert -> 1, LINERs -> 0
Photometric parameters
PhotoObjAll table (SDSS-DR5)
Photometric redshifts catalogue based on SDSS-DR5 catalogues (D’Abrusco et al., 2007)
SVM parameter space exploration strategy
The sampling strategy of the 2-dim parameter plane (γ, C) for the SVM (proposed by Hsu, Chang & Lin) consists in running different jobs on a grid whose nodes are spaced by a factor of 4 in both parameters (γ = 2^−15, 2^−13, …, 2^3; C = 2^−5, 2^−3, …, 2^15).
Cross-validation of results and “folding” (5 subsets) of the dataset are used for all experiments.
The SVM experiments for different pairs of values of the parameters (γ, C) have been run on a 120-node grid infrastructure of the SCOPE/COMETA/CYBERSAR Virtual Organization (NA, CT, CA).
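The parameter grid and the 5-fold splitting can be sketched as below. The variable and function names are illustrative, and the interleaved fold assignment is an assumption, since the talk does not say how the subsets were drawn. Each (γ, C) pair is an independent job, which is what makes the search easy to distribute.

```python
from itertools import product

# Exponents step by 2, i.e. consecutive values differ by a factor of 4.
gammas = [2.0 ** k for k in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3
costs = [2.0 ** k for k in range(-5, 16, 2)]   # 2^-5, 2^-3, ..., 2^15
jobs = list(product(gammas, costs))            # one independent job per pair

def k_folds(n_samples, k=5):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    for f in range(k):
        test = indices[f::k]  # every k-th sample, starting at offset f
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

print(len(gammas), len(costs), len(jobs))
for train, test in k_folds(10, k=5):
    print(test)
```

With these ranges the plane contains 10 values of γ and 11 values of C, so 110 jobs, each of which is cross-validated on the 5 folds.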
[Figure labels: run times of 8 min, 6 hrs, 11 hrs and 6 days; GRID domain; need for HPC; N = 10^7, BoK = 10^5]
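Results (II)
| Sample | Parameters | BoK | Algorithm | etot | ctot |
| Experiment (1) | SDSS photometric parameters + photo redshift | BPT plot + Kewley’s line | MLP | ~76% | ~54% |
| Experiment (1) | SDSS photometric parameters + photo redshift | BPT plot + Kewley’s line | SVM | ~74% | ~55% |
| Experiment (2) | SDSS photometric parameters + photo redshift | BPT plot + Kewley’s line | MLP | etyp1 ~95%, etyp2 ~98% | ~100% |
| Experiment (2) | SDSS photometric parameters + photo redshift | BPT plot + Kewley’s line | SVM | etyp1 ~82%, etyp2 ~86% | ~98% |
| Experiment (3) | SDSS photometric parameters + photo redshift | BPT plot + Heckman’s + Kewley’s lines | MLP | ~80% | ~92% |
| Experiment (3) | SDSS photometric parameters + photo redshift | BPT plot + Heckman’s + Kewley’s lines | SVM | ~78% | ~89% |
Cavuoti, D’Abrusco, Longo, 2008, in preparation.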
Some of the ongoing work
1. Improving the accuracy of the photometric parameters by deriving them directly from the images (in coll. with R. De Carvalho and F. La Barbera).
• Porting the 2Dphot pipeline to the GRID to reprocess 73,000 SDSS images (R. de Carvalho, F. La Barbera, VONeural; DONE)
• Integration of 2Dphot with VO_Extractor (in progress)
• Implementation of AGN-Spec (VO tool for automatic analysis of AGN spectra, in coll. with P. Benvenuti, P. Rafanelli et al.; just started)
2. Parallelization of algorithms (P-SVM and P-PPS) (in progress)
3. Automatic extraction of BoKs from VO data sets (algorithm development stage)
Conclusions
• For most DM applications GRID is better than HPC
• Integration of VObs and GRID can be done and works & …..
Integration to Guy’s talk
• Euro-VO can buy an airline almost for free (Alitalia)