
Mining the “Virtual Sky” for QSOs and AGNs

University of Napoli Federico II Department of Physical Sciences

G. Longo & the VONeural team

M. Brescia, R. D’Abrusco, S. Cavuoti, C. Donalek (Caltech), G. D’Angelo, E. de Filippis, N. Deniskina, M. Garofalo, A. Nocella, M. Paolillo, B. Skordovski

Purpose of the talk: to convince you that Statistical Data Mining is intrinsically suited for the GRID

• In all methods, convergence is achieved through iterations (multiple instances) of codes

• Programs must run close to the data

• Training of DM algorithms is performed once (no real-time constraints), even for real-time applications (queueing is not a problem)

Any astronomical observation can be regarded as a point in ℝ^D

Almost all astronomical interpretative work falls into one of the following DM categories, performed in ℝ^D:

• Clustering (e.g. QSO/AGN search, FOF, cluster detection, S/G separation, etc.)

• Correlations (e.g. fundamental plane, ETRS, etc.)

• Likelihood, Bayesian

[Figure: astronomical D-dimensional parameter space, with axes RA, Dec, polarization, wavelength, morphology, flux, non-E.M., time, proper motion]

DM algorithms scale very badly:

• Clustering ~ N log(N) x N^2, ~ D^2

• Correlations ~ N log(N) x N^2, ~ D^k (k ≥ 1)

• Likelihood, Bayesian ~ N^m (m ≥ 3), ~ D^k (k ≥ 1)

FURTHERMORE ….

Typical DM algorithms are application independent

• Dimensionality reduction (PCA, non-linear MLPs, etc.)

• Clustering & pattern recognition in multiparametric spaces (PPS, SVM-C, SOM, etc.)

• Classification of patterns (Classification trees, MLP, SVM, etc.)

• Regression (linear, non linear, etc.)

[Diagram: the DATA are fed to clustering methods 1 … N, each run with parameter sets n. 1 … n. n, yielding clusterizations 1 … p]

1. Pre-clustering algorithm: this phase can be accomplished by reducing the dimensionality of the feature space; this reduction, via feature extraction/selection, can be supervised or unsupervised (our choice is unsupervised).

2. Agglomerative clustering: different distance definitions and linkage models (single, average, complete, Ward’s, etc.) need to be provided to perform clustering.
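The role of the linkage model in step 2 can be illustrated with a minimal sketch. It is a naive agglomerative loop written for readability, not the VONeural implementation; the function name `agglomerate`, the Euclidean distance and the stopping rule (a fixed number of clusters) are assumptions made for the example, and Ward's linkage is left out for brevity.

```python
import numpy as np

def agglomerate(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns a list of lists of point indices."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all objects.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # The linkage model decides how object distances become a cluster distance.
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(dist[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Merge the closest pair of clusters.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Changing `linkage` changes which pair is merged at each step, which is why the choice has to be supplied together with the distance definition.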

Clustering is usually performed on single objects, but this approach may be too sensitive to individual outliers to be used extensively in highly non-linear parameter spaces such as the astronomical ones. … hence … CORRECT APPROACH

[Diagram: computing domains GRID, HPC and GRID + HPC for N > 10^8, D >> 100, K > 10; for the GRID see Rixon’s talk]

All that follows has already been run on the SCOPE-GRID from the VOTech infrastructure (ASTROGRID workbench and Taverna) using our GRID-Launcher.

GRID-Launcher can be used to launch on the GRID any non-interactive application registered in the CEC, and we are working on a version that can also launch most user applications (cf. N. Deniskina’s talk tomorrow).

[Network diagram: LAN INFN, MAN UNINA, GARR, the Campus Grid sites and the Tier 2 / PON UniNA, connected by links ranging from 100/200 Mbps to 10 Gbps]

New HW: 250 four-processor CPUs, Xeon Quad-core EM64T Clovertown E5320, 8 GB RAM

Storage: t.d. 80 TB + 40 TB astronomy only, t.b.e. to 240 TB

Network: InfiniBand, fully redundant, 12 PS 12 Gigabit, 2 Fibre Channel-InfiniBand gateways with 2 gates FC4 - 8 gigabit. 230 nodes on InfiniBand; 20 blades for Fibre Channel connectivity.

Pre-existing: ca. 1k CPUs

Campus GRID AstrogridSM LHC Tier 2 (INFN)

GRID-SCOPE is one of the nodes of the South Italy GRID infrastructure (see F. Pasian’s & U. Becciani’s talks)

http://scopegridice01.dsf.unina.it/site/site.php




Astrophysics in GRID-SCOPE

• Astroparticles & Cosmology
  • Simulations of CR and n induced showers – F. Guarino
  • Simulations of NEMO data – G. Barbarino
  • Simulations of Gravitational Waves from various astrophysical sources & VIRGO project Data Analysis – L. Milano
  • Simulations of cosmic string signatures on CMB – G. Longo
  • Primordial Nucleosynthesis – G. Miele

• VO related activity
  • Pipelines for survey data (2dphot) – de Carvalho, Grado, La Barbera, Longo
  • Image segmentation – O. Laurino
  • Supervised, unsupervised & hybrid data mining tools
  • Science cases within VO framework

Data Mining applications

• Astronomy

• Astroparticle physics (AUGER)

• Bioinformatics (gene sequencing)

• Finance (stock market trend analysis)

• Seismology & volcanic risk

We live here

Photometric redshifts: D’Abrusco et al. 2007, ApJ

QSO and AGN search & classification: D’Abrusco, Longo, Walton, 2008, MNRAS in press

AGN classification in photometric spaces: Cavuoti, D’Abrusco, D’Angelo, Longo, 2008, MNRAS, submitted

Science Cases

• Supervised classification

• Unsupervised clustering

• Supervised clustering

Full list of papers (technical, and in other disciplines) available at the VONeural project’s site http://people.na.infn.it/~astroneural/

1. Optical: sample derived from the SDSS database table “Target”, queried for QSO candidates, containing ∼1.11 ⋅ 10^5 records and ∼5.8 ⋅ 10^4 confirmed QSOs (‘specClass == 3 OR specClass == 4’).

2. Optical + NIR: sample derived from positional matching (‘best’) between two sets: objects from the SDSS-DR3 database view “Star” with spectroscopic follow-up available, detection in all 5 bands (u,g,r,i,z), high reliability for redshift estimation and line-fitting classification (‘specClass’) and high S/N photometry; and UKIDSS-DR1 star-like (‘mergedClass == -1’) objects fully detected in each of the four lasSurvey bands (Y,J,H,K) with clean photometry. This sample contains 2192 objects.

Application to selection of QSO candidates

Quasar Candidates – Photometric Selection

Traditional way to look for candidate QSOs in a 3-band survey

[Figure: colour-colour diagram showing the candidate QSOs for spectroscopic follow-up, the errors, the cutoff line and the confusion region]

In 4 bands, the degeneracy is partially removed

The more bands, the lower the degeneracy

Higher dimensionality spaces are more effective but much more difficult to handle….

‣ QSOs are supposed to lie more than 4σ away from a hyper-cylindrical region containing the “stellar locus” (S.L.), where σ depends on the photometric errors.

‣ QSOs are supposed to lie inside the inclusion regions, even if they do not meet the previous requirement.
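The first criterion can be sketched geometrically. The example below is a strong simplification and not the SDSS targeting code: it assumes the stellar locus is a single straight segment in colour space surrounded by a cylinder of fixed radius, and that σ is one isotropic number. The function names, the segment endpoints and all numerical values are hypothetical.

```python
import numpy as np

def distance_to_segment(p, a, b):
    """Euclidean distance from point p to the segment a-b in colour space."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    # Position of the closest point along the segment, clipped to its ends.
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

def is_locus_outlier(colours, locus_start, locus_end, locus_radius, sigma, n_sigma=4.0):
    """True if the object lies more than n_sigma * sigma outside the cylinder."""
    d = distance_to_segment(colours, locus_start, locus_end)
    return d - locus_radius > n_sigma * sigma
```

With a hypothetical locus running from (0, 0, 0, 0) to (2, 0, 0, 0) in (u-g, g-r, r-i, i-z), a radius of 0.1 and σ = 0.05, an object at (1, 0.5, 0, 0) is flagged while one at (1, 0.2, 0, 0) is not.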

completeness c = 95%, efficiency e = 65% (locally less)

SDSS QSOs targeting algorithm

SDSS QSO candidate selection algorithm (Richards et al. 2002); SDSS colour space (u-g, g-r, r-i, i-z)

[Workflow diagram: VO data sets (VO DS 1 … VO DS n; SDSS, UKIDSS), parameter subspace, unsupervised clustering, agglomerative clustering, selection of best clustering, Base of Knowledge]

1. Unsupervised clustering (PPS) determines a large number of distinct groups of objects: nearby clusters in the colours space are mapped onto the surface of a sphere.

2. Agglomerative algorithm (NEC) aggregates clusters from PPS into an (a priori unknown) number of final clusters.

3. These clusters are examined and “interesting” ones are selected through the Base of knowledge.

VO approach (Taverna based)

My Space

Two free parameters have to be set: the number of latent variables for PPS (the “resolution” of the initial clustering) and the critical value(s) of the dissimilarity threshold Dth for NEC (through plateau analysis or dendrogram analysis).

Once the partition is completed (as a function of Dth), clusters mainly populated by QSOs (according to the knowledge base at our disposal) are selected, and information about these clusters is exploited for the candidate QSO selection.

To determine the critical dissimilarity threshold Dth we rely not only on a stability requirement. Given the following definition:

A cluster is “successful” if its fraction of confirmed QSOs is higher than a fixed value.

The algorithm aims at maximizing the Normalised Success Ratio (NSR).

The process is recursive: merged unsuccessful clusters are fed back into the clustering pipeline until no further successful clusters are found. The overall efficiency of the process, etot, is the sum of the weighted efficiencies ei of each generation.
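The definition of a “successful” cluster translates directly into a few lines of code. The sketch below is illustrative only: the function names and the threshold argument `min_fraction` are hypothetical, and the efficiency and completeness use the usual definitions (fraction of selected objects that are confirmed QSOs, and fraction of confirmed QSOs that are selected), which are assumed here rather than taken from the slides. The NSR and the per-generation weights are not reproduced.

```python
def successful_clusters(cluster_ids, is_qso, min_fraction):
    """Map each successful cluster id to its fraction of confirmed QSOs."""
    counts = {}
    for cid, qso in zip(cluster_ids, is_qso):
        total, confirmed = counts.get(cid, (0, 0))
        counts[cid] = (total + 1, confirmed + int(bool(qso)))
    # A cluster is successful if its QSO fraction exceeds the fixed value.
    return {cid: c / t for cid, (t, c) in counts.items() if c / t > min_fraction}

def efficiency_completeness(cluster_ids, is_qso, selected):
    """Efficiency and completeness of selecting all members of `selected` clusters."""
    n_selected = sum(cid in selected for cid in cluster_ids)
    n_true = sum(bool(q) for q in is_qso)
    n_hit = sum(bool(q) for cid, q in zip(cluster_ids, is_qso) if cid in selected)
    efficiency = n_hit / n_selected if n_selected else 0.0
    completeness = n_hit / n_true if n_true else 0.0
    return efficiency, completeness
```

In the recursive scheme described above, the members of the unsuccessful clusters would be merged and passed through the clustering again, with these two functions applied at each generation.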

VONeural_NEC

Clustering method based on an approximation of “negative entropy” (negentropy), a generalised measure of the non-Gaussianity of a distribution. For each pair of contiguous clusters A and B in the sample (contiguity being assessed after applying Fisher’s linear discriminant), the relation

NegE(A ∪ B) < Dth (Dth constant)

is checked. When true, A and B are replaced by C = A ∪ B.
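A minimal sketch of this merging test follows. It is not the VONeural_NEC code: the slides do not say which negentropy approximation is used, so the example assumes the common log-cosh approximation J(y) ≈ (E[G(y)] − E[G(ν)])^2 with G = log cosh and ν a standard Gaussian, evaluated on the union of the two clusters projected onto Fisher's discriminant direction. All function names are hypothetical.

```python
import numpy as np

def gaussian_reference(order=64):
    """E[log cosh(nu)] for a standard Gaussian nu, by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite_e.hermegauss(order)
    return float(np.sum(w * np.log(np.cosh(x))) / np.sqrt(2.0 * np.pi))

def negentropy(y):
    """Log-cosh approximation of the negentropy of a 1-D sample."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()  # zero mean, unit variance
    return float((np.mean(np.log(np.cosh(y))) - gaussian_reference()) ** 2)

def fisher_direction(a, b):
    """Fisher's linear discriminant direction separating samples a and b."""
    scatter = np.cov(a, rowvar=False) + np.cov(b, rowvar=False)
    return np.linalg.solve(scatter, a.mean(axis=0) - b.mean(axis=0))

def should_merge(a, b, d_th):
    """True if the union of a and b looks Gaussian enough: NegE(A u B) < Dth."""
    projected = np.concatenate([a, b]) @ fisher_direction(a, b)
    return negentropy(projected) < d_th
```

Two halves of one Gaussian blob give a union with negentropy close to zero and are merged, while two well-separated blobs give a bimodal projection with larger negentropy and stay apart, which is the behaviour the threshold Dth is meant to control.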

VONeural_PPS

Probabilistic Principal Surfaces are a non-linear extension of principal components that defines a non-linear parametric mapping from a Q-dimensional space, usually called the “latent space”, to the D-dimensional space (usually Q << D) (Chang, 2001). Clusters in parameter space are produced from scratch (from the distribution of points).

Algorithms used for QSOs selection

NSR

Choice of the “best” clustering

Confusion matrix (labels vs classification):

            QSOs   not QSOs
QSOs         759         72
not QSOs      83       1327

c = 89.6 % e = 83.4 %

Experiment 2: SDSS ∩ UKIDSS

[Figure panels: u - g vs g - r; r - J vs J - K]

Only a fraction (43%) of these objects was selected as candidate QSOs by the SDSS targeting algorithm in the first instance; the remaining sources were included in the spectroscopic programme because they were serendipitously selected in other spectroscopic programmes (mainly stars).

Experiment 2: efficiency

Experiment 2: completeness

RESULTS

Sample                              | Parameters                    | Labels      | etot             | ctot             | ngen | nsuc_clus
Optical QSO candidates (1)          | SDSS colours                  | ‘specClass’ | 83.4 % (± 0.3 %) | 89.6 % (± 0.6 %) | 2    | (3,0)
Optical + NIR star-like objects (2) | SDSS colours + UKIDSS colours | ‘specClass’ | 91.3 % (± 0.5 %) | 90.8 % (± 0.5 %) | 3    | (3,1,0)
Optical + NIR star-like objects (3) | SDSS colours                  | ‘specClass’ | 92.6 % (± 0.4 %) | 91.4 % (± 0.6 %) | 3    | (3,0,1)

II Case: Support Vector Machines and AGN classification

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression.

SVMs map input vectors to a higher-dimensional space where a maximum-margin separating hyperplane is constructed. In the “C-Support Vector Classification” implementation (Boser et al. 1992), the SVM depends on two parameters, one in the model (C) and the other in the “kernel function” (γ).

SDSS data used for the BoK

I. AGN catalogue (Sorrentino et al. 2006)

• 0.05 < z < 0.095

• Mr < -20.0

• AGNs selected according to Kewley’s empirical method (Kewley et al. 2001)

II. Emission lines ratio catalogue (Kauffmann et al. 2003)

The BoK is formed by objects residing in different regions of the BPT plot (Baldwin, Phillips and Terlevich 1981), delimited by Kewley’s, Kauffmann’s and Heckman’s lines.

Seyfert I: galaxies for which these relations are satisfied: (FWHM(Hα) > 1.5 × FWHM([OIII] λ5007) OR FWHM(Hα) > 1200 km s^-1) AND FWHM([OIII] λ5007) < 800 km s^-1

Seyfert II: all remaining galaxies.
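The Seyfert I / Seyfert II rule above can be written as a small function. This is only a transcription of the stated inequalities; the function and argument names are hypothetical, and the widths are assumed to be given in km/s.

```python
def seyfert_type(fwhm_halpha, fwhm_oiii):
    """Classify a Seyfert galaxy from line widths (FWHM in km/s)."""
    # Broad H-alpha: either relative to [OIII] 5007 or in absolute terms.
    broad_halpha = fwhm_halpha > 1.5 * fwhm_oiii or fwhm_halpha > 1200.0
    # [OIII] 5007 must remain narrow.
    narrow_oiii = fwhm_oiii < 800.0
    return "Seyfert I" if broad_halpha and narrow_oiii else "Seyfert II"
```

For example, `seyfert_type(2000.0, 400.0)` gives "Seyfert I", whereas `seyfert_type(1500.0, 900.0)` gives "Seyfert II" because the [OIII] line is too broad.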

The Base of Knowledge

[Figure: BPT diagram, log([OIII]/Hβ) versus log([NII]/Hα), with Kewley’s, Heckman’s and Kauffmann’s lines]

Photometric parameters used for training of the NNs and SVMs:

petroR50_u, petroR50_g, petroR50_r, petroR50_i, petroR50_z

concentration_index_r, fibermag_r

(u – g)dered, (g – r)dered, (r – i)dered, (i – z)dered

dered_r, photo_z_corr

Target values:

1st experiment: AGN -> 1, Mixed -> 0

2nd experiment: Type 1 -> 1, Type 2 -> 0

3rd experiment: Seyfert -> 1, LINERs -> 0

Photometric parameters

PhotoObjAll table (SDSS-DR5)

Photometric redshifts catalogue based on SDSS-DR5 catalogues (D’Abrusco et al., 2007)


SVM parameter space exploration strategy

The sampling strategy of the 2-dim parameter plane (γ, C) for the SVM (proposed by Hsu, Chang and Lin) consists in running different jobs on a grid whose nodes are spaced by a factor of 4 in both parameters (γ = 2^-15, 2^-13, …, 2^3; C = 2^-5, 2^-3, …, 2^15).

Cross-validation of results and “folding” (5 subsets) of the dataset are used for all experiments.

The SVM experiments for different pairs of values of the parameters (γ, C) have been run on a 120-node grid infrastructure of the SCOPE/COMETA/CYBERSAR Virtual Organization (NA, CT, CA).
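The exploration strategy lends itself to a short sketch, since every (γ, C) pair is an independent job. The code below only builds the list of pairs and a 5-fold split; the function names, the shuffling seed and the round-robin fold assignment are assumptions made for the example, and the SVM training itself is not shown.

```python
import itertools
import random

def parameter_grid():
    """All (gamma, C) pairs, with exponents stepping by 2 (a factor of 4)."""
    gammas = [2.0 ** k for k in range(-15, 4, 2)]   # 2^-15 ... 2^3
    costs = [2.0 ** k for k in range(-5, 16, 2)]    # 2^-5 ... 2^15
    return list(itertools.product(gammas, costs))

def five_folds(n_objects, n_folds=5, seed=0):
    """Shuffle object indices and deal them into n_folds disjoint subsets."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]
```

With these ranges the grid holds 10 values of γ and 11 values of C, that is 110 independent jobs, each of which can be cross-validated on the same five subsets.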

[Figure: run times (8 min, 6 hrs, 11 hrs, 6 days) marking the GRID domain and the need for HPC; N = 10^7, BoK = 10^5]

Results (II)

Table columns: Sample, Parameters, BoK, Algorithm, etot, ctot

Experiment (1): SDSS photometric parameters + photo redshift; BoK: BPT plot + Kewley’s line

Experiment (2): BoK: BPT plot + Kewley’s line

Experiment (3): BoK: BPT plot + Heckman’s + Kewley’s lines

Algorithms: MLP, SVM

[Table cells, in the order extracted: ~76% ~54%; ~74% ~55%; etyp1 ~95%, etyp2 ~98%; etyp1 ~82%, etyp2 ~86%; ~100%; ~98%; ~80%; ~78%; ~92%; ~89%]

Cavuoti, D’Abrusco, Longo, 2008, in preparation.

Some of the ongoing work

1. Improving the accuracy of the photometric parameters by deriving them directly from the images (in coll. with R. De Carvalho and F. La Barbera).

• Porting the 2Dphot pipeline to the GRID to reprocess 73,000 SDSS images (R. de Carvalho, F. La Barbera, VONeural; DONE)

• Integration of 2Dphot with VO_Extractor (in progress)

• Implementation of AGN-Spec (VO tool for automatic analysis of AGN spectra, in coll. with P. Benvenuti, P. Rafanelli et al.; just started)

2. Parallelization of algorithms (P-SVM and P-PPS): in progress

3. Automatic extraction of BoKs from VO data sets (algorithm development stage)

Conclusions

• For most DM applications GRID is better than HPC

• Integration of VObs and GRID can be done and works & …..

Integration to Guy’s talk

• Euro-VO can buy an airline almost for free (Alitalia)
