
(Laboratorio di) Sistemi Informatici Avanzati

Giuseppe Manco

SEARCH

Approaches to large-scale networks

Heavy tails and power laws (at large scales):

•  strong local heterogeneity, lack of structure

•  basis for preferential attachment models

Local clustering/structure (at small scales):

•  local neighborhoods have a "geometric" structure

•  starting point for small-world models, which start from a global "geometry" and add random links to obtain a small diameter while preserving the geometry at the local level

Problems of interest

•  What are the basic statistics (degree distributions, clustering coefficients, diameter, etc.)?

•  Are there natural groupings/partitions?

•  How do networks evolve and respond to perturbations?

•  How do dynamic processes (search, diffusion, etc.) unfold on networks?

•  How can we perform classification, regression, ranking, etc.?

Observations on real networks

•  Diameter: constant

•  Clustering coefficient: constant

•  Degree distribution: power law

Applications: Search

•  Small world: the network can be navigated

•  Preferential attachment: there are a few large hubs

•  How can this information be exploited?

Singular Value Decomposition

•  A matrix decomposition technique

•  Based on spectral analysis

•  Many applications

The Singular Value Decomposition (SVD)

•  Any m x n matrix A can be decomposed as the product of three matrices: $A = U \Sigma V^T$

•  p: rank of A

•  U, V: orthogonal matrices ($U^T U = I_m$, $V^T V = I_n$) containing, respectively, the left and right singular vectors of A

•  $\Sigma$: diagonal matrix containing the singular values of A in non-increasing order, $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_p \ge 0$

Layer interpretation of the SVD

$A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$

The layers appear in order of decreasing importance
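The layer view can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `svd_layers` and the small test matrix are illustrative choices, not part of the original slides.

```python
import numpy as np

def svd_layers(A):
    """Return the rank-1 layers sigma_i * u_i * v_i^T of A, most important first."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]

# Hypothetical 2 x 3 matrix used only for illustration.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

layers = svd_layers(A)
reconstruction = sum(layers)                        # adding all layers gives back A
layer_norms = [np.linalg.norm(L) for L in layers]   # Frobenius norm of layer i is sigma_i
```

Because each layer is a singular value times an outer product of unit vectors, its Frobenius norm equals the singular value, which is one way to read "decreasing importance".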

Singular vectors, intuition

In the figure, the blue circles represent m points in Euclidean space. The SVD of the m x 2 data matrix consists of:

–  First (right) singular vector: direction of maximum variance

–  Second (right) singular vector: direction of maximum variance after removing the projection of the data onto the first singular vector

•  σ1: measures how much of the variance of the data is "captured/explained" by the first singular vector

•  σ2: measures how much of the variance of the data is "captured/explained" by the second singular vector

Low Rank Approximation

•  Truncate the SVD to its first k terms: $A_k = U_k \Sigma_k V_k^T$

•  k: rank of the decomposition

•  $U_k$, $V_k$: matrices with orthonormal columns containing, respectively, the first k left and right singular vectors of A

•  $\Sigma_k$: diagonal matrix containing the first k singular values of A

Properties

•  Even for matrices with positive data, the SVD is mixed in sign (A: +, U: +/-, V: +/-)

•  U and V are dense

•  Uniqueness: although several algorithms exist, they all produce the same (truncated) SVD of A, up to the signs of the singular vector pairs

•  Property: keeping the first k singular values of A gives the best rank-k approximation of A with respect to the Frobenius norm
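The best-approximation property can be illustrated with a small experiment. This is a sketch under stated assumptions: NumPy is available, the helper `truncated_svd` is a hypothetical name, and the random matrices are arbitrary test data.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k truncation A_k and all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))        # arbitrary data matrix

A2, s = truncated_svd(A, 2)
err = np.linalg.norm(A - A2, "fro")    # Frobenius error of the truncation
expected = np.sqrt(np.sum(s[2:] ** 2)) # norm of the discarded singular values

# Any other rank-2 matrix should do no better than A2.
B = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
other_err = np.linalg.norm(A - B, "fro")
```

The truncation error equals the square root of the sum of the squared discarded singular values, and an arbitrary competitor of the same rank cannot beat it.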

Low Rank Approximation

•  Use $A_k$ in place of A

•  Full SVD: A (m x n) = U (m x m) Σ (m x n) V^T (n x n)

•  Truncated SVD: A_k (m x n) = U_k (m x k) Σ_k (k x k) V_k^T (k x n)

Summary of the Truncated SVD

•  Pros:

–  Using A_k in place of A generally improves the performance of mining algorithms

–  Noise reduction isolates the essential components of the data matrix

–  Best rank-k approximation

–  A_k is unique and optimal with respect to the Frobenius norm

•  Cons:

–  Storage (U_k and V_k are dense)

–  U and V are hard to interpret because they are mixed in sign

–  A good truncation point k is hard to determine

Applications of the SVD to data analysis

•  Dimensionality reduction: the truncated SVD provides a compressed representation of high-dimensional data (with many attributes).

•  SVD compression minimizes the loss of information, measured by the Frobenius norm

•  If the original data contain noise, dimensionality reduction can be regarded as a noise-attenuation technique

•  If we fix k=2 or k=3, the rows of U can be plotted. The plot makes a visual interpretation of the dataset structure possible

SVD and Latent Semantic Indexing

KDD'09 Faloutsos, Miller, Tsourakakis P1-23

CMU SCS

SVD - Example

•  A = U Λ V^T - example (rows are documents; columns are the terms data, inf., retrieval, brain, lung):

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U =
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27

Λ =
9.64 0
0 5.29

V^T =
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
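The numbers in this example can be reproduced with a few lines of code. This is a minimal sketch assuming NumPy; singular vectors are only defined up to sign, so absolute values are taken before comparing with the slide.

```python
import numpy as np

# Document-term matrix from the example (columns: data, inf., retrieval, brain, lung).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the numerically non-zero singular values (the rank of A).
r = int(np.sum(s > 1e-10))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

singular_values = np.round(s, 2)
doc_to_concept = np.round(np.abs(U), 2)     # signs are arbitrary, so compare magnitudes
term_to_concept = np.round(np.abs(Vt), 2)
```

The matrix has rank 2, so only two singular values survive, one per block of the matrix.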

•  The two concepts found by the decomposition are a CS-concept (first column of U, first row of V^T) and an MD-concept (second column of U, second row of V^T)

•  U: document-to-concept affinity

•  Λ: importance of each concept


•  V^T: term-to-concept affinity

Dimensionality reduction

SVD - Interpretation #2

•  Λ gives the variance ('spread') of the data along each axis: 9.64 is the spread along the v1 axis

•  Discard the low-variance components

•  More details

•  Q: how exactly is dim. reduction done?

•  A: set the smallest singular values to zero (in the example, 5.29)

•  Keeping only the first singular value gives the rank-1 approximation A ≈ u1 (9.64) v1^T, with

u1 =
0.18
0.36
0.18
0.90
0
0
0

v1^T = 0.58 0.58 0.58 0 0

•  The product is approximately

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
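The truncation step can be sketched as follows, assuming NumPy; the variable names are illustrative. Setting the smallest singular value to zero is the same as keeping only the first k columns of U, the first k singular values, and the first k rows of V^T.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                                            # number of concepts to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k approximation of A

# The approximation error is exactly the discarded singular value.
error = np.linalg.norm(A - A_k, "fro")
```

With k = 1 the first block survives unchanged and the second block is replaced by zeros, so the error is the second singular value, about 5.29.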

Applications of the SVD to data analysis

•  Clustering: in the space of the truncated SVD transformation, the relationships among points are more evident, and the clustering process benefits directly

•  Applications to clustering:

–  Clustering in the new space

–  Direct use of the properties of the SVD

–  Spectral clustering: the points lying in the cone around the first axis (product with the first axis <1/2) are grouped into one cluster

–  Those with the same property with respect to the second axis are grouped into the second cluster, and so on

Groupings, blocks

•  The SVD finds non-zero 'blobs' in a data matrix, that is, 'communities' (bi-partite cores, here)

•  In the example matrix, the two blobs are rows 1-4 with columns 1-3, and rows 5-7 with columns 4-5

Applications of the SVD to data analysis

•  Ranking:

–  Each row of U can be represented as a point in k-dimensional space. Suppose we draw an arrow from the origin to each of the points

–  The angle (cosine) between two vectors denotes the correlation between the points

–  Objects that are highly correlated or highly uncorrelated with the other points tend to lie near the origin

–  Points far from the origin correspond to objects that exhibit an unusual correlation with other objects

–  Points close to the origin are less "interesting"

–  Objects can be ranked by their distance from the origin
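A possible reading of the ranking rule is sketched below, assuming NumPy. The function name `rank_by_distance_from_origin` is hypothetical, and using the unscaled rows of U_k as coordinates follows the wording of the slide; scaling them by the singular values would be an equally plausible variant.

```python
import numpy as np

def rank_by_distance_from_origin(A, k):
    """Rank the rows of A by the norm of their coordinates in the truncated U."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coords = U[:, :k]                       # one k-dimensional point per row of A
    dist = np.linalg.norm(coords, axis=1)   # distance of each point from the origin
    order = np.argsort(-dist)               # most "interesting" rows first
    return order, dist

# Document-term matrix from the SVD example in these slides.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

order, dist = rank_by_distance_from_origin(A, k=2)
```

On this matrix the fourth document (index 3) ends up farthest from the origin, followed by the sixth and the fifth.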

Properties (A)

$U^T U = I_m$

$V^T V = I_n$

$\Lambda^k = \mathrm{diag}(\sigma_1^k, \sigma_2^k, \dots, \sigma_r^k)$

$A^T = V \Lambda U^T$

Properties (B)

•  Document-document similarity: $A A^T = U \Lambda^2 U^T$

•  Term-term similarity: $A^T A = V \Lambda^2 V^T$

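Properties (B)

•  Moreover: $(A^T A)^k = V \Lambda^{2k} V^T$

•  For large k (and $\sigma_1 > \sigma_2$) the first term dominates: $(A^T A)^k \approx v_1 \sigma_1^{2k} v_1^T$

•  $v_1$: eigenvector of $A^T A$ associated with its largest eigenvalue, $\sigma_1^2$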

Properties (C)

•  For any vector v not orthogonal to $v_1$: $(A^T A)^k v \approx \lambda v_1$, where $\lambda = \sigma_1^{2k} (v_1^T v)$ is a scalar

–  Consequence: an iterative procedure for computing eigenvectors
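The iterative procedure is commonly known as power iteration. A minimal sketch, assuming NumPy; the function name, the fixed iteration count, and the random starting vector are illustrative assumptions.

```python
import numpy as np

def top_right_singular_vector(A, iterations=200, seed=0):
    """Approximate v1 and sigma1 by repeatedly applying A^T A to a random vector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])     # random start, almost surely not orthogonal to v1
    M = A.T @ A
    for _ in range(iterations):
        v = M @ v                           # one more power of A^T A
        v /= np.linalg.norm(v)              # renormalize to avoid overflow
    sigma1 = np.linalg.norm(A @ v)          # since A v1 = sigma1 u1 and u1 has unit norm
    return v, sigma1

# Document-term matrix from the SVD example in these slides.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

v1, sigma1 = top_right_singular_vector(A)
```

Convergence speed depends on the gap between the two largest singular values; here the ratio of their squares is 28/93, so a few dozen iterations are already enough.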

Properties (C)

•  The system $Ax = b$ admits the solution $x = V \Lambda^{-1} U^T b$ (inverting only the non-zero singular values)
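The formula can be sketched in code as follows, assuming NumPy. The function name `solve_via_svd` and the tolerance are illustrative; when A is not square or not invertible, the same formula returns the least-squares solution of minimum norm.

```python
import numpy as np

def solve_via_svd(A, b, tol=1e-10):
    """Return x = V diag(1/sigma) U^T b, inverting only the non-zero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    inv_s = np.zeros_like(s)
    nonzero = s > tol
    inv_s[nonzero] = 1.0 / s[nonzero]
    return Vt.T @ (inv_s * (U.T @ b))

# Square, invertible system: 2x + y = 3, x + 3y = 5.
x = solve_via_svd(np.array([[2.0, 1.0], [1.0, 3.0]]), np.array([3.0, 5.0]))

# Overdetermined system: the result is the least-squares solution.
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 1.0, 0.0])
x_ls = solve_via_svd(T, t)
```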

Properties (C)

•  Consequently:

$A v_1 = \sigma_1 u_1$

$u_1^T A = \sigma_1 v_1^T$

$A^T A v_1 = \sigma_1^2 v_1$

PCA and MDS

Principal Components Analysis (PCA)

•  Given $\{X_i\}_{i=1,\dots,n}$ with $X_i$ real vectors, find the k-dimensional subspace P and the mapping $Y_i = P X_i$ such that Variance(Y) is maximal (or Error(Y) is minimal)

•  SVD of the covariance matrix $C = X X^T$

Multidimensional Scaling (MDS)

•  Given $\{X_i\}_{i=1,\dots,n}$ with $X_i$ real vectors, find the k-dimensional subspace P and the mapping $Y_i = P X_i$ such that $\mathrm{Dist}(Y_i - Y_j) = \mathrm{Dist}(X_i - X_j)$ (that is, distances are preserved)

•  SVD of the Gram matrix $G = X^T X$
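The PCA recipe can be sketched with an SVD of the centered data, assuming NumPy. Note one assumption that differs from the slide notation: here each sample is a row of X, so the covariance is proportional to X^T X after centering; the function name `pca_project` and the toy data are illustrative.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto their first k principal directions."""
    Xc = X - X.mean(axis=0)                       # center each attribute
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                           # directions of maximum variance
    Y = Xc @ components.T                         # k-dimensional coordinates
    return Y, components

# Toy data lying exactly on the line y = 2x.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Y, components = pca_project(X, k=1)
```

For data on a line, a single component recovers the direction of the line (up to sign) and no variance is lost.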

LSI/SVD and power laws

•  The largest eigenvalues of the adjacency matrix of a scale-free graph follow a power-law distribution.

Case study: Social Network Analysis

•  Goal: identify properties and relationships among the members of al Qaeda

•  The dataset, provided by Marc Sageman, contains information on 366 members of the terrorist organization as of early 2004

•  Attributes:

al Qaeda Dataset

•  Relationship graph: 366 nodes and 2171 edges.

•  The maximum degree of the graph is 44, while the average degree is 6.44.

•  The diameter is 11

•  Bavelas-Leavitt Centrality: ratio between the sum of the geodesic paths having the given node as source/destination and the sum of the geodesic paths of the entire dataset

al Qaeda Dataset: Link Analysis

•  Analysis of the 366 x 366 adjacency matrix

–  Contacts and relationships among the members

Plot of the low rank (3) SVD of al Qaeda members using only relationship attributes

•  4 clusters

•  Hambali plays a connecting role

•  bin Laden is not the extreme element of the cluster that identifies the leadership

•  Cluster labels in the plot: "Algerians", "South East Asian", "Leaders and core Arabs"

The University of Arizona group have analyzed this dataset and used multidimensional scaling to produce a picture of the group’s connectivity (Jie Xu, personal communication, 2004). This shows that the dataset is naturally clustered into 13 almost-cliques, with about 60 members not allocated to a single clique.

A graph of the links within al Qaeda is maintained by Intelcenter and can be viewed on their web site (www.intelcenter.com/linkanalysis.html). While the graph is compendious, it is hard to extract actionable information from it.

4 Analysis using matrix decompositions

4.1 Using the links between individuals

In this section we consider only the results of enhanced link analysis, that is we consider the graph of relationships among al Qaeda members. The base dataset is a 366 × 366 adjacency matrix for the graph that includes: acquaintances, family, friends, relations, and contacts after joining.


Figure 3: SVD plot of al Qaeda members using only relationship attributes.

Figure 3 shows a 3-dimensional (truncated) view of the relationships among al Qaeda members extracted from their links. The most obvious fact is that there is a clear division into three (perhaps four) clusters. This radial pattern is typical: those points at the extremities represent individuals with the most interesting connections to the rest of the group. Many members are either connected

SVD and centrality

Centrality measures the importance of a node

•  degree centrality: number of links of a node

•  betweenness centrality: number of paths that contain it

•  closeness centrality: potential for independent communication

•  eigenvector centrality: connections to high-degree nodes, iteratively

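Eigenvector centrality

•  The centrality of node j is proportional to the sum of the centralities of its neighbors: $x_j \propto \sum_{i=1}^{n} a_{ij} x_i$

•  Reformulated in matrix form (for a symmetric adjacency matrix A), this is $A x = \sigma x$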
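The fixed-point equation suggests a simple iterative computation. The following is a minimal pure-Python sketch; the function name, the iteration count, and the small test graph (a triangle 0-1-2 with a pendant node 3 attached to node 0) are illustrative assumptions. It assumes a connected, non-bipartite undirected graph, since on bipartite graphs this plain iteration can oscillate.

```python
def eigenvector_centrality(adj, iterations=100):
    """Iterate x_j <- sum_i a_ij * x_i, renormalizing at every step."""
    n = len(adj)
    x = [1.0] * n                                   # start from uniform scores
    for _ in range(iterations):
        new = [sum(adj[i][j] * x[i] for i in range(n)) for j in range(n)]
        norm = sum(v * v for v in new) ** 0.5
        x = [v / norm for v in new]                 # keep the vector at unit length
    return x

# Hypothetical undirected graph: triangle 0-1-2 plus node 3 linked only to node 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
scores = eigenvector_centrality(adj)
```

Node 0 gets the highest score, nodes 1 and 2 are tied, and the pendant node 3 is last: being linked to a central node helps, but less than being part of the triangle.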

THE WEB

Structure of the Web

Figure 13.8: A directed graph of Web pages.

Source: David Easley, Jon Kleinberg, Networks, Crowds, and Markets, Cambridge University Press (2010)

Structure of the Web

Figure 13.6: A directed graph with its strongly connected components identified.

Source: David Easley, Jon Kleinberg, Networks, Crowds, and Markets, Cambridge University Press (2010)

Bow-Tie

•  Components in the figure: IN, SCC, OUT, tendrils, disconnected components

Source: A. Broder et al., Graph structure in the Web. In Proc. WWW, pages 309–320, 2000.

The search problem

•  Type a term into the Google search page

–  Examine the results

•  Is the first result the one you expected?

•  How did Google compute the result?

Search

•  A hard problem

–  Information retrieval: searching large repositories on the basis of keywords

–  Keywords are limited and inexpressive, and suffer from:

   •  Synonymy (multiple ways to say the same thing: home, dwelling)

   •  Polysemy (multiple meanings for the same term: Jaguar, Apple)

–  Different authoring styles

   •  Experts, novices, etc.

–  Extreme dynamism of the web

–  Shift

   •  Scarcity -> abundance

Hubs, Authorities

•  A problem of links

–  Why is Wikipedia at the top of the suggested results?

•  Many pages contain the term "social networks"

–  Why is Wikipedia more relevant?

•  Would you cite Wikipedia as a reference?

Hubs, authorities

•  Voting

Figure 14.1: Counting in-links to pages for the query “newspapers.”

A List-Finding Technique. It’s possible to make deeper use of the network structure than just counting in-links, and this brings us to the second part of the argument that links are essential. Consider, as a typical example, the one-word query “newspapers.” Unlike the query “Cornell,” there is not necessarily a single, intuitively “best” answer here; there are a number of prominent newspapers on the Web, and an ideal answer would consist of a list of the most prominent among them. With the query “Cornell,” we discussed collecting a sample of pages relevant to the query and then let them vote using their links. What happens if we try this for the query “newspapers”?

What you will typically observe, if you try this experiment, is that you get high scores for a mix of prominent newspapers (i.e. the results you’d want) along with pages that are going to receive a lot of in-links no matter what the query is: pages like Yahoo!, Facebook, Amazon, and others. In other words, to make up a very simple hyperlink structure for purposes of

Source: David Easley, Jon Kleinberg Networks, Crowds, and Markets, Cambridge University Press (2010)

Hubs, authorities

Figure 14.3: Re-weighting votes for the query “newspapers”: each of the labeled page’s new score is equal to the sum of the values of all lists that point to it.

of links to on-line newspapers; for “Cornell,” one can find many alumni who maintain pages with links to the University, its hockey team, its Medical School, its Art Museum, and so forth. If we could find good list pages for newspapers, we would have another approach to the problem of finding the newspapers themselves.

In fact, the example in Figure 14.1 suggests a useful technique for finding good lists. We notice that among the pages casting votes, a few of them in fact voted for many of the pages that received a lot of votes. It would be natural, therefore, to suspect that these pages have some sense where the good answers are, and to score them highly as lists. Concretely, we could say that a page’s value as a list is equal to the sum of the votes received by all pages that it voted for. Figure 14.2 shows the result of applying this rule to the pages casting votes in our example.

•  Compiling lists

–  Each page "represents" the pages that point to it

Hubs, authorities

•  Iterative improvement

•  Normalization

•  Authorities

–  The pages that represent the end-points

•  Hubs

–  The pages that represent many other pages (and whose vote consequently counts a lot)

Authority value (Kleinberg’s algorithm)

•  Given a node i pointed to by the nodes k, l, m: $a_i = h_k + h_l + h_m$, that is, $a_i$ is the sum of $h_j$ over all j such that the edge (j, i) exists

•  Generalizing over all nodes: $a = A^T h$

Hub value (Kleinberg’s algorithm)

•  Symmetrically, for the 'hubness' of a node i pointing to the nodes n, p, q: $h_i = a_n + a_p + a_q$, that is, $h_i$ is the sum of $a_j$ over all j such that the edge (i, j) exists

•  Generalizing over all nodes: $h = A a$

HITS algorithm

•  In conclusion, we are looking for two vectors h and a such that $a = A^T h$ and $h = A a$
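The two coupled equations can be solved by alternating the updates and normalizing after each one. This is a minimal pure-Python sketch, not a definitive implementation: the function name `hits`, the edge-list representation, the fixed iteration count, and the tiny test graph are all illustrative assumptions, and the graph is assumed to contain at least one edge.

```python
def hits(edges, n, iterations=50):
    """Alternate a = A^T h and h = A a on a directed graph given as (source, target) pairs."""
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: a_i is the sum of h_j over all edges (j, i).
        auths = [0.0] * n
        for src, dst in edges:
            auths[dst] += hubs[src]
        norm = sum(a * a for a in auths) ** 0.5
        auths = [a / norm for a in auths]
        # Hub update: h_i is the sum of a_j over all edges (i, j).
        hubs = [0.0] * n
        for src, dst in edges:
            hubs[src] += auths[dst]
        norm = sum(h * h for h in hubs) ** 0.5
        hubs = [h / norm for h in hubs]
    return hubs, auths

# Hypothetical graph: pages 0 and 1 act as lists, pages 2 and 3 are the pages they cite.
edges = [(0, 2), (0, 3), (1, 2)]
hubs, auths = hits(edges, n=4)
```

Page 2 is cited by both lists and becomes the top authority; page 0 cites both targets and becomes the top hub. Substituting one update into the other shows that a converges to the principal eigenvector of $A^T A$ and h to that of $A A^T$, which ties the algorithm back to the SVD properties above.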
