Property Testing of Data Dimensionality Robert Krauthgamer ICSI and UC Berkeley Joint work with Ori Sasson (Hebrew U.)

Upload: solomon-vale

Post on 31-Mar-2015



Property Testing of Data Dimensionality

Robert Krauthgamer

ICSI and UC Berkeley

Joint work with Ori Sasson (Hebrew U.)


Data dimensionality

• The analysis of large volumes of complex data is required in many disciplines.
• Such data is frequently represented by vectors in a high-dimensional vector space.
  – E.g., sequential biological data (genome, proteins)
  – A common method of representing data is feature extraction (vector representation in feature space).
    • Image databases
    • Text corpora (via latent semantic indexing)


The issue of dimension

• High-dimensional data is difficult to work with.
  – The complexity of many operations depends heavily (e.g. exponentially) on the dimension.
• Real-life data often adheres to a low-dimensional structure,
  – which makes it possible to reduce the dimension effectively.
  – E.g. in $R^2$: [figure not reproduced]
• Dimensionality reduction: mapping into a low-dimensional space (while preserving most of the data “structure”)
  – Trades accuracy for computational efficiency


Dimensionality reduction methods

• Singular Value Decomposition (SVD) [linear structure]
  – I.e., low-rank matrix approximation.
  – Practical variants: Multidimensional Scaling (MDS), Principal Component Analysis (PCA)
• Low-distortion embedding in low-dimensional $l_p$ [metric structure]
  – Of any Euclidean metric [Johnson-Lindenstrauss’86]
  – Of any metric [Bourgain’86, Linial-London-Rabinovich’93].
• Other methods, e.g. combinatorial feature selection [Charikar-Guruswami-Kumar-Rajagopalan-Sahai’00]


Property testing framework

Relaxed decision problems: determine whether
• The input has a property P, or
• The input is far from having the property P, i.e. it needs to be modified significantly in order to have the property.

Goal: obtain
• Randomized algorithms (correct with probability at least 2/3),
• Whose complexity is low (does not depend on input size).

Trivial example: testing whether an input list contains only 0’s or at least an $\epsilon$-fraction of its entries are not 0, with $O(1/\epsilon)$ queries.
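The trivial example can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function name `zeros_tester`, the `query` callback, and the constant 2 in the sample size are assumptions chosen so that an $\epsilon$-far list survives all samples with probability at most $(1-\epsilon)^{2/\epsilon} \le e^{-2} < 1/3$.

```python
import math
import random


def zeros_tester(query, n, eps, rng=random):
    """One-sided tester for "the list contains only 0's".

    query(i) returns entry i of the (hypothetical) input list of length n.
    The number of queries depends on eps only, not on n.
    """
    samples = math.ceil(2 / eps)  # assumed constant; (1 - eps)^(2/eps) <= e^-2
    for _ in range(samples):
        i = rng.randrange(n)      # uniform random index
        if query(i) != 0:
            return False          # found a witness: reject
    return True                   # an all-zero list is always accepted


if __name__ == "__main__":
    zeros = [0] * 1000
    ones = [1] * 1000
    print(zeros_tester(zeros.__getitem__, len(zeros), 0.1))  # always True
    print(zeros_tester(ones.__getitem__, len(ones), 0.1))    # always False
```

An all-zero input is never rejected (one-sided error), while an input with at least an $\epsilon$-fraction of nonzero entries is rejected with probability at least 2/3.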


Testing data dimensionality

Given a data set S, determine whether
• S has dimension at most d (for a fixed d), or
• S is $\epsilon$-far from having this property,
  – i.e. at least an $\epsilon$-fraction of the entries of (a representation of) S needs to be modified for S to have the property.

Technicalities:
• Interpretation of dimension (i.e. type of structure)
• Representation of S
  – Assume it affects both query mechanism and farness measure


Our results – Testing for linear structure

• Algorithm for testing whether vectors $v_1,\ldots,v_n$ lie in a linear (or affine) subspace of dimension at most $d$.
  – Algorithm queries $O(d/\epsilon)$ vectors.
  – Holds for every vector space V.
• Algorithm for testing whether a matrix $A_{m \times n}$ has rank at most $d$.
  – Algorithm queries the entries of an $O(d/\epsilon) \times O(d/\epsilon)$ submatrix.
  – Holds for matrices over any field F.

(Both algorithms have one-sided error.)


Our results – Testing for metric structure

• Testing whether $v_1,\ldots,v_n \in l_2^m$ can be embedded into $l_2^d$
  – Isometrically: achieved by querying $O(d/\epsilon)$ vectors (corollary).
  – With distortion $\alpha$: requires querying $\Omega((n/\alpha)^{1/2})$ vectors.
  – With perturbation $\delta$: requires $\Omega(\min\{n^{1/2}, m/\log m\})$ queries.
• Testing whether vectors $v_1,\ldots,v_n \in l_1^m$ can be embedded isometrically into $l_1^d$ requires querying $\Omega(n^{1/4})$ vectors.

(Lower bounds are for algorithms with two-sided error.)


Our results – Testing metrics and norm

• Algorithm for testing whether a matrix $M_{n \times n}$ is the distance matrix of a $d$-dimensional Euclidean metric.
  – Algorithm queries the entries of an $O(d/\epsilon) \times O(d/\epsilon)$ submatrix.
  – Slight improvement over $O((d \log d)/\epsilon) \times O((d \log d)/\epsilon)$ of [Parnas-Ron’01].

• Algorithm for testing whether a vector has $l_p$-norm .
  – Algorithm queries O( log 1/) entries (with two-sided error).
  – Holds for any p and .
  – Allows testing the Frobenius norm of a matrix (such as the difference between a matrix and its low-rank approximation).


Property testing origins

• Introduced by [Rubinfeld-Sudan’96]
  – Testing algebraic properties of functions
• Many PCPs involve testing of encodings
  – E.g. low-degree polynomials, Hadamard code, long code
• Testing of combinatorial properties initiated by [Goldreich-Goldwasser-Ron’98]
  – They focused on graph properties (e.g. coloring).
  – Later works considered testing monotonicity of functions, satisfiability of formulas, regularity of languages, equality of distributions, clustering of Euclidean vectors, metric spaces etc.


Related work

• Property testing
  – Testing whether a distance matrix represents a tree metric, ultra-metric, or a low-dimensional Euclidean metric [Parnas-Ron’01].
  – Testing properties of Euclidean vectors, e.g. clustering [Alon-Dar-Parnas-Ron’00] and convexity [Czumaj-Sohler-Ziegler’00].
  – Testing various matrix properties, e.g. monotonicity [Newman-Fischer’01].
• Fast low-rank approximation (by sampling)
  – [Frieze-Kannan-Vempala’98, Achlioptas-McSherry’01]
  – Farness measure considers the magnitude of the changes.
  – Sampling depends on input size (unless input is “uniform”).


Other related work

• Finite point criterion for $l_p^d$-embeddability.
  – Namely, the minimum $f_p(d)$ such that (any) metric space embeds in $l_p^d$ iff every $f_p(d)$ of its points do.
  – For $p = 2$, [Menger’28] showed $f_2(d) = d+3$.
  – For $p = 1$ and any $d > 2$, [Bandelt-Chepoi-Laurent’98] showed $f_1(d) \ge d^2-1$, but it is not known whether $f_1(d)$ is finite.
• Our results for $l_1$ and $l_2$ spaces establish somewhat similar bounds for a relaxed version of this question.


Algorithm for testing linear structure

Thm 1. Testing whether a set of vectors S lies in a subspace of dimension at most $d$ can be achieved with $O(d/\epsilon)$ queries.

The algorithm.
1. Query $O(d/\epsilon)$ vectors of S uniformly at random.
2. Accept if (and only if) the queried vectors lie in a linear (or affine) subspace of dimension at most $d$.
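The two steps above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it handles the linear (not affine) case only, works over the rationals using exact `Fraction` arithmetic, and the names `rank` and `subspace_tester` as well as the sample size `8 * (d + 1) / eps` are assumptions made for the example.

```python
import math
import random
from fractions import Fraction


def rank(vectors):
    """Rank over the rationals, via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def subspace_tester(S, d, eps, rng=random):
    """Accept iff a random sample of S spans a subspace of dimension <= d."""
    # Assumed sample size: O(d / eps); d + 1 keeps it positive when d = 0.
    t = math.ceil(8 * (d + 1) / eps)
    sample = [S[rng.randrange(len(S))] for _ in range(t)]
    return rank(sample) <= d


if __name__ == "__main__":
    line = [(i, 2 * i, 3 * i) for i in range(100)]  # dimension 1
    curve = [(1, i, i * i) for i in range(100)]     # any 3 points span R^3
    print(subspace_tester(line, 1, 0.5))
    print(subspace_tester(curve, 1, 0.5))
```

Because a sample of a low-dimensional set is itself low-dimensional, the tester never rejects a set of dimension at most $d$; only the rejection of far inputs is probabilistic.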


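Proof of testing linear structure

Proof (correctness).
The algorithm always accepts a data set S of dimension at most $d$.
Let S be $\epsilon$-far from having dimension at most $d$.
• Consider sampling the $O(d/\epsilon)$ vectors one by one.
• Let $X_t$ be the dimension of the subspace spanned by the first $t$ sampled vectors.
• Lemma 1. $\Pr[X_{t+1} = X_t + 1 \mid X_t \le d] \ge \epsilon$.
• Proof. Since S is $\epsilon$-far from having dimension at most $d$, the subspace spanned by the first $t$ sampled vectors (which has dimension at most $d$) contains less than a $(1-\epsilon)$-fraction of the vectors of S, so the next sampled vector falls outside it with probability at least $\epsilon$.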


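A technical lemma

• Lemma 2. Let $0 \le X_0 \le X_1 \le X_2 \le \ldots$ be random variables. If $\Pr[X_{t+1} = X_t + 1 \mid X_t \le d] \ge \epsilon$ for all $t \ge 0$, then for $t^* = 8d/\epsilon$ we have $\Pr[X_{t^*} \le d] < 1/3$.
• Proof sketch. As long as $X_t \le d$, $X_t$ grows at least as fast as a binomial variable with success probability $\epsilon$. Then $E[X_{t^*}] \ge 8d$, and using Chernoff, $\Pr[X_{t^*} \le d] < 1/3$.

So with probability at least 2/3 we have $X_{t^*} > d$ and the algorithm rejects (for S that is $\epsilon$-far from dimension $d$).

This completes the proof of Thm 1.
  – A similar approach allows testing whether a matrix is low-rank and whether a distance matrix is that of a low-dimensional Euclidean metric (slight improvement over [Parnas-Ron’01]).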
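As a sanity check on the constant in Lemma 2, the tail probability can be computed exactly in the extreme case where each step increases $X_t$ independently with probability exactly $\epsilon$, so that $X_{t^*}$ is binomial. The parameter values below are illustrative assumptions, not values from the slides, and the helper name `binomial_tail_at_most` is hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_tail_at_most(trials, p, d):
    """Exact Pr[Bin(trials, p) <= d], computed with rational arithmetic."""
    p = Fraction(p)
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(d + 1)
    )


if __name__ == "__main__":
    # (d, eps) pairs chosen so that t* = 8d/eps is an integer.
    for d, eps in [(1, Fraction(1, 10)), (3, Fraction(1, 4)), (5, Fraction(1, 2))]:
        t_star = int(8 * d / eps)
        tail = binomial_tail_at_most(t_star, eps, d)
        print(d, eps, t_star, float(tail), tail < Fraction(1, 3))
```

In each of these cases the tail falls well below 1/3, which suggests the constant 8 is comfortable rather than tight.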


Lower bound for $l_1$

Thm 2. Testing whether $n$ vectors in $l_1^m$ can be embedded isometrically into $l_1^d$ requires querying $\Omega(n^{1/4})$ vectors.

• Consider first algorithms with one-sided error.
• Suppose $d=1$, $m=2$.
• Consider the following point set S: [figure not reproduced]
• S is 1/24-far from $l_1^d$-embeddability because every “” cannot be embedded in the line.


Lower bound for $l_1$ with one-sided error

• Assume there is an algorithm that queries $t \ll n^{1/2}$ points.
• WLOG it sees a “random” sample of S.
• With high probability $1 - O(t^2/n) = 1 - o(1)$:
  – The sample contains no two points at distance $O(1)$ from each other.
  – Then the sample is $l_1^d$-embeddable (since there is a geodesic line going through all its points).
  – And so the algorithm must accept S.
• Contradiction (since S is 1/24-far).


Lower bound for $l_1$ with two-sided error

• We (randomly) create from S another data set S’ such that
  – S’ embeds in the line (WHP $1-o(1)$).
  – The algorithm’s view of S differs from its view of S’ with probability $o(1)$,
  – So the probabilities of accepting S and of accepting S’ differ by $o(1) \ll 1/3$. Contradiction.
• Here (to prove Thm 2):
  – Create S’ by choosing $r \ll n^{1/2}$ random points from S and duplicating each one $n/r$ times.
  – Then a sample of $\ll r^{1/2}$ points from S is almost the same as one from S’.

[Figure: these inputs look the same.]
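One way to see why a sample of fewer than about $r^{1/2}$ points cannot tell S’ from S is a birthday-style union bound. The following is a sketch under the assumption (not spelled out on the slide) that the $t$ queries are drawn uniformly and independently, with replacement; the sample of S’ looks unusual only if two queries land on copies of the same original point, and each pair of queries does so with probability $1/r$.

```latex
\Pr[\text{two queries hit copies of the same point of } S']
  \;\le\; \binom{t}{2}\cdot\frac{1}{r}
  \;\le\; \frac{t^{2}}{2r}
  \;=\; o(1)
  \qquad \text{for } t \ll r^{1/2}.
```

Since $r \ll n^{1/2}$, this leaves room for $t$ up to roughly $n^{1/4}$, which matches the bound stated in Thm 2.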


Lower bound for $l_2$ with perturbation

Thm 3. Testing whether $n$ vectors in $l_2^m$ can be perturbed by $\delta$ to be $l_2^d$-embeddable requires $\Omega(\min\{n^{1/2}, m/\log m\})$ queries.

• Let $d=0$ (i.e. testing if the vectors are in a ball of radius $\delta$).
• Consider a sphere of radius $\delta' = (1+1/(2n))\delta$ in $l_2^m$.
• Let S’ consist of $n$ random vectors from this sphere.
• Let S consist of $n/2$ random vectors from the sphere and their $n/2$ antipodal vectors ($-v$).


Lower bound for $l_2$ with perturbation

• WHP, the vectors of S’ are in a ball of radius $\delta$
  – By concentration of measure, WHP they are nearly orthogonal, e.g. the distance between every two is roughly $\sqrt{2}\,\delta'$.
  – In fact, WHP they are all at distance $< \delta$ from their “center of mass”, as claimed.

[Figure: S’, concentration of measure; a YES instance.]


Lower bound for $l_2$ with perturbation

• S is 1/2-far from being in a ball of radius $\delta$
  – Because the distance between antipodal vectors in S is $2\delta'$.
• Assume the algorithm queries $\ll n^{1/2}$ vectors
  – WHP the view of S and S’ is the same.
  – So the probabilities of accepting S and S’ should differ by $o(1)$.
  – Contradiction. This proves Thm 3.

[Figure: S, antipodal vectors; a NO instance.]
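The geometry behind the two constructions can be explored numerically. The sketch below is only an illustration under simplifying assumptions: it uses the unit sphere in place of the sphere of the theorem, draws random directions by normalizing Gaussian vectors, and picks the sizes `n = 20` and `m = 2000` arbitrarily; the helper names are hypothetical.

```python
import math
import random


def random_unit_vector(m, rng):
    """Uniform random direction in R^m (normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pairwise_distances(vectors):
    """Euclidean distances between all unordered pairs of vectors."""
    return [
        math.dist(u, v)
        for i, u in enumerate(vectors)
        for v in vectors[i + 1:]
    ]


rng = random.Random(0)  # fixed seed so the run is reproducible
n, m = 20, 2000         # illustrative sizes, with m much larger than n
S_prime = [random_unit_vector(m, rng) for _ in range(n)]

# Random vectors in high dimension are nearly orthogonal, so every
# pairwise distance is close to sqrt(2) times the sphere radius.
dists = pairwise_distances(S_prime)

# Distance of each vector from the "center of mass" of the set.
center = [sum(col) / n for col in zip(*S_prime)]
radii = [math.dist(v, center) for v in S_prime]

# An antipodal pair, as in S, is at distance exactly twice the radius.
antipodal = math.dist(S_prime[0], [-x for x in S_prime[0]])

if __name__ == "__main__":
    print(min(dists), max(dists), math.sqrt(2))
    print(max(radii))
    print(antipodal)
```

In runs like this one, every vector of the random set sits slightly closer to the center of mass than the sphere radius, while an antipodal pair is a full diameter apart, so no ball of radius smaller than the sphere radius can contain both.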


Lower bound for $l_2$ with distortion

Thm 4: Testing whether $n$ vectors in $l_2^m$ can be embedded in $l_2^d$ with distortion $\alpha$ requires $\Omega((n/\alpha)^{1/2})$ queries.

• Let $d=1$ (embedding into a line with distortion $\alpha$).
• Consider a unit circle with $10\alpha$ equally spaced points.
• Let S consist of the points from $n/(10\alpha)$ (far apart) parallel copies of this circle in $R^3$.


Lower bound for $l_2$ with distortion

• S is 1/10-far from having an embedding with distortion $\alpha$
  – Since embedding each cycle into the line requires distortion $> \alpha$.

[Figure: points of a circle; a NO instance.]


Lower bound for one-sided error

• Assume the algorithm queries $\ll (n/\alpha)^{1/2}$ points of S
  – WLOG it sees a “random” sample of S.
  – WHP, this sample contains at most one point from each circle,
  – And then it can be embedded with distortion $< \alpha$ into the line (by mapping each point to its circle’s center).
  – So WHP the algorithm must accept S. Contradiction.

[Figure: sampled points; a YES instance.]


Lower bound for two-sided error

• We create S’ by choosing one point from each circle of S and duplicating it $10\alpha$ times.
  – Then S’ can be embedded with distortion $< \alpha$ into the line.
  – WHP the view of $\ll (n/\alpha)^{1/2}$ points from S is the same as from S’.
  – So the probabilities of accepting S and S’ should differ by $o(1)$.
  – This proves Thm 4.


Future research

• Testing whether
  – A matrix spectral norm $\|A\|_2$ is small.
  – A distance matrix represents a metric (triangle inequality).
  – A distance matrix represents an $l_1^d$-metric.
  – A distance matrix represents an approximate $l_2^d$-metric.
• Testing with a farness measure that depends on magnitude
  – à la [Frieze-Kannan-Vempala’98, Achlioptas-McSherry’01]