Mining and Querying Multimedia Data

Fan Guo

Sep 19, 2011

Committee Members: Christos Faloutsos (Chair), Eric P. Xing, William W. Cohen, Ambuj K. Singh (University of California at Santa Barbara)

What is this talk about?

Going Multimedia

Beyond Text and Images

Thesis Outline

Mining

M1: MultiAspectForensics

M2: QMAS

Querying

Q1: Click Models

Q2: C-DEM

Q3: BEFH

Mining Multimedia Data (1)

•Labeling Satellite Imagery

[Figure panels: Input, Output]

Mining Multimedia Data (2)

•Network Traffic Log Analysis

Mining Multimedia Data (3)

•Web Knowledge Base

Mining Multimedia Data

•Data-driven problem solving over multiple modes at a non-trivial scale.

Querying Multimedia Data (1)

•A querying system provides an interface for retrieving the records that best match users’ information needs.

•Here is another example:

https://www.facebook.com/pages/browser.php

•May be transformed into a graph search problem

Querying Multimedia Data (2)

•Calibrate ranking from user feedback

Data

•Large-Scale Heterogeneous Networks

[Figure: network traffic records with three modes: IP-source, IP-destination, and Port. Example addresses: 198.129.1.2, 131.243.2.10, 131.243.2.5, 128.3.10.40, 128.3.1.50. Example ports: 80 (HTTP), 993 (IMAP).]

Goal

•How can we automatically detect and visualize patterns within a local community of nodes?

Preliminary

•Tensor for high-order data representation
▫3 data modes: source IP, destination IP, port #
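
As a minimal sketch of this representation, assume each log record is a (source IP, destination IP, port) triple and each tensor entry counts matching packets. The helper name `build_tensor` and the sample records (including which source talks to which destination) are illustrative assumptions, not data from the thesis.

```python
import numpy as np

def build_tensor(records):
    """Build a dense 3-mode count tensor from (src, dst, port) records."""
    # Map each distinct value along a mode to an integer index.
    srcs = sorted({r[0] for r in records})
    dsts = sorted({r[1] for r in records})
    ports = sorted({r[2] for r in records})
    src_idx = {s: i for i, s in enumerate(srcs)}
    dst_idx = {d: i for i, d in enumerate(dsts)}
    port_idx = {p: i for i, p in enumerate(ports)}

    tensor = np.zeros((len(srcs), len(dsts), len(ports)))
    for s, d, p in records:
        tensor[src_idx[s], dst_idx[d], port_idx[p]] += 1
    return tensor, (srcs, dsts, ports)

# Hypothetical records: the pairings below are made up for illustration.
records = [
    ("198.129.1.2", "128.3.10.40", 80),
    ("131.243.2.10", "128.3.10.40", 80),
    ("131.243.2.5", "128.3.1.50", 993),
    ("198.129.1.2", "128.3.10.40", 80),
]
tensor, modes = build_tensor(records)
```

A dense array is only practical for toy inputs; at the scale discussed in the talk a sparse representation would be the more likely choice.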

Approach

Data Decomposition

•The canonical polyadic (CP) decomposition factors a tensor into a sum of rank-1 tensors

•A special case, for 2-mode data (matrices), is the singular value decomposition (SVD)
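
The "sum of rank-1 tensors" form can be illustrated with a short NumPy sketch. It only rebuilds a tensor from given factors; fitting the factors (for example by alternating least squares) is not shown. The function name `cp_reconstruct`, the factor matrices, and the weights are assumptions made for illustration.

```python
import numpy as np

def cp_reconstruct(weights, A, B, C):
    """Sum of R rank-1 tensors: sum_r weights[r] * a_r (outer) b_r (outer) c_r."""
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)

rng = np.random.default_rng(0)
R = 2  # number of rank-1 components
A, B, C = rng.random((4, R)), rng.random((3, R)), rng.random((5, R))
weights = np.array([2.0, 0.5])
X = cp_reconstruct(weights, A, B, C)

# The same tensor built explicitly, one rank-1 term at a time.
X_explicit = sum(
    weights[r] * np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)

# Two-mode special case: the SVD writes a matrix as a sum of rank-1 matrices.
M = rng.random((4, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = sum(s[r] * np.outer(U[:, r], Vt[r, :]) for r in range(len(s)))
```

Each column triple `(A[:, r], B[:, r], C[:, r])` plays the role that a pair of singular vectors plays in the matrix case.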

Attribute Plot

How to compute?

Spike Detection

•Iteratively search for spikes in the histogram plot along each data mode.
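
One way to picture a single pass of this search is sketched below. It assumes the histogram is taken over the entries of one factor vector and that a "spike" is any bin holding at least `min_count` entries. The bin count, the threshold, and the function names are hypothetical choices, not the criteria used in the thesis.

```python
import numpy as np

def find_spikes(scores, bins=10, min_count=5):
    """Return (low, high, count) for histogram bins holding >= min_count entries."""
    counts, edges = np.histogram(scores, bins=bins)
    return [
        (float(edges[i]), float(edges[i + 1]), int(counts[i]))
        for i in np.flatnonzero(counts >= min_count)
    ]

def members(scores, low, high):
    """Indices of the entries whose score lies in [low, high]."""
    scores = np.asarray(scores)
    return np.flatnonzero((scores >= low) & (scores <= high))

# Made-up factor-vector entries: six nodes share nearly the same score.
scores = np.array([0.02, 0.07, 0.13, 0.42, 0.42, 0.42, 0.42, 0.42, 0.42,
                   0.68, 0.93, 0.97])
spikes = find_spikes(scores)
low, high, count = spikes[0]
spike_nodes = members(scores, low, high)
```

Repeating the same pass over the factor vector of each data mode yields the node groups that are examined in the next step.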

Substructure Discovery

•Focus on part of the data within the spike

•Categorize into a few subgraph patterns

Pattern 1: Generalized Star (1)

•IP-src’s sending packets to the same IP-dst & the same port

•Typical client/server system

•A ‘bar’ in a carefully reordered tensor

Pattern 1: Generalized Star (2)

•Extending along “Port-Number”

•Port numbers used in packets from the same IP-src to the same IP-dst

•Port scanning or P2P

Pattern 2: Generalized Bipartite-Core (1)

•A ‘plane’ in a carefully reordered tensor

•IP-src’s sending packets to the same IP-dst’s & the same port

•Clients talking to a shared server pool

Pattern 2: Generalized Bipartite-Core (2)

•A ‘plane’ in a carefully reordered tensor

•IP-src’s sending packets over multiple ports to one IP-dst

•A multi-purpose Windows server

M1: MultiAspectForensics

•Automatically detects novel patterns in heterogeneous networks

QMAS: Mining Satellite Imagery (1)

•Low-labor labeling

[Figure panels: Input, Output]

QMAS: Mining Satellite Imagery (2)

•Identification of Representatives and Outliers

QMAS: Mining Satellite Imagery (3)

•Linear in time & space

Web Search

User Clicks as Quality Feedback

[Figure axis: # of total clicks]

Motivation

•Leverage the signal from click data to improve search ranking.

Click Through Rate (CTR)

•CTR = # of Clicks / # of Impressions

Position Bias

Relevance of Web Document

•Relevance = CTR @ Position 1 = (# Clicks @ Position 1) / (# Impressions @ Position 1)
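
The two ratios above can be computed from a click log with a few counters. The sketch below assumes each log entry is a (document, position, clicked) tuple; the log contents and the name `ctr_by_position` are made up for illustration.

```python
from collections import defaultdict

def ctr_by_position(impressions):
    """impressions: iterable of (doc_id, position, clicked) tuples."""
    clicks = defaultdict(int)
    shown = defaultdict(int)
    for doc_id, position, clicked in impressions:
        shown[(doc_id, position)] += 1
        clicks[(doc_id, position)] += int(clicked)
    # CTR = # of clicks / # of impressions, kept separately per position.
    return {key: clicks[key] / shown[key] for key in shown}

# Hypothetical log: the same document shown at position 1 and at position 3.
log = [
    ("doc_a", 1, True), ("doc_a", 1, False), ("doc_a", 1, True), ("doc_a", 1, True),
    ("doc_a", 3, False), ("doc_a", 3, False), ("doc_a", 3, True), ("doc_a", 3, False),
]
ctr = ctr_by_position(log)
relevance_doc_a = ctr[("doc_a", 1)]  # relevance read off as CTR at position 1
```

In this toy log the same document earns a lower CTR at position 3 than at position 1, which is the position bias that pooled CTR would hide.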

Problem Definition

•Estimate the relevance of web documents given clicks and their positions.

Design Goals / Constraints

•Scalable: single-pass, easy to parallelize.

•Incremental: real-time updates possible.

•Accurate: consistent with past and future observations.

Approach

User Behavior Model

Last Clicked Position

Empirical Results

•Click data after pre-processing
▫110K distinct queries, 8.8M query sessions.

•Training time: <6 mins

•Online update:
▫Bump impression and click counters
▫No data retention required
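
A counter-based online update of this kind might look like the sketch below. It is a simplified illustration, not necessarily the estimator used in the thesis. It assumes that positions after the last clicked position were not examined, and that a session without clicks counts every position as examined. The class name `ClickCounters` and the sample sessions are hypothetical.

```python
from collections import defaultdict

class ClickCounters:
    """Per (query, document) impression and click counters, updated per session."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def update(self, query, ranked_docs, clicked_positions):
        """clicked_positions holds the 1-based ranks clicked in this session."""
        clicked = set(clicked_positions)
        # Assumption: nothing below the last click was examined; a session
        # with no clicks counts every position as examined.
        last_examined = max(clicked) if clicked else len(ranked_docs)
        for position, doc in enumerate(ranked_docs[:last_examined], start=1):
            self.impressions[(query, doc)] += 1
            if position in clicked:
                self.clicks[(query, doc)] += 1

    def relevance(self, query, doc):
        """Clicks divided by counted impressions, or None if never counted."""
        shown = self.impressions.get((query, doc), 0)
        return self.clicks.get((query, doc), 0) / shown if shown else None

counters = ClickCounters()
counters.update("q", ["d1", "d2", "d3"], [2])  # d3 is below the last click
counters.update("q", ["d1", "d2", "d3"], [])   # no click: all three counted
counters.update("q", ["d2", "d1", "d3"], [1])  # only d2 is counted
```

Each session touches only a handful of counters and can then be discarded, which matches the "no data retention" property listed above.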

•Higher log-likelihood indicates better quality.
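
To make the metric concrete, the sketch below computes an average log-likelihood under the assumption that the model outputs a click probability for each displayed result and that clicks are scored as independent Bernoulli outcomes. The exact evaluation protocol of the talk is not given in this transcript, so the function name, the clipping constant `eps`, and the sample numbers are illustrative.

```python
import math

def average_log_likelihood(predicted_click_probs, observed_clicks, eps=1e-12):
    """Mean Bernoulli log-likelihood of observed clicks under predicted probabilities."""
    total = 0.0
    for p, clicked in zip(predicted_click_probs, observed_clicks):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += math.log(p) if clicked else math.log(1.0 - p)
    return total / len(observed_clicks)

observed = [1, 0, 0, 1]
sharp = average_log_likelihood([0.9, 0.1, 0.2, 0.8], observed)
vague = average_log_likelihood([0.5, 0.5, 0.5, 0.5], observed)
```

The predictions that track the observed clicks (`sharp`) score closer to zero than the uninformative ones (`vague`), which is the sense in which a higher value is better.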

•27% accuracy in prediction

•2% improvement over ICM, the baseline model

•Position-bias visualized

[Figure legend: Ground Truth, DCM]

Scaling to Terabytes

•265 TB of data, 1.15B document relevance results, wall-clock running time of about 3 hours

Q1: Click Models

•A statistical, position-bias-aware approach to leveraging click data for better ranking.

•Click models are incremental, more accurate than the baseline, and scale to almost petabyte-scale data.

Q2: C-DEM

•A flexible query interface for 3-mode data: images, genes, annotation terms.

[Figure: Images, Terms, Genes]

•Solution: random walk with restart on graphs.
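
A minimal sketch of random walk with restart, using power iteration on a small undirected graph. The graph layout, the node order, and the restart probability of 0.15 are assumptions for illustration, not the C-DEM configuration.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed, restart_prob=0.15,
                             tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at `seed`."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid dividing by zero
    transition = adjacency / col_sums      # column-stochastic matrix
    restart = np.zeros(adjacency.shape[0])
    restart[seed] = 1.0
    scores = restart.copy()
    for _ in range(max_iter):
        updated = (1 - restart_prob) * transition @ scores + restart_prob * restart
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Hypothetical graph. Nodes: 0=image_1, 1=image_2, 2=gene_a, 3=term_x, 4=term_y.
edges = [(0, 2), (0, 3), (1, 2), (1, 4)]
adjacency = np.zeros((5, 5))
for u, v in edges:
    adjacency[u, v] = adjacency[v, u] = 1.0

scores = random_walk_with_restart(adjacency, seed=3)  # query node: term_x
```

Ranking the image nodes by `scores` answers a query such as "which images are closest to term_x"; seeding from an image or a gene node instead gives the other query directions over the same graph.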

Q3: BEFH (1)

•Bayesian exponential family harmonium

•Deriving topical representations for multimedia corpora (e.g., video snapshots and captions)

[Figure panels: Input, Model]

Q3: BEFH (2)

[Figure panels: Validation – Synthetic Data, Validation – TRECVID Data; both annotated “Better Quality”]

Conclusion

•Data-driven research under the theme of pattern mining and similarity querying.

•An array of practical tasks addressed:
▫Internet traffic surveillance (M1)
▫Satellite image analysis (M2)
▫Web search (Q1)
▫…

Thank You!

•http://www.cs.cmu.edu/~fanguo/dissertation/
