wei cheng 1, xiaoming jin 1, and jian-tao sun 2 intelligent data engineering group, school of...
![Page 1: Wei Cheng 1, Xiaoming Jin 1, and Jian-Tao Sun 2 Intelligent Data Engineering Group, School of Software, Tsinghua University 1 Microsoft Research Asia 2](https://reader035.vdocuments.mx/reader035/viewer/2022062221/56649e025503460f94aec4d7/html5/thumbnails/1.jpg)
Wei Cheng¹, Xiaoming Jin¹, and Jian-Tao Sun²
¹ Intelligent Data Engineering Group, School of Software, Tsinghua University
² Microsoft Research Asia
ICDM 2009, Miami
Outline
Motivation & Problem
Our Solution
Experiments
Related Work
Summary and Future Work
Motivation
Multidimensional data are everywhere:
Time series: stock data, data collected from sensor monitors
Feature vectors extracted from images or texts
……
Similarity query on multidimensional data is important in:
Data mining
Databases
Information retrieval
Similarity query is challenging when the data is incomplete
Data incompleteness happens when:
Sensors do not work properly
Certain features are missing from particular feature vectors
……
[Figure: example rows of sensor data, a text vector, and an image vector that contain missing elements, shown alongside a query]
To process a similarity query, imputation is necessary, i.e., “completing” the missing data by filling in specific values.
Dimension incomplete data
Dimension incomplete data satisfies:
(a) At least one of its data elements is missing;
(b) The dimension of the missing data element cannot be determined.
E.g.
Observed data: (3, 6)
But we know the complete data should have three dimensions.
The missing element might be in the first, second, or third dimension.
Causes of dimension incompleteness
Dimension incompleteness happens when:
Data is lost while the order is used as the implicit dimension indicator
The dimension indicator itself may also be lost
……
Similarity query is more challenging when the dimension is incomplete
To measure the similarity between the query and a dimension incomplete data object, we should first recover the incomplete data.
Enumerating all combination cases? Time-consuming.
E.g. $X_{obs}$: (3, 6), lost one dimension
3 possible results after data recovery:
(X, 3, 6), (3, X, 6), (3, 6, X), where X is the imputed element
For an m-dimensional data object which has n elements missing, there will be $C_m^n$ cases to recover it.
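The enumeration described above can be sketched in a few lines of Python. This is only an illustration: the function name `recovery_versions` and the `MISSING` placeholder are hypothetical, and the sketch assumes the observed elements keep their original order.

```python
from itertools import combinations
from math import comb

MISSING = None  # hypothetical placeholder for an element that still has to be imputed


def recovery_versions(x_obs, m):
    """Yield every order-preserving placement of x_obs into m dimensions."""
    k = len(x_obs)
    for positions in combinations(range(m), k):
        version = [MISSING] * m
        for pos, value in zip(positions, x_obs):
            version[pos] = value
        yield tuple(version)


versions = list(recovery_versions((3, 6), 3))
print(versions)    # the three placements of the single missing element
print(comb(3, 1))  # C_m^n for m = 3, n = 1
```

The number of versions grows as $C_m^n$, which is why enumerating them for every data object quickly becomes impractical.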
Problem statement:
$$\text{PSQ-DID}(D, Q, r, c) = \{ X_{obs} \in D \mid P[\delta(Q, X) \le r] \ge c \}$$
Two assumptions:
The probability of using each recovery result is equal.
The missing values obey a normal distribution.
$$P[\delta(Q, X) \le r] = \frac{\sum_{X_{rv}} P[\delta(Q, X_{rv}) \le r]}{C_{|Q|}^{|X_{mis}|}}$$
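As a rough illustration of the two assumptions, the confidence can be estimated naively by averaging over all recovery versions. The sketch below is not the method from the talk: it uses Monte Carlo sampling, and it assumes (hypothetically) that every missing element is drawn independently from the same normal distribution with parameters `mu` and `sigma`. The function name `naive_confidence` is also an assumption.

```python
import random
from itertools import combinations


def naive_confidence(q, x_obs, r, mu, sigma, samples=2000, seed=0):
    """Estimate P[delta(Q, X) <= r], averaged over all recovery versions.

    Assumption: each missing element ~ Normal(mu, sigma), independently.
    """
    rng = random.Random(seed)
    m, k = len(q), len(x_obs)
    total, cases = 0.0, 0
    for positions in combinations(range(m), k):
        observed = dict(zip(positions, x_obs))
        hits = 0
        for _ in range(samples):
            dist2 = 0.0
            for i, q_i in enumerate(q):
                # Observed dimensions are fixed; missing ones are sampled.
                x_i = observed[i] if i in observed else rng.gauss(mu, sigma)
                dist2 += (q_i - x_i) ** 2
            hits += dist2 <= r * r
        total += hits / samples  # probability for this recovery version
        cases += 1
    return total / cases         # every recovery version weighted equally


print(naive_confidence((1, 4, 5, 6, 7), (2, 8, 7), r=4.0, mu=5.0, sigma=1.0))
```

The cost is proportional to the number of recovery versions, which motivates the pruning methods that follow.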
Efficient approach for PSQ-DID
A gradual refinement search strategy including two pruning methods:
Lower/upper bounds of confidence
Probability triangle inequality
Our Overall Query Process
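A gradual refinement strategy of this kind can be sketched as a filter-and-verify loop. The sketch below is an assumption about the general shape only: `confidence_lower`, `confidence_upper`, and `verify` are hypothetical callables that return a probability for a data object (with the query and radius already bound), and the order of the checks is not taken from the talk.

```python
def psq_did(database, c, confidence_lower, confidence_upper, verify):
    """Return the objects whose confidence is at least c."""
    results = []
    for x_obs in database:
        if confidence_upper(x_obs) < c:
            continue                   # dismissed: confidence cannot reach c
        if confidence_lower(x_obs) >= c:
            results.append(x_obs)      # accepted without verification
            continue
        if verify(x_obs) >= c:         # expensive check, undecided objects only
            results.append(x_obs)
    return results


# Toy usage with made-up probabilities for three objects.
lower = {"a": 0.9, "b": 0.0, "c": 0.2}
upper = {"a": 1.0, "b": 0.1, "c": 0.8}
exact = {"a": 0.95, "b": 0.05, "c": 0.6}
print(psq_did(["a", "b", "c"], 0.5, lower.get, upper.get, exact.get))
```

Only objects that neither bound can decide reach the expensive verification step.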
Lower and upper bounds of confidence
The missing part and the observed part of the dimension incomplete data are treated separately. Since we use Euclidean distance, we have:
$$\delta^2(Q, X_{rv}) = \delta^2(Q_{obs}, X_{obs}) + \delta^2(Q_{mis}, X_{mis})$$
Lower/upper bounds of the observed part, denoted by $\delta_{LB_{obs}}$ and $\delta_{UB_{obs}}$.
Lower/upper bounds of the missing part, denoted by $\delta_{LB_{mis}}$ and $\delta_{UB_{mis}}$.
$$\delta_{LB_{obs}}(Q, X_{obs}) = \min_{|Q_{obs}| = |X_{obs}|} \delta(Q_{obs}, X_{obs})$$
$$\delta_{LB_{mis}}(Q, X_{mis}) = \delta\bigl(\arg\min_{Q_{mis}} \{ \delta(Q_{mis}, E(X_{mis})) \mid |Q_{mis}| = |X_{mis}| \},\ X_{mis}\bigr)$$
E.g. $X_{obs} = (2, 8, 7)$, $Q = (1, 4, 5, 6, 7)$
$\delta^2_{LB_{obs}}(Q, X_{obs}) = (2-1)^2 + (8-6)^2 + (7-7)^2 = 5$
corresponding recovery version: $(2, x_1, x_2, 8, 7)$
For the imputed random variables $X_{mis} = \{x_1, x_2\}$, if the imputation policy uses the mean value of the two adjacent observed elements as the expectation of the imputed random variables, then
$\delta^2_{LB_{mis}}(Q, X_{mis}) = (4 - x_1)^2 + (5 - x_2)^2$, with $E(x_1) = E(x_2) = 5$,
corresponding to $X_{rv} = (2, 5, 5, 8, 7)$.
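The observed-part lower bound in this example is a minimum over all order-preserving ways of matching $X_{obs}$ to elements of $Q$. One way to compute it without enumerating every case is a small dynamic program. This is a sketch rather than the authors' algorithm: the name `observed_bounds_sq` is hypothetical, and it assumes the upper bound is the corresponding maximum.

```python
def observed_bounds_sq(q, x_obs):
    """Return (min, max) squared distance between x_obs and an
    order-preserving selection of len(x_obs) elements of q."""
    m, k = len(q), len(x_obs)
    if k > m:
        raise ValueError("x_obs cannot have more elements than q")
    inf = float("inf")
    # lo[i][j] / hi[i][j]: best cost of matching x_obs[:i] inside q[:j]
    lo = [[inf] * (m + 1) for _ in range(k + 1)]
    hi = [[-inf] * (m + 1) for _ in range(k + 1)]
    for j in range(m + 1):
        lo[0][j] = hi[0][j] = 0.0
    for i in range(1, k + 1):
        for j in range(i, m + 1):
            cost = (x_obs[i - 1] - q[j - 1]) ** 2
            # Either skip q[j-1] (a missing dimension) or match it to x_obs[i-1].
            lo[i][j] = min(lo[i][j - 1], lo[i - 1][j - 1] + cost)
            hi[i][j] = max(hi[i][j - 1], hi[i - 1][j - 1] + cost)
    return lo[k][m], hi[k][m]


# Lower bound 5 as in the example; 21 is the maximum under the same matching rule.
print(observed_bounds_sq((1, 4, 5, 6, 7), (2, 8, 7)))  # (5.0, 21.0)
```

The table has $(|X_{obs}| + 1)(|Q| + 1)$ cells, so the cost stays polynomial even when the number of recovery versions is large.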
Lower and upper bounds of confidence
Denoted by: $\delta^2_{LB}(Q, X)$, $\delta^2_{UB}(Q, X)$
We prove that
$$P[\delta(Q, X) \le r] \ge P[\delta^2_{UB_{mis}}(Q, X_{mis}) + \delta^2_{UB_{obs}}(Q, X_{obs}) \le r^2]$$
$$P[\delta(Q, X) \le r] \le P[\delta^2_{LB_{mis}}(Q, X_{mis}) + \delta^2_{LB_{obs}}(Q, X_{obs}) \le r^2]$$
Probability triangle inequality
Given a query $Q$ and a multidimensional data object $R$ ($|Q| = |R|$), for a dimension incomplete data object $X_{obs}$ whose underlying complete version is $X$, we have:
(1) $P[\delta(Q, X) \le r] \le P[\delta_{LB}(R, X) - \delta(Q, R) \le r]$
(2) $P[\delta(Q, X) \le r] \ge P[\delta_{UB}(R, X) + \delta(Q, R) \le r]$
$\delta_{LB}(R, X)$ and $\delta_{UB}(R, X)$: calculated in advance and stored in the database, $O(|X_{obs}|(|Q| - |X_{obs}|)^2)$
$\delta(Q, R)$: calculated during query processing, $O(|Q|)$
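A short derivation sketch of why a probability triangle inequality of this form holds. It uses only the ordinary triangle inequality for Euclidean distance and assumes $\delta_{LB}(R, X) \le \delta(R, X) \le \delta_{UB}(R, X)$ for every recovery version.

```latex
\begin{align*}
\delta(Q, X) &\ge \delta(R, X) - \delta(Q, R) \ge \delta_{LB}(R, X) - \delta(Q, R), \\
\delta(Q, X) &\le \delta(R, X) + \delta(Q, R) \le \delta_{UB}(R, X) + \delta(Q, R).
\end{align*}
% First line: the event {delta(Q,X) <= r} implies {delta_LB(R,X) - delta(Q,R) <= r}, so
%   P[delta(Q,X) <= r] <= P[delta_LB(R,X) - delta(Q,R) <= r].
% Second line: the event {delta_UB(R,X) + delta(Q,R) <= r} implies {delta(Q,X) <= r}, so
%   P[delta(Q,X) <= r] >= P[delta_UB(R,X) + delta(Q,R) <= r].
```

Because the bounds on $\delta(R, X)$ do not depend on the query, they can be prepared offline, leaving only $\delta(Q, R)$ to compute at query time.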
Experiments
Data sets:
Standard and Poor 500 index historical stock data (S&P500) (251 dimensions)
A new data set with 30 dimensions by segmenting the S&P500 data set, resulting in 4328 data objects.
Corel Color Histogram data (IMAGE): 68040 images, 32 dimensions
Dimension incomplete data set: randomly removing some dimensions of each data object.
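Building such a test set can be sketched as follows. The function name `drop_dimensions`, the rounding rule for the number of removed elements, and the use of a seeded generator are assumptions made for illustration, not details given in the talk.

```python
import random


def drop_dimensions(x, missing_ratio, rng=random):
    """Remove round(missing_ratio * len(x)) randomly chosen elements.

    The surviving elements keep their order, but their original positions
    are no longer recorded, which is what makes the result dimension
    incomplete.
    """
    n_missing = round(missing_ratio * len(x))
    removed = set(rng.sample(range(len(x)), n_missing))
    return [value for i, value in enumerate(x) if i not in removed]


rng = random.Random(0)
complete = list(range(30))  # stand-in for one 30-dimensional data object
incomplete = drop_dimensions(complete, 0.1, rng)
print(len(incomplete))      # 27
```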
Experiment Setup
Ground truth:
Similarity query results on the complete data
Performance measures:
Precision, recall, pruning power
Pruning power = $N_{definite} / N_{processed}$
$N_{processed}$: number of all data objects
$N_{definite}$: number of data objects judged as dismissals or search results by the pruner.
Query: 100 data objects randomly sampled from the data set
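The three measures can be written down directly. In this sketch the function names are hypothetical, and returning 1.0 for an empty result set or empty ground truth is an assumed convention.

```python
def query_quality(returned, ground_truth):
    """Precision and recall of a result set against the complete-data answer."""
    returned, ground_truth = set(returned), set(ground_truth)
    hits = len(returned & ground_truth)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(ground_truth) if ground_truth else 1.0
    return precision, recall


def pruning_power(n_definite, n_processed):
    """Fraction of processed objects that the pruner decides on its own."""
    return n_definite / n_processed


print(query_quality([1, 2, 3, 4], [2, 3, 5]))  # precision 0.5, recall 2/3
print(pruning_power(80, 100))                  # 0.8
```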
Effectiveness of probabilistic similarity query on dimension incomplete data
Query precision on S&P500 data set
Query recall on S&P500 data set
Effectiveness of probabilistic similarity query on dimension incomplete data
Query precision on IMAGE data set
Query recall on IMAGE data set
Effect of the confidence threshold
Missing ratio = 0.1; r = 60 for S&P500, r = 0.7 for IMAGE data
Confidence threshold vs precision-recall
Effectiveness of different pruners
Pruning power of probability triangle inequality
Pruning Power of Four Pruners
Pruner1: probability triangle inequality using the confidence lower bound
Pruner2: probability triangle inequality using the confidence upper bound
Pruner3: confidence lower bound
Pruner4: confidence upper bound
Missing ratio = 10%, c = 0.1, number of assistant objects = 20
Pruning power of four pruners
Comparison of query quality when neglecting naïve verification
For data objects that the four pruners cannot judge, Pos simply outputs them as query results; Neg, by contrast, judges them as dismissals.
c=0.1
Comparison of query quality
Performance analysis
Time cost
Related Work
Few research papers discuss similarity search on dimension incomplete data
Incomplete data
Recovery: D. Williams et al. [ICML’05], K. Lakshminarayan et al. [Applied Intelligence’99],…
Indexing: G. Canahuate et al. [EDBT’06], B. C. Ooi et al. [VLDB’98],…
Uncertain data
J. Pei et al. [SIGMOD’08], D. Burdick et al. [VLDB’05],…
Dimension incomplete data
Symbolic sequences: J. Gu et al. [DEXA’07]
Summary and Future Work
Problem:
Tackle the similarity query on a new uncertain form (dimension incomplete)
Solution:
Lower and upper bounds of confidence
So that we can avoid enumerating all $C_{|Q|}^{|X_{mis}|}$ recovery cases
Probability triangle inequality
Further boosts the performance of the query processing procedure
Future work
Other similarity measurements
Index dimension incomplete data
Many thanks!