Top k Knapsack Joins and Closure Early Results
Witold LITWIN & Thomas Schwarz
U. Paris Dauphine, France [email protected]
Santa Clara U., CA, [email protected]
Knapsack Join (KS-Join)
• A join whose condition is that the sum of the join attributes is at most some constant
• E.g., a father of 4 kids wishing to buy toys for at most 100 € in total
• Or a person wishing to buy a computer tower, a screen, a printer and a desk for at most 1000 €
• ….
Knapsack Join (KS-Join)
• Traditional join
– R1 Join R2 on c1 = c2
• KS-join
– R1 Join R2 on c1 + c2 ≤ C
• Syntax legal for FROM clause in Access, SQL Server…
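The two join conditions can be contrasted in a minimal Python sketch. It assumes each relation is simply a list of numeric join-attribute values; the function names `equi_join` and `ks_join` are illustrative and do not come from the slides. Both use the plain nested loop.

```python
def equi_join(r1, r2):
    """Traditional join: keep pairs whose join attributes are equal."""
    return [(c1, c2) for c1 in r1 for c2 in r2 if c1 == c2]


def ks_join(r1, r2, C):
    """KS-join: keep pairs whose join attributes sum to at most C."""
    return [(c1, c2) for c1 in r1 for c2 in r2 if c1 + c2 <= C]


r1 = [10, 40, 70]
r2 = [40, 60, 90]
print(equi_join(r1, r2))      # pairs with c1 = c2
print(ks_join(r1, r2, 100))   # pairs with c1 + c2 <= 100
```

Only the predicate changes; the cost of this naive evaluation stays proportional to the product of the table sizes.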
Top k Knapsack Join (KS-Join)
• The top k result tuples in descending order of the sum of the join attributes, i.e., those closest to the constant
• Usually, only the few items closest to the constant are of interest
• Select TOP 1 * from Toys T1, Toys T2, Toys T3, Toys T4
  Where T1.Price + T2.Price + T3.Price + T4.Price <= 100
  and T1.Id < T2.Id and T2.Id < T3.Id and T3.Id < T4.Id
  Order by T1.Price + T2.Price + T3.Price + T4.Price Desc;
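As a rough illustration of what the query asks for, here is a Python sketch of a top k self-KS-join. It assumes the Toys table is a list of `(id, price)` pairs with distinct numeric ids; `top_k_ks_self_join` is a hypothetical helper, not something from the slides. Enumerating id-ordered combinations plays the role of the `T1.Id < T2.Id < ...` conditions.

```python
import heapq
from itertools import combinations


def top_k_ks_self_join(toys, m, C, k):
    """Return up to k (total, ids) tuples with total <= C, largest totals first.

    toys: list of (toy_id, price) pairs with distinct ids (assumption).
    """
    # combinations() over the id-sorted list yields each id set exactly once,
    # in increasing id order, like the Id < Id conditions of the SQL query.
    candidates = (
        (sum(price for _, price in combo), tuple(tid for tid, _ in combo))
        for combo in combinations(sorted(toys), m)
    )
    feasible = (c for c in candidates if c[0] <= C)
    return heapq.nlargest(k, feasible, key=lambda c: c[0])


toys = [(1, 10), (2, 15), (3, 20), (4, 30), (5, 45)]
print(top_k_ks_self_join(toys, m=4, C=100, k=1))
```

This is still exhaustive enumeration, so it only shows the semantics of the query, not an efficient evaluation.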
Top k Knapsack Join (KS-Join)
• Top k Knapsack joins are of obvious interest
• How do DBMSs deal with them?
• Nested loop
– To the best of our knowledge
• Result: the execution time makes the SQL capability useless for a larger data set
– Consider our example for just 1000 toys to choose from
– FYI, a 1K-tuple table & a 3-way KS-join killed SQL Server
Our Goal: Optimizing Top k KS-Joins
• Algorithms provably faster than the usual nested loop
– Formulate the algorithm
– Prove the complexity, storage & processing costs
• KS-optimized Nested Loop
• Self-join Nested Loop
• Sort Merge
• KS-Join Indices
• Distributed KS-Join Indices
Our Goal: Optimizing Top k KS-Joins
• Early results
• Only for Top k KS-Joins (TkKS-Joins)
• Only the formal analysis as yet
• Many variants of TkKS-Join queries left for future work
– See the paper
Knapsack Problem (KP)
• NP-hard optimization problem
• Among the most studied
• Input:
– A set O of objects {o1,...on}
– An m-d subspace called knapsack K, with values bi, 1 ≤ i ≤ m, each representing the knapsack's capacity in the i-th dimension
– A vector c whose entry cj represents the benefit of object j if it is in the knapsack
Knapsack Problem (KP)
• Input (continued):
– The knapsack's constraints matrix with entries ai,j ; 1 ≤ i ≤ m ; 1 ≤ j ≤ n
– Each entry stores the constraint value for object j in dimension i (price, size, volume...).
• Output:
– A set O' of objects stored in the knapsack.
Knapsack Problem (KP)
• Binary variable xj ; xj ∈ {0, 1}, indicates the selection of the object j into the knapsack
– (xj = 1) for object j in and (xj = 0) otherwise
– xj is a 0–1 decision variable
Knapsack Problem (KP)
• Select the elements of O’ which maximize the total profit of the selected objects
• Provided that the knapsack constraints are met
Knapsack Problem (KP)
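• Formally, maximize: $\sum_{j=1}^{n} c_j x_j$
• Subject to: $\sum_{j=1}^{n} a_{i,j} x_j \le b_i$ for $1 \le i \le m$, with $x_j \in \{0, 1\}$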
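To make the formulation concrete, the following Python sketch solves a tiny instance by exhaustive search: it maximizes the total benefit of the selected objects while keeping, in every dimension i, the summed constraint values within the capacity bi. The function name `solve_mkp_brute_force` is ours, the capacities are assumed non-negative, and the approach is only viable for a handful of objects since it tries all 2^n selections.

```python
from itertools import product


def solve_mkp_brute_force(c, a, b):
    """Exhaustive 0-1 knapsack search.

    c[j]    : benefit of object j
    a[i][j] : constraint value of object j in dimension i
    b[i]    : capacity of the knapsack in dimension i (assumed >= 0)
    Returns (best_value, x) where x[j] is 1 if object j is selected.
    """
    n = len(c)
    best_value, best_x = 0, (0,) * n   # the empty knapsack is always feasible
    for x in product((0, 1), repeat=n):
        feasible = all(
            sum(a_i[j] * x[j] for j in range(n)) <= b_i
            for a_i, b_i in zip(a, b)
        )
        value = sum(c[j] * x[j] for j in range(n))
        if feasible and value > best_value:
            best_value, best_x = value, x
    return best_value, best_x


# 1-d instance where the benefit equals the constraint value (e.g. a price).
prices = [40, 35, 30, 50]
print(solve_mkp_brute_force(prices, [prices], [100]))
```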
Knapsack Problem (KP)
• The most frequently investigated case is the 1-d one
– I.e., m = 1, so i = 1 only
• Often, perhaps most often, the term KP implicitly designates this case.
• Frequently, in addition, one also sets every cj to cj = aj.
• We assume both conditions below
– unless we state otherwise
• The m-d case is then referred to, if needed, as multidimensional (MKP).
Knapsack Problem (KP)
• The general research orientation for KP and MKP
• Find a heuristic providing an acceptable approximate result
– For the largest possible data set
– In the fastest time
– Or in an acceptable time
– Given the necessary constraints on the computer system used.
KP / TkKS-Join
• Our research orientation follows the database approach
• Find an exact result
– For a reasonably practical problem subspace
– For database-size data
  • Say, 1K tuples per table at least
KP / TkKS-Join
• Find an exact result (continued)
– In the fastest time
– Or in an acceptable time
  • Minutes at most
– Given the necessary constraints on the computer system used
  • Mainly storage cost
Knapsack Problem (KP)
• Our reasonably practical problem subspace at present:
– As we already stated, cj = aj
– 1-d space
– Fixed # of objects for the knapsack
  • Join instead of closure
Knapsack Problem (KP)
• Our reasonably practical problem subspace at present (continued):
– One tuple = one potential selection
– One object = one tuple with distinct ID
  • No object is selected twice in a tuple for the knapsack
• Closure, MKP… left for the future
Nested loop TkKS-Join
• Basic cost for tables with n1…nm tuples
– O (n1*…*nm)
• To accelerate the computation, start with:
– Evaluation of the restrictions ti < C
– Evaluation of ti ≤ C − (Min1+…+Minj+…+Minm)
  • where the sum runs over every j ≠ i
  • The DBMS may easily maintain the Minj statistics
  • Cost can be O(m) or even O (1) only
– Idem for C ≥ Max1+…+Maxm ?
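The pruning by minima can be sketched as follows in Python. This is a minimal illustration under our own assumptions: each table is a non-empty list of numeric join-attribute values, and the names `prune_by_min` and `nested_loop_top_k` are hypothetical. A value of table i is dropped when even the cheapest partners from all other tables would push the sum above C.

```python
import heapq
from itertools import product


def prune_by_min(tables, C):
    """Keep in table i only the values t with
    t <= C - (sum of the minima of all the other tables)."""
    mins = [min(t) for t in tables]      # the Min_j statistics
    total_min = sum(mins)
    return [
        [t for t in table if t <= C - (total_min - mins[i])]
        for i, table in enumerate(tables)
    ]


def nested_loop_top_k(tables, C, k):
    """Plain nested loop: up to k (sum, tuple) results with sum <= C, largest first."""
    feasible = (
        (sum(combo), combo) for combo in product(*tables) if sum(combo) <= C
    )
    return heapq.nlargest(k, feasible)


tables = [[10, 60, 95], [20, 50], [30, 80]]
pruned = prune_by_min(tables, 100)
print(pruned)
print(nested_loop_top_k(pruned, 100, 1))
```

The pruning never removes a value that could appear in a qualifying tuple, so the nested loop returns the same result on the pruned tables while iterating over fewer combinations.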
Nested loop TkKS-Join
• Self-join of a table with its copies
• Since the KS-join is commutative, one may avoid duplicates
– E.g., if we have the tuple (t1, t2) then we should not have the tuple (t2, t1)
– In general, we need only one tuple from all its permutations
• The optimization cuts the complexity and computation time by half, at least
• Final word: we may have
– O (n1*…*nm /S), where S ≥ 1
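A small Python calculation illustrates the size of the saving for a self-join. It assumes, as our own reading, that keeping one tuple per set of permutations means enumerating the C(n, m) id-ordered combinations instead of all n^m tuples, so that S is the ratio of the two counts; the function names are illustrative.

```python
from math import comb


def nested_loop_tuples(n, m):
    """Tuples visited by the plain m-way nested loop over a table of n rows."""
    return n ** m


def self_join_tuples(n, m):
    """Tuples left when only one permutation of each set of m distinct rows is kept."""
    return comb(n, m)


for m in (2, 3, 4):
    full = nested_loop_tuples(1000, m)
    ordered = self_join_tuples(1000, m)
    print(m, full, ordered, round(full / ordered, 2))
```

Under this reading the factor is slightly above 2 for a 2-way self-join and grows roughly like m! for larger m, which is consistent with "by half, at least".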
Sort-Merge TkKS-Join
• 2-way join
[Figure: 2-way sort-merge example with C = 150]
Sort-Merge TkKS-Join
• Processing cost of 2-way TkKS-Join
– O (n1+n2) in general
– O ((n1+n2)/2) for self-join
• For m-way TkKS-Join
– O (nm*…*n3*(n2 + n1)) in general
– For self-join ?
• E.g., for 16K-tuple R1 and R2 tables, the m-way join accelerates 8K times
– 1 sec instead of 2+ hours
• See the paper
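One way to obtain a linear merge for the 2-way case is a two-pointer scan over the sorted inputs; the sketch below shows it for top 1 only. This is our own illustration, not necessarily the algorithm of the paper: it assumes the relations are lists of numeric values, and the name `sort_merge_top1` is hypothetical. After sorting, the scan touches each element of either list at most once, i.e., O(n1 + n2) steps, against n1*n2 for the nested loop. For n1 = n2 = 16K the ratio n1*n2/(n1+n2) is 8K, which matches the factor quoted above.

```python
def sort_merge_top1(r1, r2, C):
    """Return (total, x, y) with x in r1, y in r2 and the largest total <= C,
    or None if no pair qualifies."""
    a = sorted(r1)            # ascending
    b = sorted(r2)            # ascending, scanned from its largest value down
    best = None
    j = len(b) - 1
    for x in a:
        # As x grows, the largest partner that still fits can only shrink,
        # so j never moves back up.
        while j >= 0 and x + b[j] > C:
            j -= 1
        if j < 0:
            break             # no partner fits this x or any larger one
        candidate = (x + b[j], x, b[j])
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


print(sort_merge_top1([20, 70, 120], [40, 90, 130], 150))
```

Extending the scan to k > 1 requires keeping several candidates per step and is not shown here.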
KS-Join Index
• A relational table IKS with at least the attributes (C, t1.Id,…, tm.Id)
– Here C = t1.c+…+tm.c
– Also t1.Id <… < tm.Id
• Can also be seen as a materialized view
• Some or all ti.c should be useful as well
• E.g., for queries with additional restrictions on individual prices
KS-Join Index
• IKS should be implemented as a file sorted on C first
– Then, on other key or non-key attributes of interest
– E.g., a B-tree or trie…
• Storage cost:
– O (n1*…*nm) in general
– Half of it or less for copies of the same table
• 3-way indices may be in RAM
• More should be typically on flash or disk
KS-Join Index
• Processing cost
– O (Log p (n1*…*nm)) or less, according to the storage cost, where p is the tree fan-out
• Expected practical figures
– ms for RAM, e.g., 3-way KS-Join index for 1K-tuple tables
– under 10 ms for flash
– under 100 ms for the disk, e.g., 4-way KS-Join index for our 1K-tuple tables
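The idea of a KS-Join index can be sketched in Python with a sorted list standing in for the B-tree. This is a toy illustration under our own assumptions: a single table of `(id, price)` pairs with distinct numeric ids, joined with its copies, and hypothetical helper names `build_ks_index` and `top_k_from_index`. The binary search plays the role of the logarithmic tree descent.

```python
from bisect import bisect_right
from itertools import combinations


def build_ks_index(toys, m):
    """Materialize all (C, id1, ..., idm) entries with id1 < ... < idm, sorted on C."""
    entries = []
    for combo in combinations(sorted(toys), m):
        total = sum(price for _, price in combo)
        entries.append((total,) + tuple(tid for tid, _ in combo))
    entries.sort()
    return entries


def top_k_from_index(index, C, k):
    """Return up to k entries with sum <= C, closest to C first."""
    # (C, inf) sorts after every entry whose sum equals C and before any
    # entry with a larger sum, so pos is one past the last qualifying entry.
    pos = bisect_right(index, (C, float("inf")))
    return index[max(0, pos - k):pos][::-1]


toys = [(1, 30), (2, 45), (3, 60), (4, 80)]
index = build_ks_index(toys, 2)
print(top_k_from_index(index, 110, 2))
```

The lookup is cheap, but the index holds every combination, which is where the storage and maintenance costs discussed in these slides come from.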
KS-Join Index
• Maintenance cost
– High processing cost
– E.g., 1 insert into our 1K tables generates 1M new entries
• Main drawback of KS-Indices at present
• Efficient processing is an open problem
Composing KS-Join Indices
• TkKS-Join computation can compose existing KS-Indices
• m-way & n-way indices may speed up an (m+n)-way TkKS-Join
• Through the sort-merge algorithm applied to both indices
• Seconds may suffice for up to 6-way joins
– E.g., for our 1K relations
Scalable-Distributed TkKS-Join Index
• Speeds up the computation of even larger joins
• Using parallel distributed processing
• Dozens of seconds may suffice for an 8-way join
– Over our favorite 1K-tuple relations
– With two 4-way KS-Indices
– Each being distributed over 1K nodes
– Through, e.g., RP* SDDS
• Maintenance time speeds up as well
Scalable-Distributed TkKS-Join Index
• C = 900 ; arrows show nodes to join in parallel
[Figure: two rows of node values, 100 350 800 9900 and 10 50 450 700]
Conclusion
• TkKS-Joins are potentially useful
• Our optimizations may speed up the processing by orders of magnitude
• Queries with TkKS-Joins then become practical
• With all the usual disclaimers, the results appear ready for mainstream DBMSs
Future Work
• Deeper formal analysis
• Experiments
• More TkKS-Join query types
– See the paper
Thank You for Your Attention