Scientific Project: Databases for Multi-dimensional Data, Genomics and Modern Hardware
David Broneske, Gabriel Campero, Bala Gurumurthy, Marcus Pinnecke, Gunter Saake
October 18, 2019
1 / 37
Overview
• Concepts of this course
• Course of action (milestones, presentations)
• Overview of project topics & forming project teams
• How to perform literature research?
• Further lectures:
Academic writing (2 lectures)
2 / 37 Scientific Project David Broneske et al.
Organization
Scientific Project: Modules
Bachelor
• Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen, i.e., key and methodological competencies)
• 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous work
Master
• Module: Scientific Team Project (Inf, IngInf, WIF, CV)
DKE: Methods 2 or Applications
DE: Interdisciplinary Team Project, Specialization
• 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous work
Grade at the end of the course for the whole project team
Scientific Project: Prerequisite
• Successful programming test in C++/Java/Python
• 1h theoretical test in a seminar room (date and place to be discussed)
• Half of the team members have to pass the test
• Topics:
Some language specifics
General program understanding
Control flow understanding
• You can take all tests and have to pass at least one!
Scientific Project: Semester Plan
• Week of 14.10.2019: Introduction
• Week of 21.10.2019: Team formation
• Week of 28.10.2019: Final teams
• Week of 04.11.2019: –
• Week of 11.11.2019: Milestone I (MS I)
• Weeks of 18.11.2019 and 25.11.2019: –
• Week of 02.12.2019: Milestone II (MS II)
• Weeks of 09.12.2019 and 16.12.2019: –
• Winter break
• Week of 06.01.2020: Milestone III (MS III)
• Weeks of 13.01.2020 and 20.01.2020: –
• Week of 27.01.2020: Final milestone (MS Final)
• Aca-I and Aca-II: the two academic-writing lectures
Scientific Project: Milestones
• Milestone I - Topic, schedule, and team presentation & first results of literature research
• Milestone II - Concept & additional literature research
• Milestone III - Implementation & evaluation setup
• Milestone IV - Final presentation (wrap-up + evaluation results)
Concepts & Content
Lecture, Meetings & Presentation
Lecture & Presentation
• Time/Place: Friday, 13:00-15:00, G22A - 208
• Lectures covering the course content → all
• Presentation of main milestones (see time table) → each project team
Meetings (Exercise)
• Individual for each project team
• Time and room to be agreed in project teams!
• Presentation of all intermediate results/milestones (informal)
• Discussion, discussion, discussion . . .
Progress of Course
Deliveries
• 4 milestone presentations (main milestones)
• Each team member has to present at least once
• Reporting of (sub) milestones in exercises/meetings
• Written paper about literature research (technical report)
• Prototypical implementation
Deliveries and Grading (I)
Technical Report
• Delivery of report at a given time (deadline)
• Quality/Quantity of literature research
• Number of pages
• Quality of paper structure and evaluation
• Own contribution
Deliveries and Grading (II)
Presentation & Discussion
• Quality of scientific presentation (structure, references, time)
• Assessment regarding the content (e.g., results of particular milestones)
• Participation in discussions
Organization
• Strictness
• Communication (just-in-time answers, satisfying time constraints)
• Self-organization (sharing tasks, internal reporting of the current state of work, dealing with problems)
• Autonomous working
Deliveries and Grading (III)
• Grade consists of:
Presentations: 30%,
Implementation: 30%,
Paper: 30%,
Soft Skills: 10%
• Binding registration: Second Milestone
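The weighting above can be sanity-checked with a small sketch; the component grades below are made-up examples, not values from the course:

```python
# Weighted final grade: presentations 30%, implementation 30%,
# paper 30%, soft skills 10% (weights from the slide; grades are
# hypothetical examples on the German 1.0 (best) to 5.0 scale).
WEIGHTS = {"presentations": 0.3, "implementation": 0.3,
           "paper": 0.3, "soft_skills": 0.1}

def final_grade(grades: dict) -> float:
    # One grade per component; the weights must sum to 1.
    assert set(grades) == set(WEIGHTS)
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return round(sum(WEIGHTS[k] * g for k, g in grades.items()), 2)

print(final_grade({"presentations": 2.0, "implementation": 1.3,
                   "paper": 1.7, "soft_skills": 1.0}))  # prints 1.6
```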
Objectives & Qualification (I)
Acquired skills, specific to research
• Performing literature research
• Understanding and structured reviewing of scientific work
• Autonomous, solution-based reasoning on the research task (e.g., finding alternative solutions)
• How to ask? How to adapt a task (extend/reduce)?
• Academic writing
Objectives & Qualification (II)
Acquired skills, always needed
• Team management
• Project and time scheduling
• Presentation of results
• Flexibility regarding changing conditions
• Reasoning about solutions ("Why is this the best/not adequate ...")
Task & Time Management
Task Management
• Main milestones have to be finished in time
• (Sub) milestones are less strict (but don’t be sloppy)
• Pre-defined work packages ⇒ each project team
. . . defines sub work packages
. . . determines responsibilities for these packages (divide&conquer)
Time Management
• Planning of periods
• Regarding capacities and resources
• Considering other tasks and activities
• Reporting of delays immediately to project members!
Role Management
• Possible roles: team leader, developer, researcher, . . .
• Working together vs. individual responsibilities: design, implementation, testing, writing, . . .
• Delegate for important roles/work packages
• Assignment of (sub) tasks to role for each milestone
Topic & Project Teams
• Teams with 4 to 6 students
• Most tasks can be chosen once
• Projects
Theoretical part
• State of the art
• New ideas
Practical part
• Usually in C++, Java, or Python
• Prototypical implementation
• Evaluation part
Topic 1 - Fragment Skipper (Gridformation)
Intro: Data skipping in Hadoop
• A good data fragmentation is necessary for scalable processing in Big Data.
• Aggressive data skipping is the SOTA ML approach, using hierarchical agglomerative clustering with Ward's method.
• We developed a competitor based on deep reinforcement learning. How good is our solution? (Can we make it better?)
Your Task
• Literature research: aggressive data skipping, basics of deep reinforcement learning.
• Prototypical implementation of aggressive data skipping using Spark. Experimental evaluation and analysis comparing with our DRL solution.
• Long-term goal: a production-ready open-source AI-based partitioning tool & a paper for consideration in SysML.
Topic 2 - Similarity Skipper (Sim-Skip)
Intro: Learning to hash for high dimensional data management
• Top-k search in dense high-dimensional vectors requires specialized solutions. This is a very relevant application.
• When data is large, even parallel scans will be inefficient.
• Learned hashing is the current SOTA for managing this kind of data in image domains. How well does this technique perform on structured relational data?
Your Task
• Literature research: Hadoop file formats, deep hashing, triplet training of neural networks.
• Prototypical implementation of a deep hashing process using Tensorflow and Spark. Experimental evaluation and analysis.
• Long-term goal: A workshop paper in DEEM@SIGMOD.
Topic 3 - Graph Processing with ML (MLGQE: Malko)
Intro: Applications of deep learning for graph data processing
• Graphs are everywhere & graph technologies are a moving target. To date, still very heuristic-driven; ML is barely tested.
• Differential neural computers are already able to act like primitive graph query engines.
• How well do models like these fare nowadays for graph processing, and what needs to be improved?
Your Task
• Literature research: differential neural computers, graph nets, RDF3x.
• Prototypical implementation using existing libraries from Deepmind, selected datasets. Experimental evaluation and analysis.
• Long-term goal: a workshop paper in GRADES/AIDM@SIGMOD, or AIDB@VLDB.
Topic 4 - Cross device data parallel query processing
Intro
• Heterogeneous hardware is used to improve performance
• Work partitioning is problematic and requires training data for best device selection
• Further, data-parallel execution cannot be decided a priori
We’ve got
• Dispatcher implementation
• Iterative functional parallel executor
Your Task
• Literature Research: data-parallel, iterative and cross-device execution systems
• Understanding of morsel-driven parallelism
• Invention of a clever concept to perform data-parallel as well as functional-parallel execution across devices
• Implementation of your concept for missing database operators (e.g., hash join)
• Benchmarking the execution system
Topic 5 - Evaluating GPU-based Execution Models
Intro
• Query processing models for GPUs: block-at-a-time & compiled execution
• Block-at-a-time: executes one data bulk/function after the next in the query pipeline
• Compiled: generates execution code at runtime and executes the pipeline as a whole
• It is not clear which execution model is most suitable for GPUs
We’ve got
• GPU accelerated DBMS - CoGaDB
• Functionalities for block-at-a-time and compiled execution
• Concepts for evaluating the models on a stand-alone CPU
Your Task
• Literature Research: Execution models for heterogeneous hardware
• Understanding query execution in CoGaDB
• Developing an evaluation suite for GPUs
• Implementation of missing operations and the evaluation-suite constructs
• Critical analysis of the results and investigation of the resultant values
Topic 6 - Indepth Semi-Structured Data Storage
Intro
• Management of semi-structured data (e.g., JSON) has become daily business
• A wide spread of formats for physical storage of JSON-like data exists today (e.g., PlainText, BSON, CBor, Avro, Parquet, FlatBuffers, MessagePack, Smile, UBJSON, Ion, Carbon)
• A comprehensive comparison of design goals, concrete data models, limits and benefits, as well as operators, on an abstract, theoretical level is missing, though
We’ve got
• Carbon (Columnar Binary Json) implementation and specification as starting point
• Running theses on evaluation of Carbon vs BSON
Your Task
• Literature Research: semi-structured data model (a mathematical model for JSON-like data), and existing evaluations for the formats mentioned above → understanding on an abstract level
• Research for existing comparisons, design goals, and applications of the models above
• Creation of a common theoretical model and a common taxonomy for the formats above
• List missing evaluations and the top 5 formats, and come up with at least 5 further advanced and reasonable metrics not considered so far
• Implementation of an evaluation framework with the new metrics and missing evaluations as proof of concept for at least 5 formats (PlainText, BSON, UBJSON, and Carbon excluded)
Topic 7 - Order-By Queries in Elf
Intro
• Elf: multi-dimensional main memory index structure for efficient selections
• Stores data sorted in a multi-dimensional order
• Common data-intensive operator: Sorting
We’ve got
• Elf implementation in C++
Your Task
• Literature Research: Related index structures and sorting algorithms
• Understanding of the Elf and its optimization concepts
• Implementation of a sorting operator for Elf
• Performance evaluation against sequential scans
Topic 7 - Order-By Queries in Elf
[Figure: Elf tree structure over columns C1–C4, with DimensionLists (1)–(5) and tuple identifiers T1–T3]
Accelerating multi-column selection predicates in main-memory – the Elf approach
David Broneske (University of Magdeburg), Veit Köppen (University of Magdeburg), Gunter Saake (University of Magdeburg), Martin Schäler (Karlsruhe Institute of Technology)
Abstract—Evaluating selection predicates is a data-intensive task that reduces intermediate results, which are the input for further operations. With analytical queries getting more and more complex, the number of evaluated selection predicates per query and table rises, too. This leads to numerous multi-column selection predicates. Recent approaches to increase the performance of main-memory databases for selection-predicate evaluation aim at optimally exploiting the speed of the CPU by using accelerated scans. However, scanning each column one by one leaves tuning opportunities open that arise if all predicates are considered together. To this end, we introduce Elf, an index structure that is able to exploit the relation between several selection predicates. Elf features cache sensitivity, an optimized storage layout, fixed search paths, and slight data compression. In our evaluation, we compare its query performance to state-of-the-art approaches and a sequential scan using SIMD capabilities. Our results indicate a clear superiority of our approach for multi-column selection predicate queries with a low combined selectivity. For TPC-H queries with multi-column selection predicates, we achieve a speed-up between a factor of five and two orders of magnitude, mainly depending on the selectivity of the predicates.
I. INTRODUCTION
Predicate evaluation is an important task in current OLAP (Online Analytical Processing) scenarios [1]. To extract necessary data for reports, fact and dimension tables are passed through several filter predicates involving several columns. For example, a typical TPC-H query involving several column predicates is Q6, whose WHERE-clause is visualized in Fig. 1(a). We name such a collection of predicates on several columns in the WHERE-clause a multi-column selection predicate. Multi-column selection predicate evaluation is performed as early as possible in the query plan, because it shrinks the intermediate results to a more manageable size. This task has become even more important, when all data fits into main memory, because the I/O bottleneck is eliminated and, hence, a full table scan becomes less expensive.
In case all data sets are available in main memory (e.g., in a main-memory database system [2], [3], [4]), the selectivity threshold for using an index structure instead of an optimized full table scan is even smaller than for disk-based database systems. In a recent study, Das et al. propose to use an index structure for very low selectivities only, such as values under 2 % [5]. Hence, most OLAP queries would never use an index structure to evaluate the selection predicates. To illustrate this, we visualize the selectivity of each selection predicate for the TPC-H Query Q6 in Fig. 1(b). All of its single predicate selectivities are above the threshold of 2 % and, thus, would prefer an accelerated scan per predicate.
[Fig. 1. (a) WHERE-clause, (b) selectivity, and (c) response time of TPC-H query Q6 and its predicates Q6.1–Q6.3 on the Lineitem table, scale factor 100. The WHERE-clause reads: l_shipdate >= [DATE] and l_shipdate < [DATE] + '1 year' and l_discount between [DISCOUNT] - 0.01 and [DISCOUNT] + 0.01 and l_quantity < [QUANTITY]. All single-predicate selectivities lie above 2 %.]
However, an interesting fact neglected by this approach is that the accumulated selectivity of the multi-column selection predicates (1.72 % for Q6) is below the 2 % threshold. Hence, an index structure would be favored if it could exploit the relation between all selection predicates of the query. Consequently, when considering multi-column selection predicates, we achieve the selectivity required to use an index structure instead of an accelerated scan.
In this paper, we examine the question: How can we exploit the combined selectivity of multi-column selection predicates in order to speed up predicate evaluation? As a solution for efficient multi-column selection predicate evaluation, we propose Elf, an index structure that is able to exploit the relation between data of several columns. Using Elf results in performance benefits from several factors up to two orders of magnitude in comparison to accelerated scans, e.g., a scan using single instruction multiple data (SIMD). About factor 18 can be achieved for Q6 on a Lineitem table of scale factor s = 100, as visible in Fig. 1(c).
Elf is a tree structure combining prefix-redundancy elimination with an optimized memory layout explicitly designed for efficient main-memory access. Since the upper levels represent the paths to a lot of data, we use a memory layout that resembles that of a column store. This layout allows to prune the search space efficiently in the upper layers. Following the paths deeper to the leaves of the tree, the node entries are representing lesser and lesser data. Thus, it makes sense to switch to a memory layout that resembles a row store, because a row store is more efficient when accessing several columns of one tuple.
Author Copy of: David Broneske, Veit Köppen, Gunter Saake, Martin Schäler. Accelerating multi-column selection predicates in main-memory – the Elf approach. IEEE 33rd International Conference on Data Engineering (ICDE), 2017, pp. 647-658, DOI 10.1109/ICDE.2017.118
A. Conceptual design
In the following, we explain the basic design with the help of the example data in Table II. The data set shows four columns to be indexed and a tuple identifier (TID) that uniquely identifies each row (e.g., the row id in a column store).
C1 C2 C3 C4 ... TID
0  1  0  1  ... T1
0  2  0  0  ... T2
1  0  1  0  ... T3
TABLE II. RUNNING EXAMPLE DATA
In Fig. 2, we depict the resulting Elf for the four indexed columns of the example data from Table II. The Elf tree structure maps distinct values of one column to DimensionLists at a specific level in the tree. In the first column, there are two distinct values, 0 and 1. Thus, the first DimensionList, L(1), contains two entries and one pointer for each entry. The respective pointer points to the beginning of the respective DimensionList of the second column, L(2) and L(3). Note, as the first two tuples share the same value in the first column, we observe a prefix redundancy elimination. In the second column, we cannot eliminate any prefix redundancy, as all attribute combinations in this column are unique. As a result, the third column contains three DimensionLists: L(4), L(5), and L(6). In the final DimensionList, the structure of the entries changes. While in an intermediate DimensionList, an entry consists of a value and a pointer, the pointer in the final dimension is interpreted as a tuple identifier (TID).
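The prefix-redundancy elimination described above can be sketched as a nested mapping, where each tree level corresponds to one column and shared prefixes are stored only once. This is a minimal Python sketch of the concept over the running example, not the paper's actual C++ implementation:

```python
# Conceptual Elf over the running example (Table II):
# one tree level per column; shared prefixes stored only once.
rows = [  # (C1, C2, C3, C4, TID)
    (0, 1, 0, 1, "T1"),
    (0, 2, 0, 0, "T2"),
    (1, 0, 1, 0, "T3"),
]

def build_elf(rows):
    root = {}
    for *values, tid in rows:
        node = root
        for v in values[:-1]:          # descend, creating DimensionLists
            node = node.setdefault(v, {})
        node[values[-1]] = tid         # final DimensionList stores TIDs
    return root

elf = build_elf(rows)
# T1 and T2 share the prefix C1=0, which appears only once in the tree.
print(elf)  # {0: {1: {0: {1: 'T1'}}, 2: {0: {0: 'T2'}}}, 1: {0: {1: {0: 'T3'}}}}
```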
[Fig. 2. Elf tree structure using prefix-redundancy elimination: DimensionLists (1)–(9) over columns C1–C4; the final lists store the tuple identifiers T1–T3.]
The conceptual Elf structure is designed from the idea of prefix-redundancy elimination in Section II-B and the properties of multi-column selection predicates. To this end, it features the following properties to accelerate multi-column selection predicates on the conceptual level:
Prefix-redundancy elimination: Attribute values are mainly clustered, appear repeatedly, and share the same prefix. Thus, Elf exploits this redundancy as each distinct value per prefix exists only once in a DimensionList to reduce the amount of stored and queried data.
Ordered node elements: Each DimensionList is an ordered list of entries. This property is beneficial for equality or range predicates, because we can stop the search in a list if the current value is bigger than the searched constant/range.
Fixed depth: Since a column of a table corresponds to a level in the Elf, for a table with n columns, we have to descend at most n nodes to find the corresponding TID. This sets an upper bound on the search cost that does not depend on the amount of stored tuples, but mostly on the amount of used columns.
In summary, our index structure is a bushy tree structure of a fixed height resulting in stable search paths that allows for efficient multi-column selection predicate evaluation on a conceptual level. To further optimize such queries, we also need to optimize the memory layout of the Elf approach.
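These three properties can be exercised with a hypothetical multi-column range search. The sketch below assumes the conceptual Elf encoded as nested Python dicts (an illustrative representation, not the paper's linearized layout); the sorted scan with early break illustrates "ordered node elements", and the recursion depth equals the column count ("fixed depth"):

```python
# Hypothetical range search over a conceptual Elf (nested dicts,
# one level per column). Each DimensionList is scanned in sorted
# order and the scan stops once a value exceeds the upper bound.
def search(node, ranges, depth=0):
    lo, hi = ranges[depth]
    hits = []
    for value in sorted(node):           # DimensionLists are ordered
        if value > hi:
            break                        # early termination
        if value >= lo:
            child = node[value]
            if depth == len(ranges) - 1:
                hits.append(child)       # final level stores TIDs
            else:
                hits.extend(search(child, ranges, depth + 1))
    return hits

# Elf over the running example data (Table II).
elf = {0: {1: {0: {1: "T1"}}, 2: {0: {0: "T2"}}}, 1: {0: {1: {0: "T3"}}}}
# C1 in [0,0], C2 in [0,2], C3 in [0,0], C4 in [0,1] matches T1 and T2.
print(search(elf, [(0, 0), (0, 2), (0, 0), (0, 1)]))  # ['T1', 'T2']
```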
B. Improving Elf's memory layout
The straight-forward implementation of Elf is similar to data structures used in other tree-based index structures. However, this creates an OLTP-optimized version of the Elf, which we call InsertElf. To enhance OLAP query performance, we use an explicit memory layout, meaning that Elf is linearized into an array of integer values. For simplicity of explanation, we assume that column values and pointers within Elf are 64-bit integer values. However, our approach is not restricted to this data type. Thus, we can also use 64 bits for pointers and 32 bits for values, which is the most common case.
1) Mapping DimensionLists to arrays: To store the node entries – in the following named DimensionElements – of Elf, we use two integers. Since we expect the largest performance impact for scanning these potentially long DimensionLists, our first design principle is adjacency of the DimensionElements of one DimensionList, which leads to a preorder traversal during linearization. To illustrate this, we depict the linearized Elf from Fig. 2 in Fig. 3. The first DimensionList, L(1), starts at position 0 and has two DimensionElements: E(1), with the value 0 and the pointer 04 (depicted with brackets around it), and E(2), with the value 1 and the pointer 16 (the negativity of the value 1 marks the end of the list and is explained in the next subsection). For explanatory reasons, we highlight DimensionLists with alternating colors.
[Fig. 3. Memory layout as an array of 64-bit integers: the linearized Elf with values, bracketed pointers/offsets, and negated values marking the ends of DimensionLists.]
The pointers in the first list indicate that the DimensionLists in the second column, L(2) and L(3) (cf. Fig. 2), start at offset 04 and 16, respectively. This mechanism works for any subsequent DimensionList analogously, except for those in the final column (C4). In the final column, the second part of a DimensionElement is not a pointer within the Elf array, but a TID, which we encode as an integer as well. The order of DimensionLists is defined to support a depth-first search with expected low hit rates within the DimensionLists. To this end, we first store a complete DimensionList and then recursively store the remaining lists starting at the first element. We repeat this procedure until we reach the final column.
2) Implicit length control of arrays: The second design principle is size reduction. To this end, we store only values and pointers, but not the size of the DimensionLists. To indicate the end of such a list, we utilize the most significant bit (MSB) of the value. Thus, whenever we encounter a
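The MSB end-of-list flag can be sketched as plain bit manipulation. This sketch assumes 32-bit values for brevity (the paper's layout uses 64-bit integers), and the helper names are illustrative:

```python
# Implicit length control: instead of storing a DimensionList's
# size, the most significant bit (MSB) of a value flags the list's
# last entry. 32-bit values assumed here for illustration.
MSB = 1 << 31

def mark_last(value):
    return value | MSB          # set the MSB on the list's last entry

def is_last(encoded):
    return bool(encoded & MSB)  # check the end-of-list flag

def decode(encoded):
    return encoded & ~MSB       # strip the flag to recover the value

last = mark_last(1)
print(is_last(last), decode(last))   # True 1
print(is_last(0))                    # False
```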
Example data
Elf data structure
Performance gains
Reorganization
Topic 8 - Parallel Build of Elf
Intro
• Elf: multi-dimensional main memory index structure for efficient selections
• Stores data sorted in a multi-dimensional order
• Building involves a multi-dimensional sort → expensive
We’ve got
• Elf implementation in C++
Your Task
• Literature Research: Parallelization strategies for building indexstructures
• Understanding of the Elf and its optimization concepts
• Implementation of different parallelization strategies for Elf
• Performance evaluation against the standard implementation
Topic 9 - GPU-Accelerated Selections in Elf
Intro
• Elf: multi-dimensional main memory index structure for efficient selections
• Stores data sorted in a multi-dimensional order
• Multi-core architectures demand a clever parallelization strategy for GPUs
We’ve got
• Elf implementation in C++
• Parallel search and insert for CPU
Your Task
• Literature Research: Related parallelization strategies for index structures
• Understanding of the Elf and its optimization concepts
• Implementation of GPU traversal variants for Elf
• Performance evaluation against serial/CPU-parallel implementation
Finding your Team
Topics:
• Topic 1 - Fragment Skipper (Gridformation)
• Topic 2 - Deep hashing (Sim-Skip)
• Topic 3 - Graph processing in a neural network (Malko)
• Topic 4 - Cross-device data parallel query processing
• Topic 5 - Evaluating GPU-based Execution Models
• Topic 6 - Indepth Semi-Structured Data Storage
• Topic 7 - Order-By Queries on Elf
• Topic 8 - Parallel Build of Elf
• Topic 9 - GPU-Accelerated Selections in Elf
When do we meet for the programming test?
Literature Research
How to Perform Literature Research
• Efficient literature research requires
Knowledge of where to search
Knowledge of how to search
Finding adequate search terms
Structured review of papers
Knowledge of how to find information in papers
Where to Search (I)
• Different websites available that provide large literature databases
1. Google Scholar: http://scholar.google.de/
Keyword and concrete paper search
Often, PDFs are provided
2. DBLP: http://www.informatik.uni-trier.de/~ley/db/
Search for keyword, conferences, journals, author(s)
BibTeX and references to other websites
3. Citeseer: http://citeseerx.ist.psu.edu/about/site
keyword, fulltext, author, and title search
BibTeX and (partially) PDFs are provided
Where to Search (II)
• Publisher sites are also a suitable target
• ACM Digital Library: http://portal.acm.org/dl.cfm
Keyword, author, conference/literature (proceedings), and title search
BibTeX, mostly PDFs, and other information are provided
• IEEE Xplore: http://ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true
Similar to ACM, but only few PDFs
Extended access within university network
• Springer: http://www.springerlink.de/
Similar to previous
Extended access within university network
• Further search possibilities: on author, research group, or university sites
How to Search
Some hints to not get lost in the jungle
• Use distinct keywords (fingerprint vs. fingerprint data)
• Keep keywords simple (at most three words)
• Otherwise, search for whole title
• Read abstract (and maybe introduction) ⇒ decision for relevance
First insights
• Read abstract, introduction and background/related work(coarse-grained) to
. . . get a first idea of the approach
. . . find other relevant papers
Information Retrieval
Finding the required information
• Read the paper carefully
• Omit formal parts/sections
• Try to classify (core idea, main characteristics) ⇒ develop a classification/evaluation in mind
• Understand the big picture
• Make notes
• Do NOT translate each sentence
Finding your Team
Topics:
• Topic 1 - Fragment Skipper (Gridformation)
• Topic 2 - Deep hashing (Sim-Skip)
• Topic 3 - Graph processing in a neural network (Malko)
• Topic 4 - Cross-device data parallel query processing
• Topic 5 - Evaluating GPU-based Execution Models
• Topic 6 - Indepth Semi-Structured Data Storage
• Topic 7 - Order-By Queries on Elf
• Topic 8 - Parallel Build of Elf
• Topic 9 - GPU-Accelerated Selections in Elf