Big Data Lab Course (Big Data Praktikum)
Summer Semester 2018
Universität Leipzig, Institute of Computer Science
Database Group
Prof. Dr. E. Rahm
Goal: Design and implementation of an application or an
algorithm using existing Big Data
frameworks
Schedule
The whole group must attend every milestone review (Testat)
By early May: first meeting with the supervisor (request an appointment by e-mail)
End of May: Testat 1: getting to know the system / data import / solution sketch
Mid/end of July: Testat 2: presenting the implementation and results
Early August: Testat 3: presentation
15 minutes per group
Attendance is mandatory for all lab course participants
Organisation
Source code: GitHub repository, group members => collaborators
Repositories are forked to https://github.com/leipzig-bigdata-lab after the lab course
Java: Apache Maven 3 for project management
Test-driven development is encouraged; see the unit-testing documentation of the respective framework
Source code documentation is mandatory!
Use stable versions (consult the supervisor if in doubt), e.g. Flink 1.4.2
Solutions that run locally can be executed on a dedicated cluster;
arrange a date in early July with [email protected]
Data sets
e.g. https://github.com/caesar0301/awesome-public-datasets
Technical details
PPRL with Bloom Filters
Projects:
1.) Analyzing different BitSet Implementations for Bloom-Filter-based PPRL
2.) Analyzing different lengths of Bloom Filters
3.) Analyzing XOR-Folding for Bloom Filters
Privacy-Preserving Record Linkage (PPRL)
Find records in different databases that refer to
the same real world object
No disclosure of sensitive personal information
BitSet Implementations for Bloom Filters
Problems: Different BitSet implementations usable as basis for Bloom Filter
java.util.BitSet
OpenBitSet
boolean[]
No or outdated benchmarks
Task: Development of three Bloom Filter implementations
Performance benchmark (runtime, memory)
Proof of claims, e. g. “OpenBitSet is faster than java.util.BitSet in most operations
and *much* faster at calculating cardinality of sets and results of set operations.”
Technologies: Java
Apache Flink
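As a starting point for the benchmark described above, the following is a minimal sketch of two interchangeable bit-vector backings, one on java.util.BitSet and one on boolean[]. The interface name BitVector, the class names and the timing helper are assumptions made for illustration, not part of the assignment. OpenBitSet is omitted because it needs an external library. The timing loop is deliberately naive; a real benchmark would more likely use a harness such as JMH.

```java
import java.util.BitSet;

/** Hypothetical abstraction so that different backings can be measured the same way. */
interface BitVector {
    void set(int index);
    boolean get(int index);
    int cardinality();
}

/** Backing based on java.util.BitSet. */
final class JdkBitVector implements BitVector {
    private final BitSet bits;
    JdkBitVector(int length) { bits = new BitSet(length); }
    public void set(int index) { bits.set(index); }
    public boolean get(int index) { return bits.get(index); }
    public int cardinality() { return bits.cardinality(); }
}

/** Backing based on a plain boolean[] (one array element per bit). */
final class BooleanArrayBitVector implements BitVector {
    private final boolean[] bits;
    BooleanArrayBitVector(int length) { bits = new boolean[length]; }
    public void set(int index) { bits[index] = true; }
    public boolean get(int index) { return bits[index]; }
    public int cardinality() {
        int count = 0;
        for (boolean b : bits) {
            if (b) count++;
        }
        return count;
    }
}

final class BitVectorBenchmarkSketch {
    /** Sets every third bit, then returns the nanoseconds spent on repeated cardinality calls. */
    static long timeCardinality(BitVector vector, int length, int repetitions) {
        for (int i = 0; i < length; i += 3) vector.set(i);
        long checksum = 0;
        long start = System.nanoTime();
        for (int r = 0; r < repetitions; r++) checksum += vector.cardinality();
        long elapsed = System.nanoTime() - start;
        // Use the checksum so the loop cannot be optimised away entirely.
        if (checksum < 0) throw new IllegalStateException("unexpected overflow");
        return elapsed;
    }

    public static void main(String[] args) {
        int length = 1000;
        int repetitions = 100_000;
        System.out.println("java.util.BitSet: "
                + timeCardinality(new JdkBitVector(length), length, repetitions) + " ns");
        System.out.println("boolean[]:        "
                + timeCardinality(new BooleanArrayBitVector(length), length, repetitions) + " ns");
    }
}
```

Memory consumption is not covered by this sketch and would need a separate measurement.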
Lengths of Bloom Filters
Problems: PPRL Applications use given length of Bloom Filters for encoded records (usually
1000)
Better performance is expected with shorter Bloom Filters
But how does the length of the Bloom Filter affect the quality of the results?
Task: Encoding of given data sets with different parameters:
Lengths of Bloom Filters
Number of hash functions
Evaluation of quality (recall, precision) of PPRL processing based on parameters
Exploration of practical boundaries
Technologies: Java
Apache Flink
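The parameters named in the task (filter length, number of hash functions) can be explored with a small encoder such as the sketch below. It is an illustration under assumptions: bigrams with "_" padding, a double-hashing scheme built from MD5 and SHA-1, and the Dice coefficient as similarity measure are common choices in the PPRL literature, but the slides do not prescribe them. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Locale;

final class BloomEncodingSketch {

    /** Maps every bigram of the value to numHashes positions in a filter of the given length. */
    static BitSet encode(String value, int length, int numHashes) {
        BitSet filter = new BitSet(length);
        String padded = "_" + value.toLowerCase(Locale.ROOT) + "_"; // padding is an assumption
        for (int i = 0; i + 2 <= padded.length(); i++) {
            String bigram = padded.substring(i, i + 2);
            int h1 = hash("MD5", bigram);
            int h2 = hash("SHA-1", bigram);
            for (int j = 0; j < numHashes; j++) {
                // Double hashing: position_j = (h1 + j * h2) mod length.
                filter.set(Math.floorMod(h1 + j * h2, length));
            }
        }
        return filter;
    }

    /** Interprets the first four digest bytes as an int. */
    private static int hash(String algorithm, String token) {
        try {
            byte[] d = MessageDigest.getInstance(algorithm)
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Dice coefficient of two filters: 2 * |a AND b| / (|a| + |b|). */
    static double dice(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        int total = a.cardinality() + b.cardinality();
        return total == 0 ? 1.0 : 2.0 * common.cardinality() / total;
    }

    /** Precision = true positives / all pairs reported as matches. */
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    /** Recall = true positives / all true matches. */
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        // Compare the similarity of two name variants for several filter lengths.
        for (int length : new int[] {250, 500, 1000}) {
            BitSet a = encode("meier", length, 10);
            BitSet b = encode("meyer", length, 10);
            System.out.printf("length=%d dice=%.3f%n", length, dice(a, b));
        }
    }
}
```

In an evaluation, pairs whose Dice value exceeds a chosen threshold would be counted as matches and compared against a gold standard to obtain precision and recall per parameter setting.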
XOR-Folding for Bloom Filters
Problems: The goal of PPRL is to hide personal data in the matching process by encoding the
fields in a Bloom Filter. BUT some cryptanalysis methods can disclose the original data.
The main weakness of Bloom Filters is the frequency of some tokens (e.g. “er” is a
frequent bigram, so many Bloom Filters will have the same positions set to 1).
Is it possible to hide or obfuscate these frequencies by XOR-folding the Bloom
Filter?
What is the impact of the folding operation on the linkage quality?
Task: Implementation of some folding operations
Evaluation of quality (recall, precision)
Technologies: Java
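One possible folding operation, shown here only as a sketch, splits a filter of even length m into two halves and combines them bitwise with XOR, which yields a filter of length m/2. The slides ask for "some folding operations" without fixing a variant, so the choice of a single half-fold and the names below are assumptions.

```java
import java.util.BitSet;

final class XorFoldingSketch {

    /**
     * Folds a Bloom filter of the given (even) length once:
     * result bit i = filter bit i XOR filter bit (i + length / 2).
     */
    static BitSet fold(BitSet filter, int length) {
        if (length % 2 != 0) {
            throw new IllegalArgumentException("length must be even");
        }
        int half = length / 2;
        BitSet lower = filter.get(0, half);       // bits [0, half)
        BitSet upper = filter.get(half, length);  // bits [half, length), shifted down to start at 0
        lower.xor(upper);
        return lower;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet(8);
        filter.set(1);
        filter.set(2);
        filter.set(5);
        filter.set(7);
        System.out.println("original: " + filter + ", folded: " + fold(filter, 8));
    }
}
```

Applying fold repeatedly halves the length each time; the linkage quality after each step could then be measured with the same recall and precision evaluation as for unfolded filters.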
OSTMap - Open Source Tweet Map
• https://github.com/IIDP/OSTMap
• OSTMap development started as a project at the IT-Ringvorlesung 2016.
• A team of six students (with some help from two big data experts) implemented
OSTMap over a period of 6 weeks.
• OSTMap reads geotagged data from the Twitter stream.
• We store tweets in a Hadoop cluster running Apache Accumulo and Apache Flink.
Efficient Term Index for Twitter Data and Trend Visualization
• Part 1:
• Currently the term search supports lookups for exactly one term, e.g. “bigdata”
• We want to support fast multi-term queries such as “the” “white” “house”
• Keyword: document-partitioned indexing
• Part 2:
• We want to visualize current trends…
• With their geographic distribution and
• Their temporal spread.
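The idea behind document-partitioned indexing can be sketched in memory: every partition indexes only its own subset of tweets, a multi-term query intersects the posting lists inside each partition, and the partial results are merged. The sketch below uses plain Java collections instead of Apache Accumulo; the class name, the partitioning by tweet id and the tokenisation are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DocumentPartitionedIndexSketch {
    /** One inverted index (term -> tweet ids) per partition. */
    private final List<Map<String, Set<Long>>> partitions = new ArrayList<>();

    DocumentPartitionedIndexSketch(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) partitions.add(new HashMap<>());
    }

    /** Indexes all terms of a tweet in the single partition that owns the tweet. */
    void add(long tweetId, String text) {
        int p = (int) Math.floorMod(tweetId, (long) partitions.size());
        Map<String, Set<Long>> index = partitions.get(p);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new HashSet<>()).add(tweetId);
            }
        }
    }

    /** Returns the ids of all tweets that contain every query term. */
    Set<Long> search(String... terms) {
        Set<Long> result = new TreeSet<>();
        for (Map<String, Set<Long>> index : partitions) {
            Set<Long> hits = null;
            for (String term : terms) {
                Set<Long> postings =
                        index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
                if (hits == null) {
                    hits = new HashSet<>(postings);
                } else {
                    hits.retainAll(postings); // intersection stays local to the partition
                }
            }
            if (hits != null) result.addAll(hits);
        }
        return result;
    }

    public static void main(String[] args) {
        DocumentPartitionedIndexSketch index = new DocumentPartitionedIndexSketch(3);
        index.add(1L, "The White House press briefing");
        index.add(2L, "A white cat in the house");
        index.add(3L, "The house is red");
        System.out.println(index.search("the", "white", "house"));
    }
}
```

Because all terms of one tweet live in the same partition, no posting lists have to be shipped between partitions to answer a conjunctive query.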
Sentiment Analysis for Twitter Data
• Part 1:
• Use Java-based libraries for in-stream sentiment analysis of Twitter data
• Batch-based sentiment analysis, e.g. with Spark MLlib's Naive Bayes classifier
• Write the results of each approach to a table for sentiment analysis results
• Part 2:
• Build a frontend in OSTMap in which users label the sentiment of randomly drawn tweets
• Use this information to assess the quality of the sentiment analysis procedures and visualize the results in OSTMap
Polyglot DB
• Different applications require different types of databases: relational, key-value, document, graph, …
• In practice: several types are used side by side
• Advantage: the best-suited database for each use case
• Examples:
• Relational: safety, homogeneous data
• Key-value: fast access, simple data structure
• Document: flexible schema, search functions
• Graph: relationships, traversal
• Task: What advantage does a graph database offer over a document database?
Application
• Yelp Dataset
• Document DB: MongoDB
• Information about businesses
• Storage of reviews
• Search by category
• Geospatial query
• Recommendations: similar restaurants, e.g. which restaurants were rated equally well/badly by the same reviewer?
• Is a graph database (Neo4j) faster than MongoDB?
• Even with synchronisation?
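To make the recommendation question concrete ("which restaurants were rated equally by the same reviewer?"), the following in-memory sketch computes the answer without any database. It is only meant to show the shape of the query, which amounts to a self-join of the reviews on the reviewer. The Review class and its fields userId, businessId and stars are assumptions and are not taken from the actual dataset schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SimilarRestaurantsSketch {

    /** Hypothetical, simplified review record. */
    static final class Review {
        final String userId;
        final String businessId;
        final int stars;

        Review(String userId, String businessId, int stars) {
            this.userId = userId;
            this.businessId = businessId;
            this.stars = stars;
        }
    }

    /**
     * For the given business, counts per other business how many reviewers
     * gave both businesses the same number of stars.
     */
    static Map<String, Integer> similarTo(String businessId, List<Review> reviews) {
        Map<String, List<Review>> byUser = new HashMap<>();
        for (Review r : reviews) {
            byUser.computeIfAbsent(r.userId, u -> new ArrayList<>()).add(r);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (List<Review> userReviews : byUser.values()) {
            for (Review anchor : userReviews) {
                if (!anchor.businessId.equals(businessId)) continue;
                for (Review other : userReviews) {
                    if (!other.businessId.equals(businessId) && other.stars == anchor.stars) {
                        counts.merge(other.businessId, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Review> reviews = Arrays.asList(
                new Review("u1", "A", 5), new Review("u1", "B", 5), new Review("u1", "C", 2),
                new Review("u2", "A", 1), new Review("u2", "C", 1), new Review("u3", "B", 5));
        System.out.println(similarTo("A", reviews));
    }
}
```

In a graph database the same question becomes a two-hop traversal (business, reviewer, business), whereas a document database has to join the review collection with itself; comparing the two is the subject of the task.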
Bolt-on causal consistency
• Causal consistency is a compromise between sequential consistency and eventual consistency
• The order of operations is preserved, but only for causally
related operations (happened-before relation)
• Less coordination increases availability
• Strongest consistency model that still permits availability (in particular for write operations)
despite network partitions
• Only a few NoSQL databases support causal consistency
• Bolt-on = client-side implementation
• Paper: Bailis et al. (2013), http://www.bailis.org/papers/bolton-sigmod2013.pdf
• Prototype (GitHub): Java, Cassandra
Task
• Implementation with JavaScript, PouchDB and CouchDB
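The happened-before relation can be illustrated with vector clocks, one common way to track causality. The sketch below is a generic illustration and not the mechanism of the Bailis et al. prototype; the class name and the representation of a clock as a map from node id to counter are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VectorClockSketch {

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    /**
     * Compares two vector clocks. a is BEFORE b if every component of a is
     * less than or equal to the one in b and at least one is strictly smaller.
     * Missing entries count as 0.
     */
    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean less = false;
        boolean greater = false;
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());
        for (String node : nodes) {
            int av = a.getOrDefault(node, 0);
            int bv = b.getOrDefault(node, 0);
            if (av < bv) less = true;
            if (av > bv) greater = true;
        }
        if (less && greater) return Order.CONCURRENT;
        if (less) return Order.BEFORE;
        if (greater) return Order.AFTER;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<String, Integer> write1 = new HashMap<>();
        write1.put("clientA", 1);
        Map<String, Integer> write2 = new HashMap<>(write1);
        write2.put("clientB", 1); // clientB wrote after having seen write1
        Map<String, Integer> write3 = new HashMap<>();
        write3.put("clientA", 2); // clientA wrote again without having seen write2
        System.out.println(compare(write1, write2));
        System.out.println(compare(write2, write3));
    }
}
```

Only writes ordered as BEFORE or AFTER have to be applied in that order; CONCURRENT writes need no coordination, which is where the availability gain comes from.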
Creation and visualization of temporal graphs
[1] Aynaud, Thomas & Fleury, Eric & Guillaume, Jean-Loup & Wang, Qinna. (2013). Communities in Evolving Networks: Definitions, Detection, and Analysis
Techniques. Modeling and Simulation in Science, Engineering and Technology. 2. 159-200. 10.1007/978-1-4614-6729-8_9.
“Graphs are everywhere”: friendship networks on Facebook, community
interactions on Stack Overflow, video likes and channel subscriptions on YouTube,
citation networks
Real-world graphs change over time: additions, deletions and updates of
edges, vertices and their properties
Much work has been done on analysing and visualizing static graphs
“How do communities evolve over a specific time range?”
“At which point in time does the number of citations grow rapidly? Did other citations
influence that?”
Creation and visualization of temporal graphs
Tasks
Create a temporal EPGM from a network dataset
Query graph data by time range
Visualize the graph in an interactive web application
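Querying by time range can be sketched with edges that carry a validity interval. The sketch below uses plain Java collections rather than Gradoop; the TemporalEdge class, its field names and the half-open interval convention [validFrom, validTo) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class TemporalGraphSketch {

    /** Hypothetical edge that is valid in the half-open interval [validFrom, validTo). */
    static final class TemporalEdge {
        final String source;
        final String target;
        final long validFrom;
        final long validTo;

        TemporalEdge(String source, String target, long validFrom, long validTo) {
            this.source = source;
            this.target = target;
            this.validFrom = validFrom;
            this.validTo = validTo;
        }

        @Override
        public String toString() {
            return source + "->" + target + "[" + validFrom + "," + validTo + ")";
        }
    }

    /** Returns all edges whose validity interval overlaps the query range [from, to). */
    static List<TemporalEdge> between(List<TemporalEdge> edges, long from, long to) {
        return edges.stream()
                .filter(e -> e.validFrom < to && e.validTo > from)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<TemporalEdge> edges = Arrays.asList(
                new TemporalEdge("a", "b", 0, 10),
                new TemporalEdge("b", "c", 10, 20),
                new TemporalEdge("c", "d", 30, Long.MAX_VALUE)); // still valid "now"
        System.out.println(between(edges, 5, 15));
    }
}
```

The same overlap predicate would apply to vertices; in a distributed setting it would be expressed as a filter over the edge and vertex data sets.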
[2] A. Beveridge and J. Shan, “Network of Thrones”, Math Horizons Magazine, Vol. 23, No. 4 (2016), pp. 18-22.
[3] Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. “Motifs in Temporal Networks.” In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017.
[Figure: timeline from 1990 to now]
Size of graph: Stack Overflow temporal network dataset [3]: 2,601,977 nodes, 63,497,050 edges
FastText on Spark
Word2Vec
• Words are represented by a vector
• Trained on a large corpus, taking the context of words into account
• Skip-gram model
FastText on Spark
Issues
• Unknown words in the test corpus
• Missing fuzzy component
Solution
• FastText
• Uses character n-gram sequences to represent words
• Can generate embeddings even for words that are not included in the
vocabulary
Task
• Understanding FastText
• Representation of words
• Neural Network
• Implementation with DeepLearning4j
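The n-gram representation can be sketched in a few lines. Following the description in the FastText paper, a word is wrapped in the boundary markers "<" and ">" and split into character n-grams; its vector is then composed from the vectors of those n-grams, so an unseen word still receives an embedding as long as some of its n-grams are known. The class name, the lookup table and the plain summation below are simplifying assumptions, not the DeepLearning4j API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CharNGramSketch {

    /** Returns all character n-grams of "<word>" with minN <= n <= maxN. */
    static List<String> nGrams(String word, int minN, int maxN) {
        String marked = "<" + word + ">";
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= marked.length(); i++) {
                grams.add(marked.substring(i, i + n));
            }
        }
        return grams;
    }

    /** Sums the vectors of all known n-grams; unknown n-grams are skipped. */
    static double[] embed(String word, Map<String, double[]> nGramVectors,
                          int dimension, int minN, int maxN) {
        double[] vector = new double[dimension];
        for (String gram : nGrams(word, minN, maxN)) {
            double[] gramVector = nGramVectors.get(gram);
            if (gramVector == null) continue;
            for (int d = 0; d < dimension; d++) vector[d] += gramVector[d];
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("where", 3, 3));

        // Hypothetical, hand-made n-gram vectors; real ones would be learned.
        Map<String, double[]> table = new HashMap<>();
        table.put("<wh", new double[] {1.0, 0.0});
        table.put("re>", new double[] {0.0, 2.0});
        double[] v = embed("where", table, 2, 3, 3);
        System.out.println(v[0] + ", " + v[1]);
    }
}
```

The paper uses n from 3 to 6; the example call restricts n to 3 only to keep the output short.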
Distributed FastText on TensorFlow
Issues
• Unknown words in the test corpus
• Missing fuzzy component
Solution
• FastText
• Uses character n-gram sequences to represent words
• Used to generate embeddings
Task
• Distributed Implementation of FastText on TensorFlow
Color Detection for Products (Farberkennung von Produkten)
• Images can be very helpful for identifying duplicates in product catalogs
• Goal:
• Extraction of color information from product images
• Segmentation and annotation of foreground and background, possibly further categories
• Technology:
• Convolutional neural networks (usable e.g. via TensorFlow)
• Data:
• 90,000 product images from WebDataSolutions GmbH
Analytics of Bitcoin Transaction Data
• Parsing the Bitcoin blockchain
• Processing updates caused by new transactions
• Building a graph in Gradoop
• Analysis with Gradoop
• At most 2 students
• with good Java programming skills
• Prerequisite: Flink experience or the lecture (VL) Cloud Data Management
[Figure: analytical workflows]
Webgraph Analysis with GRADOOP
• commoncrawl.org: quarterly snapshots of the web graph at host level
• Questions:
• How is {the University of Leipzig, the Bach Digital project} interlinked with other institutions and research projects?
• How did this change over time?
• Are there interesting structures or missing links (e.g. triangle closing)?
• Tasks:
• Data Import to GRADOOP, Preprocessing
• Data exploration
• Development of analytical questions
• Data Analysis with GRADOOP operators
• Visualization / Reporting
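The "missing links" question can be approached with triangle closing: two hosts that share neighbours but are not linked to each other form an open triangle and are candidates for a missing link. The sketch below works on a small in-memory adjacency map rather than on GRADOOP operators; the class name, the symmetric (undirected) adjacency representation and the minCommon threshold are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class TriangleClosingSketch {

    /**
     * Returns all unlinked host pairs ("a -- b") that have at least minCommon
     * common neighbours, together with that number. The adjacency map is
     * expected to be symmetric, i.e. to describe an undirected graph.
     */
    static Map<String, Integer> missingLinks(Map<String, Set<String>> adjacency, int minCommon) {
        Map<String, Integer> candidates = new TreeMap<>();
        for (Set<String> neighbourSet : adjacency.values()) {
            List<String> neighbours = new ArrayList<>(neighbourSet);
            Collections.sort(neighbours); // fixed order so each pair gets one key
            for (int i = 0; i < neighbours.size(); i++) {
                for (int j = i + 1; j < neighbours.size(); j++) {
                    String a = neighbours.get(i);
                    String b = neighbours.get(j);
                    boolean linked = adjacency.getOrDefault(a, Collections.emptySet()).contains(b);
                    if (!linked) {
                        candidates.merge(a + " -- " + b, 1, Integer::sum);
                    }
                }
            }
        }
        candidates.values().removeIf(count -> count < minCommon);
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = new HashMap<>();
        graph.put("x", new HashSet<>(Arrays.asList("a", "b")));
        graph.put("y", new HashSet<>(Arrays.asList("a", "b")));
        graph.put("a", new HashSet<>(Arrays.asList("x", "y")));
        graph.put("b", new HashSet<>(Arrays.asList("x", "y")));
        System.out.println(missingLinks(graph, 2));
    }
}
```

On the host-level web graph the pair enumeration per vertex grows quadratically with the degree, so hubs would need to be sampled or filtered before applying this approach.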
Topic | Frameworks | #Students | Supervisor
PPRL: Analyzing different BitSet Implementations for Bloom-Filter-based PPRL | Java / Apache Flink | 2 | Franke
PPRL: Analyzing different lengths of Bloom Filters | Java / Apache Flink | 2 | Gladbach
PPRL: Analyzing XOR-Folding for Bloom Filters | Java / Apache Flink | 2 | Sehili
OSTMap: Efficient Term Index for Twitter Data and Trend Visualization | Java / Apache Flink / Apache Accumulo / JavaScript | 2 | Grimmer
OSTMap: Sentiment Analysis for Twitter Data | Java / Apache Flink / Apache Accumulo / JavaScript | 2 | Kricke
Creation and visualization of temporal graphs | Java / Apache Flink / Gradoop / JavaScript | 2 | Rost
Polyglot DB | Java, MongoDB, Neo4j | 2 | Zschache
Bolt-on causal consistency | JavaScript, CouchDB, PouchDB | 2 | Zschache
FastText on Spark | Spark, DeepLearning4j | 2 | Christen
Distributed FastText on TensorFlow | TensorFlow | 2 | Alkhouri
Color Detection for Products (Farberkennung von Produkten) | TensorFlow | 2 | Peukert
Analysis of the Bitcoin Blockchain | Java / Apache Flink / Gradoop | 2 | Peukert
Webgraph Analysis | Java / Apache Flink / Gradoop / (JavaScript) | 2 | Wilke