![Page 1: Big Data Praktikum - uni-leipzig.dedbs.uni-leipzig.de/file/bigprak_intro.pdfAnalytics of Development Project Data Gradoop/Flink 2 Peukert Analyse LOD datasets within Gradoop Gradoop/Flink](https://reader034.vdocuments.mx/reader034/viewer/2022042219/5ec579946f458307fe416c2b/html5/thumbnails/1.jpg)
Big Data Lab Course (Big Data Praktikum)
Database Department
Summer semester 2017
Organization
Goal: design and implement an application or an algorithm using existing big data frameworks
Schedule
Attendance of the whole group is mandatory at all review sessions (Testate)
By early April: first meeting with the supervisor (request an appointment by email)
End of May: Testat 1: get to know the system / data import / solution outline
Mid to end of July: Testat 2: present implementation and results
Early August: Testat 3: presentation
15 minutes per group
Attendance is mandatory for all lab course participants
Technical Details
Source code: GitHub repository, group members => collaborators
Repositories are forked to https://github.com/leipzig-bigdata-lab after the lab course
Java: Apache Maven 3 for project management
Test-driven development is encouraged; see the unit testing documentation of the respective frameworks
Source code documentation is mandatory!
Use stable versions (consult your supervisor if needed), e.g. Flink 1.1.2
Solutions that run locally can be executed on a dedicated cluster
Arrange an appointment in early July with [email protected]
Datasets: https://github.com/caesar0301/awesome-public-datasets
Is the globe really warming?
Yin-Chi Lin
Mr Trump's top advisers are currently divided on the issue (Paris climate agreement), with some, including Environmental Protection Agency head Scott Pruitt, eager for the US to leave the deal. "Paris is something that we need to really look at closely, because it's something we need to exit, in my opinion," Mr Pruitt said in an interview with Fox News Channel's "Fox & Friends" last week. "It's a bad deal for America. It was an America second, third or fourth kind of approach."
Is the globe warming?
• If yes, since when and at what magnitude?
• Are there regional differences (e.g. between different continents, countries, climate zones …)?
• Are there seasonal differences?
• Is the rise in temperature really correlated with the increase in CO2 emissions?
• “Europe’s Atlantic-facing countries will suffer heavier rainfalls, greater flood risk, more severe storm damage and an increase in “multiple climatic hazards…”
…….
IS THE GLOBE WARMING? Tuesday 18 April 2017
Data Sources & Tools
GHCN-Daily dataset (Global Historical Climatology Network):
• 1763-2017
• more than 100,000 stations across the globe
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf
Global CO2 Emissions from Fossil-Fuel Burning, Cement Manufacture, and Gas Flaring:
• 1751-2014
http://cdiac.ornl.gov/ftp/ndp030/global.1751_2014.ems
Tools: SparkR + map visualization tool
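A first step with the GHCN-Daily data could look like the following minimal sketch. It is written in Python for illustration, although the project itself targets SparkR. It assumes the comma-separated by_year record layout (station ID, date as YYYYMMDD, element, value, followed by flag fields) with TMAX reported in tenths of a degree Celsius; the function name and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def mean_tmax_by_year(lines):
    """Average all TMAX readings per calendar year, in degrees Celsius."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(lines):
        station, date, element, value = row[:4]
        if element != "TMAX":
            continue  # skip precipitation, snow, TMIN, ...
        year = int(date[:4])
        sums[year] += int(value) / 10.0  # tenths of a degree -> degrees
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sums}


# Hypothetical sample rows, not real measurements.
sample = io.StringIO(
    "STATION0001,20160101,TMAX,100,,,X,\n"
    "STATION0001,20160102,TMAX,120,,,X,\n"
    "STATION0001,20160102,PRCP,5,,,X,\n"
    "STATION0001,20170101,TMAX,130,,,X,\n"
)
yearly = mean_tmax_by_year(sample)
print(yearly)
```

On the full dataset the same aggregation would be expressed as a distributed group-by, and the yearly means could then be compared with the CO2 emission series.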
Analysing Metabolic Networks in Gradoop
Anika Groß
Analysing Metabolic Networks in Gradoop
• Model metabolic networks and biochemical reactions in the EPGM (Extended Property Graph Model)
• Transform and import into Gradoop
• Data from http://bigg.ucsd.edu
• Data analysis: "hub" molecules, pattern search, frequent subgraph mining, …
[Figures taken from [1], [2] and http://bigg.ucsd.edu]
[1] Lanzeni, Messina, Archetti: Graph models and mathematical programming in biochemical network analysis and metabolic engineering design. Computers & Mathematics with Applications, 2008.
[2] Junghanns, Petermann: Verteilte Graphanalyse mit Gradoop. JavaSPEKTRUM 05/2016.
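The "hub" molecule analysis can be illustrated with a small single-machine sketch in Python. It assumes that a hub is simply a metabolite that takes part in many reactions; the reaction list, the identifiers and the function name are toy examples and not taken from the BiGG data. In Gradoop the same idea would be expressed with graph operators on the EPGM.

```python
from collections import Counter

# Hypothetical toy reactions: (reaction id, substrates, products).
reactions = [
    ("R1", ["glucose", "ATP"], ["G6P", "ADP"]),
    ("R2", ["G6P"], ["F6P"]),
    ("R3", ["F6P", "ATP"], ["FBP", "ADP"]),
]


def hub_molecules(reactions, top_n=2):
    """Rank metabolites by the number of reactions they take part in."""
    degree = Counter()
    for _rid, substrates, products in reactions:
        # Count each metabolite once per reaction, whatever its role.
        for metabolite in set(substrates) | set(products):
            degree[metabolite] += 1
    return degree.most_common(top_n)


hubs = hub_molecules(reactions)
print(hubs)
```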
Analyzing Panama Papers with Gradoop
Eric Peukert
Analyzing Panama Papers with Gradoop
• Load the Panama Papers into Gradoop (Neo4j connector or from CSV)
• Visualize schema
• Implement analytical workflows
• Optional: link with additional sources in Germany such as people/companies from dbpedia
Analytics of Development Project Data
Eric Peukert
Analytics of Development Project Data
https://issues.apache.org/jira/rest/api/2/project
Analytical Workflows
Analysis of LOD datasets within Gradoop
Markus Nentwig
Linked Open Data
• Structured, interlinked data using standard technologies
  • HTTP: dereference entities
  • RDF: machine-readable data exchange format
  • URIs: identification of entities
Gradoop: Distributed Graph Analytics
- Use graph operators like aggregation, grouping or subgraph to analyse data
- On top of Apache Flink
- Extended Property Graph Model
- Support for different data sources like CSV, JSON, …
- Currently missing: RDF
Tasks:
Data Source -> Gradoop Analysis -> Data Sink
- Implement data source and data sink for RDF data format
  - Based on existing data sources
  - Import/export LOD data set
  - Handle RDF reification
- Analyze a given dataset with simple Gradoop operators
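One possible way to map RDF onto a property graph is sketched below in Python. It is not Gradoop's API. The mapping is an assumption made for illustration: triples whose object is a URI become edges, and triples whose object is a plain literal become vertex properties. The regular expression only covers simple N-Triples lines; blank nodes, typed or language-tagged literals and reification are deliberately left out.

```python
import re

# Matches "<s> <p> <o> ." and "<s> <p> "plain literal" ." only.
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$'
)


def triples_to_property_graph(lines):
    """Turn simple N-Triples lines into (vertices, edges)."""
    vertices = {}  # URI -> dict of properties
    edges = []     # (source URI, predicate URI, target URI)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # unsupported or malformed line
        subject, predicate, obj_uri, literal = match.groups()
        vertices.setdefault(subject, {})
        if obj_uri is not None:
            vertices.setdefault(obj_uri, {})
            edges.append((subject, predicate, obj_uri))
        else:
            vertices[subject][predicate] = literal
    return vertices, edges


# Hypothetical example data using the reserved example.org domain.
sample = [
    "<http://example.org/Leipzig> <http://example.org/locatedIn> <http://example.org/Germany> .",
    '<http://example.org/Leipzig> <http://example.org/name> "Leipzig" .',
]
vertices, edges = triples_to_property_graph(sample)
print(vertices)
print(edges)
```

A data sink would perform the inverse mapping, writing one triple per edge and per property.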
Analytics of Publication Data with Graphulo
Matthias Kricke
Analytics of Publication Data with Graphulo
Technologies
• Graphulo, which is based on
  • Apache Accumulo (Distributed DBS)
  • Apache Hadoop HDFS (Distributed Filesystem)
Data
• DBLP
  • open bibliographic information on major computer science journals and proceedings
Task
• Import DBLP into Graphulo
• Analyze DBLP by means of Graphulo
  • Graph diameter?
  • Size of biggest connected component?
  • …
Cluster Environment
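The two example questions (graph diameter, size of the biggest connected component) can be made concrete with a single-machine sketch in Python based on breadth-first search. It does not use Graphulo, which would express such analyses as operations on Accumulo tables; the co-author graph and all names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, start):
    """Hop distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def largest_component(graph):
    """Node set of the biggest connected component."""
    seen, best = set(), set()
    for node in graph:
        if node in seen:
            continue
        component = set(bfs_distances(graph, node))
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def diameter(graph, component):
    """Longest shortest path inside one component (one BFS per node)."""
    return max(max(bfs_distances(graph, n).values()) for n in component)


# Hypothetical undirected co-author graph as adjacency sets.
coauthors = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"},
    "E": {"F"}, "F": {"E"},
}
component = largest_component(coauthors)
print(len(component), diameter(coauthors, component))
```

Running one BFS per node is quadratic in the worst case, which is one reason a distributed approach is of interest for a graph of DBLP's size.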
Classification of program traces using TensorFlow or Caffe
Martin Grimmer
Classification of program traces using TensorFlow
• A program trace is the sequence of system calls of a program.
54 175 120 175 175 3 175 175 120 175 120 175 120 175 175 120 175 3 3 3 175 120 175 175 175 7 3 3 175 120 175 7 175 7 119 174 54 3 3 175 175 3 120 175 175 120 175 120 120 175 175 54 140 3 175 120 175 175 175 175 175 174 7 175 7 119 3 3 175 3 175 175 120 175 7 175 3 175 120 175 175 54 7 174 3 175 120 7 175 175 120 175 175 3 175 120 175 3 3 120 175 120 175 175 7 54 175 120 175 7 175 7 119 174 54 3 120 175 175 120 54 3 120 175 175 54 140 175 175 174 54 175 120 175 175 54 140
• TensorFlow is an open source library for artificial intelligence.
• https://www.tensorflow.org/
• The task: Build a classifier with TensorFlow that learns what is normal.
-> One class classification problem!
Use this classifier to test unknown system traces for abnormal behavior.
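To illustrate the one-class setting, here is a deliberately simple sliding-window baseline in Python. It is not a TensorFlow model and not the intended solution: it merely memorizes all system call windows seen in normal traces and scores a new trace by the fraction of windows it has never seen. The window size, the function names and the choice of training trace (the first ten calls of the example above) are assumptions.

```python
def windows(trace, size=3):
    """Set of all contiguous system call windows of the given size."""
    return {tuple(trace[i:i + size]) for i in range(len(trace) - size + 1)}


def train(normal_traces, size=3):
    """Learn 'normal' as the set of windows seen in normal traces."""
    known = set()
    for trace in normal_traces:
        known |= windows(trace, size)
    return known


def anomaly_score(trace, known, size=3):
    """Fraction of windows in the trace that never occurred in training."""
    seen = windows(trace, size)
    if not seen:
        return 0.0  # trace shorter than one window
    return len(seen - known) / len(seen)


normal = [54, 175, 120, 175, 175, 3, 175, 175, 120, 175]
known = train([normal])
print(anomaly_score(normal, known))
print(anomaly_score([1, 2, 3, 4], known))
```

A learned model replaces the exact window lookup with a score that generalizes to unseen but similar sequences.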
Speed up Entity Resolution with Bit Arrays
Ziad Sehili
Entity Resolution
Find records in different databases that refer to the same real world object
Build tokens:
trigrams(tommas schmidt) = {tom, omm, mma, mas, sch, chm, hmi, mid, idt}
trigrams(tomas schmidt) = {tom, oma, mas, sch, chm, hmi, mid, idt}
Jaccard similarity = #intersection_tokens / #union_tokens = 7/10
Entity Resolution With Bit Arrays
A hash function h maps every token to a position (0 to 19) in a bit array of length 20:
tommas schmidt: tom, omm, mma, mas, sch, chm, hmi, mid, idt
  bits: 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0
tomas schmidt: tom, oma, mas, sch, chm, hmi, mid, idt
  bits: 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 0 0 1 0 0
Jaccard similarity = AND / OR = 7/9
Problems:
1. How can the same or a similar quality as with string comparison be reached? (Choose the length of the bit array to avoid collisions, 303???, or increase the number of hash functions!)
2. Does this method improve the runtime?
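Both variants can be compared with a short Python sketch. The token-based part reproduces the 7/10 from the trigram example; the bit array part is only one possible realization, assuming a single hash function built from SHA-256 and a Python integer as bit array. Function names and the default length of 20 are illustrative choices.

```python
import hashlib


def trigrams(text):
    """Trigrams per word, as in the example (no trigrams across spaces)."""
    tokens = set()
    for word in text.split():
        tokens |= {word[i:i + 3] for i in range(len(word) - 2)}
    return tokens


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)


def bit_array(tokens, length=20):
    """Set one bit per token; collisions are possible by design."""
    bits = 0
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest, "big") % length)
    return bits


def bit_jaccard(x, y):
    """Jaccard estimate from bit counts: popcount(AND) / popcount(OR)."""
    return bin(x & y).count("1") / bin(x | y).count("1")


t1 = trigrams("tommas schmidt")
t2 = trigrams("tomas schmidt")
print(jaccard(t1, t2))
print(bit_jaccard(bit_array(t1), bit_array(t2)))
```

Increasing `length` makes collisions less likely, so the bit-based estimate moves closer to the exact value; this is the trade-off raised in the first problem above.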
Parameter Tuning for Entity-Resolution Problems
Victor Christen
Parameter Tuning for Entity-Resolution Problems
• Quality depends on the similarity function sim_f
• Determined by compared attributes and weights for each attribute combination
[Figure: two examples in which the values for the attributes Name, Description and Price (0.7, 0.4, 0.9 and 0.6, 0.4, 0.5) are combined by sim_f]
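One common form of such a similarity function is a weighted average of the attribute similarities. The Python sketch below assumes this form and reads the values 0.7, 0.4 and 0.9 as attribute similarities of one record pair; the weights are hypothetical and are exactly the kind of parameter that would need tuning.

```python
def sim_f(similarities, weights):
    """Weighted average of per-attribute similarities."""
    total_weight = sum(weights[attr] for attr in similarities)
    weighted = sum(similarities[attr] * weights[attr] for attr in similarities)
    return weighted / total_weight


# Attribute similarities of one record pair (interpretation assumed).
pair = {"name": 0.7, "description": 0.4, "price": 0.9}
# Hypothetical weights, not taken from the slides.
weights = {"name": 0.5, "description": 0.2, "price": 0.3}

score = sim_f(pair, weights)
print(round(score, 2))
```

A pair is then classified as a match if the score exceeds a threshold, which is a further parameter.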
Classification problem
• Entity Resolution as classification problem
Task
• Determine a logistic regression classifier based on given similarity vectors and a training data set
• Evaluate different training data set sizes by determining quality, variance, …
Advanced
• Investigate the impact of different classifiers on the similarity vectors
Technology
• SparkML
• Logistic Regression, K-Means
[Figure: similarity vectors plotted over the axes name and description]
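What "entity resolution as a classification problem" means can be shown with a tiny logistic regression written in plain Python. This is a stand-in for SparkML's logistic regression, not its API: each record pair is described by a similarity vector (here name and description similarity), the label says whether the pair is a match, and the model is trained by batch gradient descent. Training data, learning rate and epoch count are made-up illustration values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(vectors[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(vectors) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(vectors)
    return weights, bias


def is_match(x, weights, bias):
    """Classify a similarity vector as match (True) or non-match (False)."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5


# Hypothetical training data: (name similarity, description similarity).
vectors = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.9), (0.95, 0.7),
           (0.1, 0.2), (0.2, 0.1), (0.3, 0.2), (0.2, 0.4)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(vectors, labels)
print(is_match((0.9, 0.9), weights, bias), is_match((0.1, 0.1), weights, bias))
```

Repeating the training with differently sized samples of the labelled pairs gives the quality and variance figures asked for in the task.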
Graph-based Similarities for medical concepts
• Relatedness is determined by using a knowledge base such as an ontology
• Ontologies represent the backbone of the Semantic Web
• Structure knowledge by defining concepts and relations between concepts, such as "Heart infarction", "diabetes mellitus", …
• Hierarchical structure of concepts
Related?
Concept Similarity
• Similarities based on basic measures
[Figure: example concept hierarchy with the nodes A to L; edges are isa relations (e.g. Heart infarction isa disease), ranging from general to specialized concepts]
Measure      Concept(C)
#subsumer    2
#leaves      3
(Other build steps of the slide show #leaves values of 4 and 5.)
Local depth-first search based implementation needs more than a day up to weeks!
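The basic measures can be computed on a small DAG with the following Python sketch. It assumes that the subsumers of a concept are all its ancestors and that its leaves are all childless concepts reachable below it; the toy hierarchy is hypothetical and does not reproduce the graph from the figure. Sets are used instead of counters so that concepts reachable over several paths are counted once.

```python
from functools import lru_cache

# Hypothetical toy DAG: concept -> direct parents (isa relations).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
    "F": ["D"],
}

# Invert the relation to get the direct children of every concept.
CHILDREN = {concept: [] for concept in PARENTS}
for concept, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].append(concept)


@lru_cache(maxsize=None)
def subsumers(concept):
    """All ancestors of a concept (the concept itself excluded)."""
    result = set()
    for parent in PARENTS[concept]:
        result.add(parent)
        result |= subsumers(parent)
    return frozenset(result)


@lru_cache(maxsize=None)
def leaves(concept):
    """All childless concepts at or below the given concept."""
    if not CHILDREN[concept]:
        return frozenset([concept])
    result = set()
    for child in CHILDREN[concept]:
        result |= leaves(child)
    return frozenset(result)


print(len(subsumers("D")), len(leaves("C")))
```

The memoized recursion avoids recomputing shared sub-hierarchies, but it still runs on one machine and holds a set per concept in memory, which motivates a parallel traversal for millions of concepts.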
Task & Requirements
Data
• Directed acyclic graph (DAG) extracted from the Unified Medical Language System
• 2.2 million concepts, 2.9 million relations
Task
• Parallel traversal algorithm to determine the measures
Technology
• Apache Flink/Gelly
Topic Overview

| Topic | Framework | #Students | Supervisor |
|---|---|---|---|
| Is the globe really warming? | SparkR | 2 | Lin |
| Analysing Metabolic Networks in Gradoop | Gradoop/Flink | 2 | Groß |
| Analyzing Panama Papers with Gradoop | Gradoop/Flink | 2 | Peukert |
| Analytics of Development Project Data | Gradoop/Flink | 2 | Peukert |
| Analyse LOD datasets within Gradoop | Gradoop/Flink | 2 | Nentwig |
| Analytics of Publication Data with Graphulo | Apache Accumulo | 2 | Kricke |
| Classification of program traces using TensorFlow or Caffe | TensorFlow or Caffe/Python/C++ | 2 | Grimmer |
| Speed up Entity Resolution with Bit Arrays | | 2 | Sehili |
| Graph-based Similarities for medical concepts | Flink | 2 | Christen |
| Parameter Tuning for Entity-Resolution Problems | SparkML | 2 | Christen |