www.mbds-fr.org
« From data bases to big data »(7 lectures)
MBDS graduate course
Professor Serge Miranda
Dept of Computer Science
University of Nice Sophia Antipolis
(member of Université Côte d’Azur, UCA)
Director of MBDS Master degree
(www.mbds-fr.org)
BIG DATA : N.O. SQL and NEW SQL
(Lecture 7)
Professor Serge MIRANDA
Dept of Computer Science
University of Nice Sophia Antipolis (UCA)
Director of MBDS Master degree
(www.mbds-fr.org)
Copyright Big Data Pr Serge Miranda, MBDS, Univ de Nice Sophia Antipolis (UCA)
BIG DATA management systems
➢ TOP DOWN approach for structured and semi-structured DATA
  ➢ SQL2, SQL3, ODMG
  ➢ Semantic Web (SPARQL, OWL)
➢ BOTTOM UP approach for UNSTRUCTURED DATA
  ➢ N.O. SQL (NOT ONLY SQL)
  ➢ NEWSQL
Bottom up approach for unstructured data (no schema, no metadata)
➢ « N.O. SQL » (Not Only SQL) <meaning non-relational>
  ➢ « KEY/VALUE paradigm »
  ➢ GRAPH paradigm
➢ « NEW SQL »
  ➢ « SQL paradigm »
« COMPLEX » data: SQL3, N.O. SQL and NEWSQL?
[Figure (MIRA2013): systems placed along two axes, PROCESSING (SQL vs. noSQL) and DATA STRUCTURE (complex structured data, top down with schema, vs. complex unstructured data, bottom up with no schema and no metadata). Labels shown: OR-DBMS, OO-DBMS, SQL3, ODMG, N.O. SQL, NEW SQL.]
« BASE » properties
➢ BASE:
  ➢ Basically Available
  ➢ Soft state (systems that scale OUT)
  ➢ Eventually consistent (final consistency)
    ➢ Replica consistency; cross-node consistency
➢ CAP theorem (Eric Brewer, Prof. Berkeley, 2000 & 2012; formally proved by Gilbert and Lynch, MIT, 2002)
  ➢ Consistency (SQL)
  ➢ Availability
  ➢ Partition tolerance (NO SQL)
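Eventual consistency can be illustrated with a minimal Python sketch. The `Replica` class, its `write`/`read`/`merge` methods and the last-write-wins rule based on a timestamp are assumptions made for illustration; they are not taken from the lecture or from any particular NoSQL product.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica holding key -> (timestamp, value)."""
    data: dict = field(default_factory=dict)

    def write(self, key, value, ts):
        # Accept a write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge(self, other):
        # Anti-entropy step: pull every entry held by another replica.
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)


a, b = Replica(), Replica()
a.write("pilot:1", "Nice", ts=1)    # client 1 writes to replica a
b.write("pilot:1", "Paris", ts=2)   # client 2 writes to replica b later
stale = a.read("pilot:1")           # replicas disagree for a while
a.merge(b)
b.merge(a)                          # after the exchange both hold the same value
```

Between the two writes and the merge, a reader of replica `a` sees an old value: availability is kept at the price of temporary inconsistency, which is the BASE trade-off.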
CAP Theorem : « Pick 2 ! » (Brewer 2000 ; 2012)
N.O. SQL (Not Only SQL)
4 « no »:
1. no SCHEMA (schema-less; Variability) & NO METADATA
2. no RELATIONAL / NO JOIN (extract data without joins)
3. no DATA FORMAT (graph, document, row, column)
4. no (ACID) transactions (CAP theorem; BASE)
(1998) + VOLUME + VELOCITY + VARIETY (+ VALUE…)
2 complementary approaches for big data management

| | SQL | N.O. SQL |
|---|---|---|
| VOLUME & VARIETY | STRUCTURED (SCHEMA), TERA/PETA bytes | Unstructured (no schema), EXA/ZETTA++ bytes |
| VELOCITY | NO | YES |
| TRANSACTIONS | YES (ACID and Gray’s theorem) | NO (BASE & CAP theorem) |
| SCALABILITY | UP (scale up) | OUT (scale out) |
| USER INTERFACE | AD HOC queries, JOIN & transaction oriented | Predefined queries, NO JOIN & decision oriented |
| STANDARDS | SQL3/ODMG | Not yet (BIG SQL) |
| Typical approach | TOP DOWN (predefined schema) | BOTTOM UP (no schema) |
| Administrator | Yes | No |
| Vendor support | Yes | No (Open Source) |
N.O. SQL and Web actors
➢ Google: Map Reduce, BigTable & BIG QUERY SQL (Data/ANALYTICS as a Service)
➢ Yahoo!: Hadoop, S4
➢ Amazon: Dynamo, S3
➢ Facebook: Cassandra, Hive
➢ Twitter: Storm, FlockDB
➢ LinkedIn: Kafka, SenseiDB, Voldemort, etc.
Taxonomy of BIG DATA Systems
« N.O. SQL » DBMS
4 data paradigms: 3 KEY-VALUE oriented and one GRAPH oriented
➢ KEY-VALUE with BLOBS (Binary Large Objects)
  ex: Hadoop, Cassandra, Riak, Redis, DynamoDB, BerkeleyDB, etc.
  ➔ HASHING arrays (no query engine)
➢ KEY-VALUE with JSON/XML documents
  ex: MongoDB, CouchDB, etc.
  ➢ JSON is simpler than XML, with a JavaScript interface
  ➢ <KEY, VALUE> model with VALUE in JSON (BSON, XML) for documents
➢ KEY-VALUE with COLUMNS
  ex: HBASE, Cassandra, BigTable/Google, …
  ➢ <KEY, (SET of columns, VALUE, TIMESTAMP)>
➢ GRAPH oriented
  ex: Neo4j, OrientDB…: towards GQL (Graph Query Language)
KEY-VALUE NOSQL and SQL convergence
➢(KEY, VALUE) pairs
➢(primary key, relational TUPLE)
➢Like (OID, VALUE) for OBJECTS
➢ Hash tables for access
In SQL: CREATE TABLE PAIR (KEY varchar PRIMARY KEY, VALUE blob)
➢ 4 basic operators
  ➢ INSERT/DELETE/UPDATE pair
  ➢ FIND value for a key
➢ Ex: Cassandra, Redis, Voldemort, Memcached, Riak, Dynamo (Amazon), CACHE (InterSystems), CouchDB, BIG TABLE, Berkeley DB, …
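The four basic operators can be sketched in a few lines of Python. The `KeyValueStore` class and its method names are hypothetical and chosen for illustration only; real key-value systems expose their own APIs.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing the four basic operators."""

    def __init__(self):
        self._pairs = {}  # a Python dict is a hash table

    def insert(self, key, value):
        if key in self._pairs:
            raise KeyError(f"key already exists: {key!r}")
        self._pairs[key] = value

    def update(self, key, value):
        if key not in self._pairs:
            raise KeyError(f"unknown key: {key!r}")
        self._pairs[key] = value

    def delete(self, key):
        del self._pairs[key]

    def find(self, key):
        # The store never looks inside the value: it is an opaque blob.
        return self._pairs.get(key)


store = KeyValueStore()
store.insert("pilot:1", b'{"name": "Serge"}')
store.update("pilot:1", b'{"name": "Serge", "city": "Nice"}')
found = store.find("pilot:1")
store.delete("pilot:1")
missing = store.find("pilot:1")
```

Because the value is opaque, any query other than "find by key" has to be done by the application, which is the price paid for simple hashing-based access.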
REST(REpresentational State Transfer)
➢ 2 communication modes for client-server
  ➢ RPC (Remote Procedure Call), which is connection oriented (TCP)
  ➢ REST (Representational State Transfer), which is service oriented
  ➢ REST is based upon HTTP
➢ DATA access facility
➢ 6 HTTP methods used by REST: GET, HEAD, PUT, POST, DELETE, OPTIONS
➢ RESTful NO SQL systems: COUCHDB, HBASE, NEO4J, RIAK...
SCALE OUT & SHARDING (fragmentation)
➢distributed DATA partitioning for parallel processing
➢SHARD KEY : key for data partitioning
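As a rough illustration of a shard key, the Python sketch below routes records to shards by hashing the key. The function name `shard_for`, the choice of MD5 and the number of shards are assumptions for the example; actual systems use their own partitioning schemes (range-based, consistent hashing, etc.).

```python
import hashlib
from collections import defaultdict


def shard_for(shard_key: str, n_shards: int) -> int:
    # A stable hash is used (Python's built-in hash() is salted per process
    # for strings), so every node computes the same placement.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


flights = [
    {"F#": "AF100", "PL#": 1, "DC": "Nice", "AC": "Paris"},
    {"F#": "AF101", "PL#": 1, "DC": "Paris", "AC": "Toulouse"},
    {"F#": "AF104", "PL#": 2, "DC": "Toulouse", "AC": "Lyon"},
]

shards = defaultdict(list)
for flight in flights:
    # PL# is chosen as the shard key: all flights of a pilot share a shard.
    shards[shard_for(str(flight["PL#"]), n_shards=3)].append(flight["F#"])
```

Choosing `PL#` as the shard key keeps all flights of one pilot on the same node, so a per-pilot query touches a single shard; a poorly chosen key would scatter them.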
HADOOP ecosystem for the 1st « V » of BIG DATA (VOLUME)
HADOOP ecosystem around HDFS and MAP REDUCE :
PIG LATIN* (script) developed by Yahoo, HIVE (data warehouse) by Facebook (HIVEQL), …
* PIG Latin offers SQL-like operators (Join, Group By, Union) but with a procedural approach for batch processing; as in HIVE, UDFs (user-defined functions) are possible
Hadoop (Map Reduce) ?
➢OPEN SOURCE (Apache Foundation) written in JAVA ;
➢HADOOP (Map-Reduce implementation) : Created by Doug Cutting from Yahoo (then Open Source)
➢HDFS
➢HBASE : column-oriented key-value data store
➢From GOOGLE :
➢Google Map Reduce (2004) ➔ HADOOP
➢Google Filesystem (GFS) 2003 ➔ HDFS
➢Google BIG TABLE (Distributed hashing table over GFS)➔ HBASE
➢HADOOP distributions (Linux initially)
➢Cloudera (Impala)
➢Hortonworks (Windows version)
➢ MAPR (HDFS-centric)
➢and for the cloud : EMR- Elastic Map Reduce- (Amazon, 2009)
Hadoop Map-Reduce (MR)
➢ Map-Reduce architecture consists of
  ➢ one JobTracker
  ➢ several TaskTrackers, in charge of executing map-reduce tasks on each machine
➢ With YARN, evolution of the MR architecture: the JobTracker role is split into
  ➢ RESOURCE MANAGER
  ➢ APPLICATION MASTER (AM) <not only Map Reduce>
  ➢ SCHEDULER
MAP REDUCE processing steps
DATA distribution approach instead of PROGRAM distribution (parallelization)
Creation of (KEY, VALUE) pairs…
4 steps:
➢ SHARDING/SPLITTING input data for parallel processing
➢ Mapping BLOCKS to create values associated with keys (key, value)
➢ Shuffling (sorting) by keys
➢ Reducing groups with an aggregate value for each key
Typical example for MAP REDUCE (word counting in a given text)
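The word-count example can be sketched in plain Python, following the four steps (split, map, shuffle, reduce). This is a single-process illustration, not Hadoop code: the function names and the sample text are assumptions, and nothing here runs in parallel.

```python
from collections import defaultdict


def split_input(text, n_splits):
    # Step 1: cut the input into blocks that could be processed in parallel.
    lines = text.splitlines()
    return [lines[i::n_splits] for i in range(n_splits)]


def map_block(lines):
    # Step 2: emit one (key, value) pair per word occurrence.
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    # Step 3: group values by key, with keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_group(key, values):
    # Step 4: one aggregate value per key.
    return key, sum(values)


text = "big data\nbig table\nno sql data"
mapped = [pair for block in split_input(text, 2) for pair in map_block(block)]
counts = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster each block would be mapped on the node that stores it, and only the (key, value) pairs would travel over the network during the shuffle.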
Example * MAP REDUCE for JOINing 2 tables ? ☺
Pilot PL# PLNAME
1 Serge
2 Leo
FLIGHT1 F# PL# DC AC
AF100 1 Nice Paris
AF101 1 Paris Toulouse
AF104 2 Toulouse Lyon
* Example inspired from book « Big Data et Machine Learning » P.Lemberger et al, Dunod, 2018 < In French>
MAP (Option with coalescence of DATA INPUT+Splitting)
Sharding at the tuple level
Pilot 1 Serge
Pilot 2 Leo
FLIGHT1 AF100 1 Nice Paris
FLIGHT1 AF101 1 Paris Toulouse
FLIGHT1 AF104 2 Toulouse Lyon
MAP & Shuffling with KEY= PL#(sorting)
Pilot 2 Leo
FLIGHT1 AF104 2 Toulouse Lyon
Pilot 1 Serge
FLIGHT1 AF100 1 Nice Paris
FLIGHT1 AF101 1 Paris Toulouse
KEY=1; (KEY, tuple)
KEY=2; (KEY, tuple)
VALUEs
REDUCE (aggregation; tuple fusion) on each partition and final result…
JOIN ☺
1 Serge AF100 Nice Paris
1 Serge AF101 Paris Toulouse
2 Leo AF104 Toulouse Lyon
REDUCE
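The Pilot/FLIGHT1 join can be reproduced with a small reduce-side join in Python. The function names and the tagging of tuples with their table of origin are illustrative assumptions; the data are the two tables of the example.

```python
from collections import defaultdict

pilots = [(1, "Serge"), (2, "Leo")]
flights = [
    ("AF100", 1, "Nice", "Paris"),
    ("AF101", 1, "Paris", "Toulouse"),
    ("AF104", 2, "Toulouse", "Lyon"),
]


def map_phase():
    # Tag each tuple with its table of origin and emit it under KEY = PL#.
    for pl, name in pilots:
        yield pl, ("Pilot", name)
    for f, pl, dc, ac in flights:
        yield pl, ("FLIGHT1", f, dc, ac)


def shuffle(pairs):
    # Bring together all tuples that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Fuse each Pilot tuple with each FLIGHT1 tuple sharing the same key.
    names = [v[1] for v in values if v[0] == "Pilot"]
    legs = [v[1:] for v in values if v[0] == "FLIGHT1"]
    return [(key, name, *leg) for name in names for leg in legs]


joined = []
for key, values in sorted(shuffle(map_phase()).items()):
    joined.extend(reduce_phase(key, values))
```

The join condition never appears explicitly: it is implied by the choice of `PL#` as the key, since the shuffle is what brings matching tuples to the same reducer.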
Map reduce issues
➢ Split/shard size? 64 MB (HDFS block size)
➢ KEY selection?
➢ Processing with these 2 functions, MAP & REDUCE?
➢ Hadoop framework complexity?
  ➢ Batch-oriented (days to validate a map reduce job)
➢ Map and reduce coding complexity!
  ➢ scarce implementation of Map & Reduce
  ➢ use of SCRIPT language (PIG)
  ➢ use of SQL-like interface (HIVE, SPARK SQL)
HADOOP Ecosystem
➢ HDFS or API for other distributed FS like S3 (Amazon)
➢MAHOUT for Machine learning in Java
➢ SPARK (2014)with MLlib
➢ ZooKeeper (coordination), Oozie (job plan), Flume (data flow), RHadoop (for R developers), Sqoop (data transfer with RDBMS) and… Apache STORM (real-time big data…)
➢ ---------------- next : Interactivity (>> batch) + SQL (>> scripts)
➢ Impala : MPP SQL engine
➢ DRILL (with Zookeeper) like Big Query (Google with Dremel)
HIVE (http://hive.apache.org/)
➢ 2009: Facebook
➢ HiveQL 1.0 (Feb. 2015)
Tabular model on HADOOP with SQL-like interface
Transformation of HiveQL queries into MAP REDUCE jobs
➢ CREATE TABLE PILOT (PIL# INT, PLNAME STRING, ADDR STRUCT<Street:STRING, City:STRING, Zip:INT>)
➢ Only EQUI JOINS (inner join, left outer join, right outer join)
DOCUMENT*-oriented NO SQL DBMS(KEY/VALUE with Value = JSON document)
➢MONGODB
➢COUCHBASE (with N1QL)
Note : *DOCUMENT : generally JSON (BSON, XML)
MONGODB
Document-oriented
➢ « DOCUMENT »: set of « attribute-VALUE » pairs
  Example: { FirstN: 'Serge', status: 'Professor' }
  ➢ Tuple in Codd’s relational model
  ➢ Maximum size for a document: 16 MB
  ➢ Every document has a unique key (ObjectID: 96 bits, i.e. 12 bytes)
➢ « Collection »: set of « documents »
  ➢ TABLE in the relational data model
  ➢ Example:
db.createCollection('PILOTS') <creation of the PILOTS collection>
show collections <to see available collections>
db.PILOTS.insert({ Name: 'JOHN', status: 'Steward' }) <to insert documents>
OID for each document in MongoDB
➢ « DATA STORE / DATA BASE » = set of COLLECTIONS
Native host language of MONGODB: JAVASCRIPT
MONGODB (document and FIND operator)
➢ MONGODB « document »:
Example: document PILOT (with its OID)
{'PIL#': '10',
 PILNAME: 'Serge',
 ADDR: ['Nice', 'TOULOUSE'],
 Flight:
 [{'F#': 'IT100', DC: 'NICE', AC: 'TOULOUSE'}, {'F#': 'IT106', DC: 'Toulouse', AC: 'PARIS'}]}
db.createCollection('pilot') <to create the collection>
pilot = db['pilot'] <to get access to the collection>
➢ FIND: db.collection-name.find(query, fields)
  ➢ Query to select documents: {KEY: value, …} or {KEY: {$op: value}}, combined with the logical operators $and, $or, $not
  ➢ Fields to select attributes: {field: boolean, …}
  ➢ Example: db.pilot.find({'Flight.DC': 'NICE'}), then loop over the returned documents
MONGODB (UPDATE & REMOVE)
➢ UPDATE
db.collection-name.update(query, update, options) with:
  ➢ Update: $set (field definition), $unset (delete field), $inc (incrementation), $push (append an array element), $pull (delete an array element)
  ➢ Options:
    ➢ upsert: true (creation of a new document when none matches the query)
    ➢ multi: true (modification of all documents corresponding to the query)
➢ REMOVE
db.collection-name.remove(query, justOne)
  ➢ justOne: true to delete one document only
  ➢ Example: db.pilot.remove({ADDR: 'Nice'})
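To make the effect of the update operators concrete, here is a pure-Python simulation on a dictionary. It does not use MongoDB or any driver: `apply_update` is a hypothetical helper that only handles top-level fields and equality-based `$pull`, which is far less than the real operators support.

```python
import copy


def apply_update(document, update):
    """Return a copy of `document` after applying MongoDB-style operators."""
    doc = copy.deepcopy(document)
    for name, value in update.get("$set", {}).items():
        doc[name] = value                       # define or overwrite a field
    for name in update.get("$unset", {}):
        doc.pop(name, None)                     # delete a field
    for name, amount in update.get("$inc", {}).items():
        doc[name] = doc.get(name, 0) + amount   # increment a numeric field
    for name, item in update.get("$push", {}).items():
        doc.setdefault(name, []).append(item)   # append an array element
    for name, item in update.get("$pull", {}).items():
        doc[name] = [x for x in doc.get(name, []) if x != item]
    return doc


pilot = {"PILNAME": "Serge", "ADDR": ["Nice", "TOULOUSE"], "hours": 100}
step1 = apply_update(pilot, {
    "$set": {"status": "Professor"},
    "$inc": {"hours": 5},
    "$push": {"ADDR": "Paris"},
})
step2 = apply_update(step1, {
    "$pull": {"ADDR": "TOULOUSE"},
    "$unset": {"hours": ""},
})
```

The two updates are applied separately on purpose: in MongoDB itself, modifying the same field (`ADDR` here) with two operators in a single update is rejected as a conflict.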
COUCHBASE (from COUCH DB)
➢ MEMBASE + CouchOne (CouchDB) ➔ COUCHBASE
➢ Document-oriented (JSON documents)
➢ Developed in ERLANG (Ericsson functional language)
➢ REST (REpresentational State Transfer) interface
➢ Eventual consistency (BASE)
➢ SQL-like interface with N1QL
➢ NEST/UNNEST operators (for complex objects)
COUCHBASE (N1QL example)
➢ Example with JSON Pilots documents encompassing flights documents with DC (departure city)
➢ Then the query « Select pilots living in Nice and operating a flight from Nice » is written in N1QL:
SELECT PL#, PLNAME
FROM Pilots
WHERE ANY f IN flights SATISFIES f.DC = 'Nice' END AND ADDRESS = 'Nice'; *
NOTE: * “EVERY” instead of “ANY” would correspond to the division operator in relational algebra, while ANY corresponds to JOIN
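The difference between ANY and EVERY can be checked on nested documents with Python's built-in `any()` and `all()`. The sample documents below are invented for the illustration (the third pilot and the flight numbers are hypothetical); only the field names follow the example.

```python
pilots = [
    {"PL#": 1, "PLNAME": "Serge", "ADDRESS": "Nice",
     "flights": [{"F#": "AF100", "DC": "Nice"}, {"F#": "AF101", "DC": "Paris"}]},
    {"PL#": 2, "PLNAME": "Leo", "ADDRESS": "Nice",
     "flights": [{"F#": "AF104", "DC": "Toulouse"}]},
    {"PL#": 3, "PLNAME": "Ana", "ADDRESS": "Nice",
     "flights": [{"F#": "AF200", "DC": "Nice"}]},
]


def select(quantifier):
    # quantifier is any (at least one flight matches) or all (every flight matches).
    return [
        p["PLNAME"]
        for p in pilots
        if p["ADDRESS"] == "Nice"
        and quantifier(f["DC"] == "Nice" for f in p["flights"])
    ]


with_any = select(any)     # at least one flight departs from Nice
with_every = select(all)   # all flights depart from Nice
```

Serge has one flight out of two leaving from Nice, so he satisfies ANY but not EVERY; no join is needed because the flights are nested inside the pilot document.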
COLUMN-oriented* NOSQL DBMS(KEY- VALUE paradigm)
KEY➔Column family (sub-table)➔column➔VALUE with timestamp
➢CASSANDRA (with CQL)
➢HBASE (Hadoop)
➢BigQuery (Google)
➢BerkeleyDB (Oracle)
➢…
* Decision-oriented
HBASE* (http://hbase.apache.org)
➢ Google File System (2003) + MAP REDUCE ➔ Apache HADOOP library
➢ (Google) « BIG TABLE »
➢ On top of HADOOP
➢ Column-based storage
➢ Key-value access
➢ Applications: petabyte-scale DBs across thousands of commodity servers
➢ Open TIME SERIES data bases
➢USED by : FACEBOOK (like counts...), TWITTER (R/W back up ; time series), YAHOO (document fingerprint...)
* « HBASE essentials » N.Garg, Packt Publishing, Nov 2014
HBASE
➢ Schema-less data base
  ➢ Flexibility to store any data TYPE without defining it
➢ Each row can store different columns
➢ To insert data in a table: put '<table name>', '<row key>', '<column family>:<column>', '<value>'
➢ get '<table name>', '<row key>'
➢ scan '<table name>'
➢ describe '<table name>'
➢ delete '<table name>' …
HBASE storage model : HFILE
➢ HFile ➔ storage unit for a column family (like SSTable in Cassandra)
➢ « TABLES » and « ROWS » with « columns »
➢ Rows in HBASE ➔ ROW KEYS
➢« Column family » (various columns) ➔ HFILE (Key-value pairs)
➢DB update with WAL protocol (Write-Ahead log protocol)
CASSANDRA
➢COLUMN-ORIENTED
➢ BIG TABLE from Google + DYNAMO from Amazon ➔ CASSANDRA
History: Facebook, then in 2008 DYNAMO developers (from Amazon) and Microsoft, then APACHE in 2009
DYNAMO: {(KEY, VALUE)} distributed with no centralized control
➢ Customers: APPLE, NETFLIX, eBAY, INSTAGRAM… CALL of DUTY (> 100 M gamers)
➢ NOSQL DB with JAVA API
➢ CQL (SQL-like query language)
➢ No predefined schema (until 2014 and CQL3)
➢ Each row can have different columns
➢ « Column families » (then « tables ») are the storage units for keys (MEMTABLE in memory, flushed to SSTABLE files)
➢Each column encompasses a triple : NAME (UID possible), VALUE and TIMESTAMP
➢DATA update with WAL protocol (commit log)
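The write-ahead log principle (log first, then update the in-memory structure) can be sketched as follows. `WalStore`, its JSON-lines log format and the file name `commit.log` are assumptions for illustration and do not describe the actual on-disk formats of Cassandra or HBase.

```python
import json
import os
import tempfile


class WalStore:
    """Hypothetical key-value store with a write-ahead (commit) log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()

    def _replay(self):
        # Recovery: rebuild the in-memory state from the log, if there is one.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def put(self, key, value):
        # 1) append to the log and force it to disk, 2) only then update memory.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.memtable[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "commit.log")
    store = WalStore(path)
    store.put("AF100", "Nice-Paris")
    store.put("AF100", "Nice-Lyon")
    recovered = WalStore(path).memtable   # simulates a restart after a crash
```

Since every write reaches the log before memory, replaying the log in order after a crash restores the last acknowledged value of each key.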
CQL3 (since 2014) : Create TABLE…
➢Create TABLE < replacing column family > ; no foreign keys
➢CREATE/ALTER/DROP/USE KEYSPACE (data base)
➢Create TRIGGER, CREATE INDEX ; EX : Create Index on flight (DC)
➢SELECT ..FROM..WHERE <primary and secondary keys> .. ORDER BY with LIMIT/ALLOW FILTERING clauses ; no GROUP BY
Example :
CREATE TABLE PILOTS (PIL# bigint PRIMARY KEY, PILNAME text, ADDRESS text)
CREATE TABLE FLIGHT (F# bigint,
Date_comment timestamp, <comment on the flight>
Author varchar,
Content text,
PRIMARY KEY (F#, Date_comment))
CQL3 (Sub queries)
➢ No query optimization
➢ IN : « Semi Join »
➢ NOT IN : « anti join »
Generic example :
SELECT PILNAME
From PILOT
JOIN EACH FLIGHT ON Pilot.pil#=flight.pil# and DC =‘Nice’ ;
… and Google Big Query(https://bigquery.cloud.google.com) ?
3 major Google contributions to Big Data ecosystem :
1. GFS (GOOGLE FILE SYSTEM) ➔ HDFS
2. BIG TABLE (column oriented) ➔ HBASE
3. MAP REDUCE ➔ HADOOP
DREMEL (Google) 2006
➢ BIG TABLE (on GFS) ➔ DREMEL (on COLOSSUS)
  ➢ Distributed query engine (scale OUT)
  ➢ Column-oriented DB
  ➢ BIGQUERY SQL interactive interface (proprietary) with NEST/UNNEST
  ➢ SCHEMA
  ➢ CLOUD-based approach
➢ Note: GMAIL is built on top of DREMEL
➢ Parallel execution on thousands of machines
  ➢ > n (100 000 disks)
  ➢ > p (10 000 processors) <SCALE OUT>
  ➢ 50 GB/s with response time < 5 s
  ➢ <VIRTUAL CLUSTER unit>
Wikipedia benchmark with Big Query in 2012
➢ 2011: Wikipedia sample data set bigquery-samples:wikipedia_benchmark
➢ [Examples of BigQuery SQL from TIGANI2014]
SELECT language, SUM(views) AS views
FROM [bigquery-samples:wikipedia_benchmark.<size>] <size from 1K to ½ Tera>
WHERE REGEXP_MATCH(title, 'G.*O.*O.*G')
GROUP BY language
ORDER BY views DESC
Counting the word « KING » in SHAKESPEARE‘s plays with Big Query
(publicdata:samples.shakespeare) <1.6 GB>
SELECT LOWER (Word) AS word, word_count AS frequency, corpus
FROM [publicdata:samples.shakespeare]
WHERE corpus CONTAINS ‘king’ AND LENGTH (Word) > 5
ORDER BY frequency DESC
LIMIT 10
Some SQL examples on KEY-VALUE NO SQL DBMS with N1QL (Couchbase), CQL (Cassandra), …
Typical Example :
N1QL:
SELECT PIL#, PILNAME
FROM Pilot
WHERE ANY F IN Flight SATISFIES F.DC = 'Nice' END AND ADDR = 'Nice';
CQL3:
SELECT PIL#, PILNAME
FROM PILOT
JOIN EACH FLIGHT ON PILOT.PL# = FLIGHT.PL# AND DC = 'Nice' AND ADDR = 'Nice';
Graph-oriented NO SQL :NEO4J (2000)
➢ J: Java (NEO4J was developed in Java)
➢ Implementation and management of GRAPHs (Node, Relationship, Property)
➢ DATA storage in a directory, with REST interface
➢ 2 specific languages: CYPHER (SQL flavour) and GREMLIN (script language based on Groovy)
➢ Example: node creation with CYPHER
CREATE n = {pilname: 'serge', ADDR: 'Nice'} <first node takes number 1>
➢Applications : Walmart, Cisco, Twitter
![Page 47: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/47.jpg)
NEO4J
➢START Serge=node(1)
CREATE flight100 = {f#: 'IT100', DC: 'Nice', AC: 'Toulouse'}
       Serge-[r:insure]->flight100
RETURN flight100
➢START Serge=node(1)
MATCH Serge-[r:insure*]->node(2)
RETURN r
* : indefinite number of hops along the path (variable-length relationship)
![Page 48: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/48.jpg)
NEO4J (MATCH clause)
➢Match clause :
START flight=node:flight(DC = 'Nice')
MATCH passenger-[:LIKE]->flight
RETURN passenger
➢Start (query) :
START pilot=node:pilot(pilname = 'serge')
RETURN pilot

MATCH (p:pilot)
USING INDEX p:pilot(pilname)
WHERE p.pilname = 'serge'
RETURN p
![Page 49: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/49.jpg)
Towards GQL (Graph Query Language)*
➢ for graph-based NO SQL systems
➢There are planned extensions to SQL for graph queries
  ➢Neo4j with Oracle, Microsoft, IBM and SAP, with Cypher ; <limited enhancements for SQL>
➢The property graph data model is a superset of the tabular SQL model ➔ hence the goal of a graph query language, GQL, that complements SQL.
➢*Proposed standard to the SQL Committee, Alastair Green, 30 May 2018 https://db-engines.com/en/blog_post/78
![Page 50: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/50.jpg)
Benefits of Fusing Three Languages into One Standard GQL (https://db-engines.com/en/blog_post/78)
➢There are three existing pure graph languages which use a shared « graph pattern » for inserting, updating or extracting data from a property graph (comparison in gql.today)
➢PGQL comes from Oracle PGX (first appearing in 2016)
➢openCypher started out in Neo4j’s graph database in 2011 and is now used in other commercial products
➢G-CORE is a research language, described in a SIGMOD 2018 paper (LDBC Query Language task force)
![Page 51: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/51.jpg)
Towards GQL : initiating an industry-standard graph query language
GQL Manifesto (gql.today) :
SELECT
FROM GRAPH
MATCH < Graph Pattern : sub graph>
WHERE
![Page 52: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/52.jpg)
Example with GQL
Query : Name of the pilots who insure a flight from Nice with a Boeing 747 ?
SELECT
pl.name AS pilot_name
FROM GRAPH pilotflightsplanes
MATCH // graph pattern
(pl:Pilot)-[:INSURES]->(f:Flight)<-[:IS_USED]-(p:Plane)
WHERE
p.name = 'B747' AND f.DC = 'Nice';
➢The pattern means that all data in the graph matching this sequence of nodes and edges (each of which has a particular « label » or element type) will be identified.
➢This operation lifts a « sub-graph » or « projected graph » of flights for a particular pilot into the application.
➢Properties on all matching instances of :Pilot nodes or :INSURES edges can now be read by the application.
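The graph pattern of this query can be sketched as a naive matcher over an in-memory property graph. This is only an illustration of the matching idea, not a GQL engine: the node ids, the departure cities and the helper `pilots_on` are hypothetical, and edges are stored as (source, label, target) tuples.

```python
# Hypothetical property graph: nodes keyed by id, edges as (source, label, target).
nodes = {
    "serge": {"label": "Pilot", "name": "Serge"},
    "peter": {"label": "Pilot", "name": "Peter"},
    "AF100": {"label": "Flight", "DC": "Nice"},
    "AF110": {"label": "Flight", "DC": "Paris"},
    "AF102": {"label": "Flight", "DC": "Nice"},
    "a320": {"label": "Plane", "name": "A320"},
    "b747": {"label": "Plane", "name": "B747"},
}
edges = [
    ("serge", "INSURES", "AF100"),
    ("peter", "INSURES", "AF110"),
    ("serge", "INSURES", "AF102"),
    ("a320", "IS_USED", "AF100"),
    ("b747", "IS_USED", "AF110"),
    ("b747", "IS_USED", "AF102"),
]

def pilots_on(plane_name, departure):
    """Match (Pilot)-[INSURES]->(Flight)<-[IS_USED]-(Plane), then apply the WHERE filter."""
    names = []
    for pilot, label1, flight in edges:
        if label1 != "INSURES" or nodes[pilot]["label"] != "Pilot":
            continue
        for plane, label2, flight2 in edges:
            if label2 == "IS_USED" and flight2 == flight:
                if (nodes[plane]["name"] == plane_name
                        and nodes[flight]["DC"] == departure):
                    names.append(nodes[pilot]["name"])
    return names

matches = pilots_on("B747", "Nice")
```

A real engine would use indexes on labels and adjacency lists instead of the nested scan used here.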
![Page 53: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/53.jpg)
NEW SQL
« Replacing real SQL ACID with either no ACID or « ACID lite » just pushes consistency problems into the applications where they are far harder to solve. Second, the absence of SQL makes queries a lot of work »
M. Stonebraker (VoltDB, 2011)
➢BRIDGES between SQL and NO SQL
  ➢New DBMS (main-memory DBMS, etc.), ex : VOLTDB
  ➢Integration of NO SQL access into SQL DBMS, ex : Oracle, IBM, Microsoft, Teradata…
  ➢« EXTERNAL TABLES » (for external NO SQL data stores) in FROM clause
  ➢HIVE driver (for OSS approach)
![Page 54: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/54.jpg)
« From NO SQL to NEW SQL »
« NEW SQL » (on top of SQL) :
➢VoltDB by M. Stonebraker, REDIS, MYSQL, Scale DB, Clustrix, AKIBAN, NUODB (NimbusDB)
➢ and TERADATA BIG DATA, Oracle BIG DATA, BIG QUERY SQL (Google), Microsoft BIG DATA, IBM Big Data…
« Future is polyglot persistence and POLYSTORES »
![Page 55: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/55.jpg)
Architecture of VOLTDB (2009, M. Stonebraker)
➢NO BUFFER POOL
➢NO LOCKING (timestamping)
➢NO WAL (no DATA log : stored procedure log)
➢NO threading overhead
Single threaded
➢No shared data
➢Main memory divided per core
➢Open Source
Shared-nothing architecture (cf SMP & LAN)
➢1 TERABYTE of data ➔ one cluster of 30 nodes with 30 GB per node
➢Throughput (tps) 4 orders of magnitude higher than 25 years ago ! (1000 tps in 1990)
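The sizing and throughput figures on this slide amount to simple arithmetic, sketched below. The variable names are illustrative; the inputs are the slide's own numbers.

```python
nodes = 30
gb_per_node = 30
cluster_memory_gb = nodes * gb_per_node   # 900 GB, i.e. roughly 1 TB of main memory

tps_1990 = 1_000
tps_now = tps_1990 * 10**4                # four orders of magnitude higher
```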
![Page 56: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/56.jpg)
« NEW SQL »* with Oracle (on Oracle 12C ; only with EXADATA)
➢EXTERNAL TABLES
➢HIVE driver
➢JSON documents accessed via SQL
* « Unified query for BIG DATA management » Oracle White Paper, January 2015
![Page 57: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/57.jpg)
« SQL is the lingua franca/esperanto for data management » Oracle 2015
« An object relational Mapper (ORM) can access SQL and NO SQL/Hadoop simply by adding object relations to its existing data stores »
Source:http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf
![Page 58: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/58.jpg)
ORACLE (Oracle NoSQL Database)
Source : https://docs.oracle.com/cd/E26161_02/html/AdminGuide/introduction.html
![Page 59: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/59.jpg)
ORACLE (Big Data SQL for Oracle NoSQL Database)
with HIVE driver and external tables
Source : https://blogs.oracle.com/NoSQL/entry/bigdata_sql_with_oracle_nosql
![Page 60: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/60.jpg)
« Language federation » approach and « EXTERNAL TABLES »
Query franchising and smart SCAN with external tables
![Page 61: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/61.jpg)
Teradata
« Now you could benefit the power of MAP REDUCE with the ease of use of SQL. Before with Hadoop, users were the administrators »
Stephen Brobst, Teradata CTO, (Oct 2012)
![Page 62: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/62.jpg)
Teradata
➢UDA : Unified Data Architecture with Hadoop integration
  ➢HDFS (Hadoop Distributed File System), with SQL
  ➢HCatalog, an open source metadata framework developed by Hortonworks
  ➢« SQL-H » makes it possible to analyze data stored in HDFS using SQL
➢ASTER, from Teradata :
  ➢SQL-Map Reduce, which integrates MAP REDUCE functions within SQL
  ➢Teradata-Aster Big Analytics
![Page 63: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/63.jpg)
SQL-H /Map-Reduce (SQL/MR)*
➢Set of built-in UDF/UDT (User-defined TABLES)
➢2 main functions to enable parallelism
  ➢Row Function (as a mapper) : performs row-level transformations and processing
  ➢Partition Function (sharding) (as a reducer) : processes each group of rows sharing the same PARTITION BY key
➢To invoke SQL Map Reduce functions using SQL through ASTER DB
  ➢GROUP BY with complex functions
  ➢Proprietary ML packages (time series analysis,..)
SELECT ...
FROM function_name (
  ON table-or-query
  [PARTITION BY expression, ...]
  [ORDER BY expression]
  [clausename (arg, ...), ...]
)
*[Eric Friedman et al, 2009]
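The split between a row function and a partition function can be sketched in Python. This is a single-process illustration of the two roles, not Aster's SQL/MR API: the input rows, `row_function`, `partition_function` and `run` are hypothetical names, and a real system would run each partition on a different worker.

```python
from collections import defaultdict

# Hypothetical input rows: (pilot, flight, hours).
rows = [
    ("Serge", "AF100", 2.0),
    ("Peter", "AF110", 1.5),
    ("Serge", "AF102", 3.0),
]

def row_function(row):
    """Mapper-like step: transform one row independently of all others."""
    pilot, flight, hours = row
    return (pilot.upper(), flight, int(hours * 60))  # hours -> minutes

def partition_function(key, group):
    """Reducer-like step: process all rows sharing one partition key."""
    return (key, len(group), sum(minutes for _, _, minutes in group))

def run(rows):
    mapped = [row_function(r) for r in rows]
    partitions = defaultdict(list)          # plays the role of PARTITION BY pilot
    for r in mapped:
        partitions[r[0]].append(r)
    return sorted(partition_function(k, g) for k, g in partitions.items())

summary = run(rows)
```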
![Page 64: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/64.jpg)
Big Data Polybase for Microsoft
➢ SQL SERVER encompasses Hadoop
➢ Excel Interface with Hadoop
➢Sqoop
➢Mahout (data mining for Hadoop)
➢POLYBASE with Hadoop and Azure
![Page 65: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/65.jpg)
BIG DATA with BLU Acceleration for IBM
Source : http://www.redbooks.ibm.com/redbooks/pdfs/sg248212.pdf
![Page 66: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/66.jpg)
BIG DATA properties : « WHAT ! »
➢Web DATA : « semi-structured data », « Open Data », XML, « Linked DATA » / « Semantic web » (RDF paradigm, SPARQL, OWL), triple data store
➢Hadoop/Hive : unstructured data on open source platforms (HBase) / map-reduce (and KEY-VALUE paradigm)
➢Analytics orientation (ML, DL)
➢Real Time data
![Page 67: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/67.jpg)
Big Data Systems ! (Aslett, 2013)
![Page 68: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/68.jpg)
BIG DATA Management SYSTEMS and data paradigms
[Diagram : Big Data SYSTEMS, grouped by properties and data paradigm]
➢TIPS / RICE properties : SQL2, SQL3/ODMG, NEW SQL
  ➢Codd's relational data model : VALUE paradigm
  ➢OBJECT data model : POINTER-VALUE paradigm (SQL3), OBJECT-VALUE paradigm (ODMG)
➢WHAT properties : N.O. SQL, SPARQL (OWL)
  ➢PREDICATE-VALUE (RDF) paradigm (Semantic web)
  ➢KEY-VALUE paradigm (Map Reduce)
![Page 69: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/69.jpg)
Towards BIG SQL and interactive real-time analytics (ML & DL) !
![Page 70: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/70.jpg)
Some key SQL3 extensions
➢« External table » for NO SQL data store
➢<Oracle, Informix, Microsoft, Hive, Sybase, Mysql,..>
➢« PARTITION clause » <Teradata> for map reduce & SHARDING
➢« Match clause » for sub graphs (Cypher and GQL)
➢NEST/UNNEST (N1QL)
➢« LIMIT/OFFSET » from <UnQL, 2011> ; Couchbase and SQLite
https://www.couchbase.com/press-releases/unql-query-language
http://unql.sqlite.org/index.html/wiki?name=UnQL+Syntax+Notes
SELECT optional-expression (JSON object)
FROM collections
WHERE expression
GROUP BY exp list
HAVING expression
ORDER BY exp list
LIMIT expression OFFSET expression
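The LIMIT/OFFSET clause can be tried with Python's built-in sqlite3 module, since SQLite supports the same clause. This is a minimal paging sketch: the `flight` table, its rows and the helper `page` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight (fnum TEXT, pilot TEXT)")
conn.executemany(
    "INSERT INTO flight VALUES (?, ?)",
    [("AF100", "Serge"), ("AF102", "Serge"), ("AF110", "Peter"), ("AF120", "Paul")],
)

def page(conn, limit, offset):
    """Return one page of flight numbers, ordered so that paging is deterministic."""
    cur = conn.execute(
        "SELECT fnum FROM flight ORDER BY fnum LIMIT ? OFFSET ?", (limit, offset)
    )
    return [r[0] for r in cur]

first_page = page(conn, 2, 0)
second_page = page(conn, 2, 2)
```

Without an ORDER BY, the rows returned by LIMIT/OFFSET are not guaranteed to be stable from one call to the next.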
![Page 71: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/71.jpg)
Towards « BIG SQL »
SELECT
FROM
  {T/Tables}, {V/Views} <SQL2 : Create Table.., Create View,..>
  {SQL query} <SQL3>
  {EXTERNAL TABLES} <N.O.SQL DB> <Oracle, IBM, Microsoft, Informix, Sybase/SAP, MySQL,..>
  {GRAPH} <GQL>
WHERE
  <POINTER DEREFERENCING operator on REF types> <SQL3>
  <Multiset operators> like NEST/UNNEST <Multivalued attributes>
  <GRAPHS operators>, <MATRICES operators>, UDF/UDT, MAP/REDUCE
PARTITION BY like LIMIT/OFFSET, PIVOT/UNPIVOT <Splitting/Sharding>
MATCH <GQL>
GROUP BY / HAVING
  GROUPING SETS with CUBE, ML & DL operators
![Page 72: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/72.jpg)
Unifying THEORY for BIG SQL (and polystore or MULTI-MODEL)
« One of the main arguments… was the industry needs a common query language and data model to feed the ecosystem for key-value stores... We are looking forward to working with other industry leaders in the NoSQL space on taking the design to the next level. » Erik Meijer, Microsoft Research, CO-SQL in CACM 2011
« An effective mathematical model that encompasses the concepts of SQL, NoSQL and NewSQL would enable their interoperability » Jeremy Kepner (MIT, 2016)
Cf. the following research SEMINAR on this hot topic, whose framework is summarized here
![Page 73: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/73.jpg)
SQL, RDF, NoSQL and NewSQL on an example
➢ SQL : a set of rows within a table < STRUCTURED DATA>
SELECT *
FROM FLIGHT2
WHERE PilotName = 'Serge';

FLIGHT2
F#      PilotName   PlaneN
AF100   Serge       AirbusA320
AF110   Peter       B747
AF102   Serge       B747
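This query can be run as-is on the slide's table with Python's built-in sqlite3 module. A minimal sketch: the column types and the primary key are assumptions, and the column name F# has to be double-quoted because of the # character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE FLIGHT2 ("F#" TEXT PRIMARY KEY, PilotName TEXT, PlaneN TEXT)'
)
conn.executemany(
    "INSERT INTO FLIGHT2 VALUES (?, ?, ?)",
    [
        ("AF100", "Serge", "AirbusA320"),
        ("AF110", "Peter", "B747"),
        ("AF102", "Serge", "B747"),
    ],
)

# The result is a set of rows from one table; ORDER BY only fixes the display order.
serge_flights = conn.execute(
    'SELECT * FROM FLIGHT2 WHERE PilotName = ? ORDER BY "F#"', ("Serge",)
).fetchall()
```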
![Page 74: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/74.jpg)
SEMI-STRUCTURED DATA in RDF
TRIPLEs (OBJECT-PREDICATE-VALUE)
➢To describe WEB resources
(:Serge :insureflight :AF100)
(:Peter :insureflight :AF110)
(:AIRBUSA320 :isusedinflight :AF100)
(:Paul :ispasengeroflight :AF100) …
➢Note : one RDF triple <S, P, O> is a fact in first-order predicate logic : P(S, O), with P the predicate, S the subject and O the object
Example : insureflight(Serge, AF100)
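The triple model can be sketched as a set of (subject, predicate, object) tuples with wildcard pattern matching, which is the basic operation behind a SPARQL triple pattern. This is a toy illustration, not an RDF library: `match` is a hypothetical helper and `None` plays the role of a query variable.

```python
# The slide's triples as (subject, predicate, object) tuples.
triples = {
    (":Serge", ":insureflight", ":AF100"),
    (":Peter", ":insureflight", ":AF110"),
    (":AIRBUSA320", ":isusedinflight", ":AF100"),
    (":Paul", ":ispasengeroflight", ":AF100"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

about_af100 = match(o=":AF100")                        # everything known about AF100
insured_by_serge = match(s=":Serge", p=":insureflight")
```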
![Page 75: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/75.jpg)
RDFS graph (Example)
[Figure : RDF graph with nodes :Serge, :AF100, :AF102, AIRBUSA320 and Paul, and edges :insureflight, :isusedinflight, :ispasengeroflight and :drivesplane]
![Page 76: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/76.jpg)
NoSQL graph for unstructured data
[Figure : graph with nodes Serge, Peter, AF100, AF102, AF110, A320 and B747]
![Page 77: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/77.jpg)
NewSQL matrix
[Figure : matrix over flights AF100, AF110, AF102 and pilots Serge, Peter ; labels MT, V and MTV]
![Page 78: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/78.jpg)
Need for a simple theory to unify the major data types of a big data system
➢4 types of DATA with 3 theoretical frameworks
➢STRUCTURED (SQL) and SET theory (of VALUES/DATA)
➢SEMI STRUCTURED (SparQL, OWL) : GRAPH THEORY (inferences)
➢UNSTRUCTURED (NOSQL) : GRAPH THEORY
➢NEW SQL and MATRICES (linear algebra)
➢NOTE : DATA SCIENCE (ML and DL ) and MATRICES management
➢Some formal unifying proposals
  ➢Associative arrays from MIT (KEPNER 2016)
   (KEPNER 2016) Jeremy Kepner et al. : « Associative array model of SQL, NoSQL and NewSQL Data bases » <MIT CS and AI laboratory, 2016>
➢Another attractive theory both in terms of simplicity and implementation
  ➢CATEGORY theory (MEIJ2011)
   (MEIJ2011) « A co-Relational Model of Data for Large Shared Data Banks » Erik Meijer and Gavin Bierman, Microsoft Research, CACM 2011
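The associative array idea can be illustrated with a small sparse array keyed by arbitrary labels, supporting transpose, element-wise addition and array multiplication. This is a simplified sketch in the spirit of the proposal, assuming integer values and ordinary + and × ; the class `AssocArray` is hypothetical and is not the authors' implementation.

```python
from collections import defaultdict

class AssocArray:
    """Sparse 2-D array keyed by arbitrary (row, column) labels; missing entries are 0."""

    def __init__(self, entries=None):
        self.data = dict(entries or {})

    def transpose(self):
        return AssocArray({(c, r): v for (r, c), v in self.data.items()})

    def __add__(self, other):
        out = defaultdict(int)
        for src in (self.data, other.data):
            for key, v in src.items():
                out[key] += v
        return AssocArray(out)

    def __matmul__(self, other):
        out = defaultdict(int)
        for (r, k1), v1 in self.data.items():
            for (k2, c), v2 in other.data.items():
                if k1 == k2:
                    out[(r, c)] += v1 * v2
        return AssocArray(out)

# Flight-by-pilot incidence array built from the FLIGHT2 example.
M = AssocArray({("AF100", "Serge"): 1, ("AF110", "Peter"): 1, ("AF102", "Serge"): 1})
flights_per_pilot = (M.transpose() @ M).data
```

Here the diagonal of the product counts the flights of each pilot, which shows how a relational aggregate can be phrased as linear algebra over labelled keys.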
![Page 79: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/79.jpg)
Mathematics underlying big data management
| DATA type | Paradigm | Properties | Data model | Math theory | Data structures | Data ops | REF |
|---|---|---|---|---|---|---|---|
| Structured data (TRANSACTION oriented) | VALUE ; POINTER/VALUE | TIPS ; RICE | Codd's RM ; Object RM (DATE's 3rd Manifesto) | SET ; GRAPH | Relation/TABLE ; NF2, CLASS | Relational algebra | (Codd70) ; (DATE95) |
| Semi-structured DATA (SEARCH oriented) | PREDICATE/VALUE | WHAT | RDF data model | GRAPH | CLASS | | (RDF 98) |
| Unstructured DATA (analytics oriented) | KEY/VALUE & Graph | WHAT | Key/blob, Key/doc, Key/column | GRAPH | CLASS & DOCUMENT | NOMAD ALGEBRA (NA), Associative Array (AA) algebra | (Chang08) |
| NEWSQL (analytics oriented) | VALUE & Graph | | RM | (sparse) MATRICES | TABLES | NA & AA | (Cattel 10) |
| Polystore | / | / | D4 model | ARRAYS | arrays | AA | (Duggan15) |
![Page 80: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/80.jpg)
Big Questions ?
A new data world is in our hands/feet !
![Page 81: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/81.jpg)
Short Bibliography on Big Data
<In French>
➢Rudi BRUCHEZ « Les bases de données NO SQL et le Big Data » Eyrolles 2015
➢« BIG DATA et Machine learning », P. Lemberger et al, DUNOD, 2016
➢« Bases de Données » JL Hainaut, Dunod 2015
<In English>
➢Dan McCreary, Ann Kelly « Making sense of NO SQL » Manning 2014
➢Jordan Tigani, Siddartha Naidu « Google BigQuery Analytics » WILEY, 2014 (510 pages)
➢Ian Davis « 30 Minute Guide to RDF and Linked Data » 2009, SlideShare
➢Mike Stonebraker, « New SQL : An Alternative to NoSQL and Old SQL for New OLTP Apps » ACM, June 2011
➢S. Miranda, « Systèmes d'information Mobiquitaires » Revue RTSI, Sept 2011 and « THE ART AND SCIENCE OF BIG DATA » (2019)
➢W. CHU Editor « Data mining and knowledge Discovery for big data » Springer 2014
➢F. Provost, T. Fawcett « DATA SCIENCE for Business » O'Reilly 2013
![Page 82: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/82.jpg)
Research seminar : towards a unifying theory for BIG DATA management (and data lakes)
➢ Erik Meijer and Gavin Bierman « A co-Relational Model of Data for Large Shared Data Banks », Microsoft Research, CACM 2011
➢ M. Fokkinga, « SQL versus coSQL — a compendium to Erik Meijer’s paper » 2012.
➢ Jeremy Kepner et al. : « Associative array model of SQL, NoSQL and NewSQL Data bases » <MIT CS and AI laboratory, 2016>
➢ J. Kepner, J. Chaidez, V. Gadepally, and H. Jansen, « Associative arrays : Unified mathematics for spreadsheets, databases, matrices, and graphs » arXiv preprint arXiv:1501.05709, 2015.
➢ Gaetan Lescouflair, PhD thesis, 2019, University of Nice Sophia Antipolis (MBDS)
➢ J.Lu et al “Multi model databases and highly integrated polystores” Tutorial CIKM 2018
« Everything in Big Data management is a CATEGORY »
![Page 83: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/83.jpg)
Extra slides
![Page 84: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/84.jpg)
Data integration (Virtual DATA LAKE)
| Solution | Query language | Data model | Target Schema | Bridge Query | Data Sources |
|---|---|---|---|---|---|
| BigIntegrator [Zhu & Risch 2011] | SQL-like | Relational | LAV | Datalog | RDBMS and Bigtable (GQL) |
| SQL/MFR [Bondiombouy & al 2015] | SQL-like | Relational | GAV | MFR | RDBMS and Distributed Processing Framework (MapReduce and Spark) |
| NoSQL / Access path mapping [Curé & al 2011] | SQL | Relational | GAV | BQL | RDBMS, NoSQL |
| FORWARD Middleware / SQL++ [Ong & al 2014] | SQL++ | JSON based | GAV | - | RDBMS, NoSQL, NewSQL, SQL-on-Hadoop |
| CloudMdsQL [Kolev & al 2016] | SQL-like | JSON based | Schema-less | - | RDBMS, NoSQL, HDFS |
![Page 85: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/85.jpg)
Integrating R in Data Systems
| Solution | Execution model | External structure | Parallel process technique | Bridge | Execution platform |
|---|---|---|---|---|---|
| SQL Server + R [Berral & Poggi 2016] | In-DB | - | - | Stored procedure | Microsoft SQL Server |
| Spark R [Venkataraman & al 2016] | Cluster | Distributed data frame | Partitioned execution | Socket server based on Netty | Spark Workers |
| Big R [Yejas & al 2014] | Cluster | Proxy to an in-memory data frame, vector and list | MPP, partitioned execution | JaQL | IBM InfoSphere BigInsight (Hadoop version) |
| Rhipe [Oancea & Dragoescu 2014] | Cluster | - | MapReduce | Protocol Buffer | Hadoop |
| RHadoop [Oancea & Dragoescu 2014] | Cluster | - | MapReduce | R wrapper to Hadoop Streaming framework | Hadoop |
![Page 86: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/86.jpg)
Pre-processing for ML (DATA LAKE) or temporary data storage (Velocity) : relational data model
➢RELATIONAL ALGEBRA
➢MATRIX ALGEBRA (ex : Interface with R language)
➢WORKFLOW with different technologies/models
  ➢Programming patterns (like MAP REDUCE)
  ➢DATA distribution on parallel processors (instead of code distribution)
  ➢MAP REDUCE programs are application dependent
![Page 87: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/87.jpg)
BIG DATA in the CLOUD
➢ BIG QUERY from Google
➢ AZURE SQL from Microsoft
➢ TERADATA
➢ IMPALA (Cloudera), compatible with OLAP cubes and Big Data
➢ REDSHIFT (Amazon), based on Postgres
➢ PRESTO (Facebook)
➢ DRILL (open source version of... Dremel)
➢ SNOWFLAKE…
![Page 88: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/88.jpg)
Mathematical entities of a category(wikipedia)
A category C consists of the following mathematical entities :
➢A class ob(C), whose elements are called objects
➢A class hom(C), whose elements are called morphisms or maps or arrows
  ➢Each morphism f has a source object a and a target object b.
  ➢The expression f : a → b is read "f is a morphism from a to b".
  ➢The expression hom(a, b), alternatively written hom_C(a, b), mor(a, b) or C(a, b), denotes the hom-class of all morphisms from a to b.
➢A binary operation ∘, called COMPOSITION of morphisms, such that for any three objects a, b and c, we have hom(b, c) × hom(a, b) → hom(a, c).
The composition of f : a → b and g : b → c is written g ∘ f or gf, and is governed by two axioms :
ASSOCIATIVITY : if f : a → b, g : b → c and h : c → d, then h ∘ (g ∘ f) = (h ∘ g) ∘ f, and
IDENTITY : for every object x, there exists a morphism 1_x : x → x, called the identity morphism for x, such that for every morphism f : a → b, we have 1_b ∘ f = f = f ∘ 1_a.
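The two axioms can be checked on a concrete case, taking Python types as objects and functions as morphisms. This is an illustration on a few sample values, not a proof; the functions `f`, `g`, `h` and the helper `compose` are hypothetical examples.

```python
def compose(g, f):
    """Return g ∘ f: apply f first, then g."""
    return lambda x: g(f(x))

def identity(x):
    return x

# Three hypothetical morphisms f: int -> int, g: int -> str, h: str -> int.
f = lambda n: n + 1
g = lambda n: "x" * n
h = len

samples = range(5)
# ASSOCIATIVITY: h ∘ (g ∘ f) = (h ∘ g) ∘ f
associative = all(
    compose(h, compose(g, f))(n) == compose(compose(h, g), f)(n) for n in samples
)
# IDENTITY: 1_b ∘ f = f = f ∘ 1_a
unital = all(
    compose(identity, f)(n) == f(n) == compose(f, identity)(n) for n in samples
)
```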
![Page 89: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/89.jpg)
MORPHISMS (arrows in Category)
➢Relations among morphisms (such as fg = h) are often depicted using commutative diagrams, with "points" (corners) representing objects and "arrows" representing morphisms.
➢Morphisms can have any of the following properties. A morphism f : a → b is a :
➢monomorphism (or monic) if f ∘ g1 = f ∘ g2 implies g1 = g2 for all morphisms g1, g2 : x → a.
➢epimorphism (or epic) if g1 ∘ f = g2 ∘ f implies g1 = g2 for all morphisms g1, g2 : b → x.
➢bimorphism if f is both epic and monic.
➢isomorphism if there exists a morphism g : b → a such that f ∘ g = 1_b and g ∘ f = 1_a.
➢endomorphism if a = b. end(a) denotes the class of endomorphisms of a.
➢automorphism if f is both an endomorphism and an isomorphism. aut(a) denotes the class of automorphisms of a.
➢retraction if a right inverse of f exists, i.e. if there exists a morphism g : b → a with f ∘ g = 1_b.
➢section if a left inverse of f exists, i.e. if there exists a morphism g : b → a with g ∘ f = 1_a.
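Several of these properties become concrete in the category of finite sets, where monic means injective and epic means surjective. The following sketch assumes that setting and represents a morphism as a Python dict from source elements to target elements. The sets `A`, `B` and the maps `s`, `r` are made-up examples.

```python
def compose(g, f):
    """Return g ∘ f for dict-encoded maps: apply f first, then g."""
    return {x: g[f[x]] for x in f}

def identity(obj):
    """Identity morphism 1_obj on a finite set."""
    return {x: x for x in obj}

def is_monic(f):
    """In finite sets, monic means injective: no two inputs share an image."""
    images = list(f.values())
    return len(set(images)) == len(images)

def is_epic(f, codomain):
    """In finite sets, epic means surjective: every target element is hit."""
    return set(f.values()) == set(codomain)

A = {1, 2}
B = {"a", "b", "c"}

s = {1: "a", 2: "b"}             # s : A -> B, injective but not surjective
r = {"a": 1, "b": 2, "c": 2}     # r : B -> A, surjective but not injective

# r ∘ s = 1_A, so s is a section (it has a left inverse r)
# and r is a retraction (it has a right inverse s).
is_section_pair = compose(r, s) == identity(A)

# s ∘ r is not 1_B, so this pair does not exhibit an isomorphism.
is_iso_pair = is_section_pair and compose(s, r) == identity(B)

print(is_monic(s), is_epic(s, B), is_monic(r), is_epic(r, A))
print(is_section_pair, is_iso_pair)
```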
![Page 90: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/90.jpg)
FUNCTORS (Category)
➢FUNCTORS are structure-preserving morphisms between categories.
➢A (covariant) functor F from a category C to a category D, written F : C → D, consists of :
➢for each object x in C, an object F(x) in D ; and
➢for each morphism f : x → y in C, a morphism F(f) : F(x) → F(y),
➢such that the following two properties hold :
➢For every object x in C, F(1_x) = 1_F(x) ;
➢For all morphisms f : x → y and g : y → z, F(g ∘ f) = F(g) ∘ F(f).
➢A contravariant functor F : C → D is like a covariant functor, except that it "turns morphisms around" ("reverses all the arrows").
➢More specifically, every morphism f : x → y in C must be assigned to a morphism F(f) : F(y) → F(x) in D. In other words, a contravariant functor acts as a covariant functor from the opposite category C^op to D.
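A familiar programming instance of a covariant functor is the "list" construction, which sends a type x to lists of x and a function f to "apply f to every element". The sketch below assumes that reading; `fmap` and `contramap` are hypothetical helper names. It checks the two functor laws on sample data and also shows a contravariant example: predicates on y are pulled back along f : x → y to predicates on x.

```python
def compose(g, f):
    """Return g ∘ f: apply f first, then g."""
    return lambda x: g(f(x))

def identity(x):
    return x

def fmap(f):
    """Covariant list functor on morphisms: f : x -> y becomes F(f) : list[x] -> list[y]."""
    return lambda xs: [f(x) for x in xs]

def contramap(f):
    """Contravariant predicate functor: f : x -> y becomes F(f) : pred[y] -> pred[x]."""
    return lambda p: compose(p, f)

f = lambda n: n * 2          # f : int -> int
g = lambda n: str(n)         # g : int -> str
xs = [1, 2, 3]

# Law 1: F(1_x) = 1_F(x)
identity_law = fmap(identity)(xs) == identity(xs)

# Law 2: F(g ∘ f) = F(g) ∘ F(f)
composition_law = fmap(compose(g, f))(xs) == compose(fmap(g), fmap(f))(xs)

# Contravariant case: arrows are reversed, so F(g ∘ f) = F(f) ∘ F(g)
longer_than_one = lambda s: len(s) > 1          # a predicate on str
lhs = contramap(compose(g, f))(longer_than_one)
rhs = compose(contramap(f), contramap(g))(longer_than_one)
contravariant_law = all(lhs(n) == rhs(n) for n in [1, 5, 50])

print(identity_law, composition_law, contravariant_law)
```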
![Page 91: From data bases to big data (7 lectures) MBDS graduate coursembds-fr.org/wp-content/uploads/2019/03/lectures/l7.pdf · «From data bases to big data » (7 lectures) MBDS graduate](https://reader036.vdocuments.mx/reader036/viewer/2022070719/5edf8a88ad6a402d666ae12b/html5/thumbnails/91.jpg)
« NATURAL TRANSFORMATION » (Category)
Note : Historical rationale for CATEGORY !➔ Functors
A natural transformation is a relation between two functors. Functors often describe "natural constructions", and natural transformations then describe "natural homomorphisms" between two such constructions. Sometimes two quite different constructions yield "the same" result; this is expressed by a natural isomorphism between the two functors.
If F and G are (covariant) functors between the categories C and D,
then a natural transformation η from F to G associates to every object X in C a morphism ηX : F(X) → G(X)
in D such that for every morphism f : X → Y in C, we have ηY ∘ F(f) = G(f) ∘ ηX; this means that the following diagram is commutative:
[Diagram: commutative square with F(f) : F(X) → F(Y) on top, G(f) : G(X) → G(Y) on the bottom, and ηX, ηY as the vertical arrows.]
The two functors F and G are called naturally isomorphic if there exists a natural transformation η from F to G such that ηX is an isomorphism for every object X in C.
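The naturality condition ηY ∘ F(f) = G(f) ∘ ηX can be illustrated with list-based functors. This sketch makes two assumptions that are not in the lecture: F is the list functor, and G is a "maybe" functor encoded as a list with at most one element, so both share the same `fmap`. The names `safe_head` and `reverse` are hypothetical.

```python
def fmap(f):
    """Morphism part of both functors here: apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]

def safe_head(xs):
    """Component of η at any object: keep the first element if there is one."""
    return xs[:1]

def reverse(xs):
    """Component of a second transformation from the list functor to itself."""
    return xs[::-1]

f = lambda n: n * 2
samples = [[], [7], [1, 2, 3]]

# Naturality of safe_head: η_Y ∘ F(f) = G(f) ∘ η_X, checked on the samples
head_natural = all(
    safe_head(fmap(f)(xs)) == fmap(f)(safe_head(xs)) for xs in samples
)

# Naturality of reverse, checked the same way
reverse_natural = all(
    reverse(fmap(f)(xs)) == fmap(f)(reverse(xs)) for xs in samples
)

# Every component of reverse is its own inverse, hence an isomorphism,
# so reverse is a natural isomorphism from the list functor to itself.
reverse_is_iso = all(reverse(reverse(xs)) == xs for xs in samples)

print(head_natural, reverse_natural, reverse_is_iso)
```

By contrast, `safe_head` is natural but not a natural isomorphism, since it discards every element after the first.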