
1/19


Big Data: Data Analysis Boot Camp
Non-SQL and R

Chuck Cartledge, PhD

31 March 2019

© Old Dominion University

2/19


Table of contents (1 of 1)

1 Intro.

2 Non-SQL DBMS
    Classic Non-SQL databases

3 Hands-on
    Airport connections as a graph database
    Summary
        Strengths and weaknesses
        Applicabilities

4 Q & A

5 Conclusion

6 References

7 Files


3/19


What are we going to cover?

1 Brief overview of different Non-SQL technologies

2 Revisit our airport service data

3 Ask, and answer, some questions about the airports


4/19


Classic Non-SQL databases

Words from the past.

Bring up the attached Polyglot Persistence presentation. We’ll be looking at pages 1 – 22.

Attached file.


5/19



Same image.

Attached file.


6/19



Finding a “friend of a friend”

A common question is: who is a friend of a friend?

It comes up in all sorts of relationship-type questions, not only interpersonal but also organizational, system analysis, law, etc.

Easily answered in some languages, harder in others.

Image from [1].
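As a rough illustration of the question above, the sketch below stores a small, made-up friendship graph as an R adjacency list and collects the friends of friends for one person. The names and the helper `friends_of_friends()` are hypothetical; nothing here is taken from the attached scripts or from [1].

```r
# Hypothetical friendship graph stored as an adjacency list (a named list).
friends <- list(
  alice = c("bob", "carol"),
  bob   = c("alice", "dave"),
  carol = c("alice", "dave", "erin"),
  dave  = c("bob", "carol"),
  erin  = c("carol")
)

# Return the people two hops away from `person`:
# friends of friends who are neither the person nor a direct friend.
friends_of_friends <- function(graph, person) {
  direct <- graph[[person]]
  two_hops <- unique(unlist(graph[direct], use.names = FALSE))
  sort(setdiff(two_hops, c(person, direct)))
}

fof <- friends_of_friends(friends, "alice")
print(fof)
```

In a relational layout the same question usually needs a self-join on a friendship table, which is part of why it reads more naturally as a graph traversal.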


7/19



Same image.

Image from [1].

8/19


Airport connections as a graph database

Revisit our airport data.

We’re going to look at the airport data in a different way.

Airports become nodes (or vertices)

Services become edges (or arcs)

Load the attached file: chapter-06-nosql-R.R
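To make the node and edge view concrete, here is a minimal base R sketch that derives both sets from a few origin/destination records. The rows and airport codes are made up for illustration; only the column names `ORIGIN` and `DEST` echo the ones used in the attached script, and no database is involved.

```r
# Made-up origin/destination records standing in for the airport service data.
flights <- data.frame(
  ORIGIN = c("ORF", "ORF", "ATL", "ATL", "ORF"),
  DEST   = c("ATL", "CLT", "ORF", "CLT", "ATL"),
  stringsAsFactors = FALSE
)

# Nodes: every airport that appears as an origin or a destination.
nodes <- sort(unique(c(flights$ORIGIN, flights$DEST)))

# Edges: one directed edge per distinct (origin, destination) pair.
edges <- unique(flights[, c("ORIGIN", "DEST")])

print(nodes)
print(edges)
```

Repeated services between the same pair of airports collapse into a single edge here; keeping a count per pair would be one way to weight the edges instead.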


9/19



Overview of the program

1 By default the database is reset each time main() is executed

main(resetDB = TRUE)

...

if (resetDB == TRUE)

{d

10/19



A few database initialization details

1 Need to ensure that Airport nodes are unique

addConstraint(graph, "Airport", "name")

2 Create the airport location file and load a subset into the database

createTextFile(airportLocationFile, airportLocationURL, overwrite=TRUE)

temp

11/19




1 Resulting in:

[1] "Creating airport info nodes -- Dumping the object: system.time(cypher(graph, command)) (of type: double, class: proc_time)"

   user  system elapsed
  0.008   0.004   0.675

2 The origin and destination data are loaded and cleaned

unzip(flightDataZipFileName, files=flightDataFileName, exdir=tempDir)

unzipFileName

12/19




1 Chunks of data are loaded

for (i in 1:(length(chunks) - 1)) {
    ## Header line, then one "ORIGIN,DEST" line per row of this chunk.
    write(x=c('"src","dest"',
              paste0(df$ORIGIN[chunks[i]:(chunks[i+1] - 1)], ",",
                     df$DEST[chunks[i]:(chunks[i+1] - 1)])),
          file=tempFile)

command
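The chunked loading idea can be tried without a database. The sketch below is a self-contained approximation under stated assumptions: the data frame, the `chunkSize` value, and the `rowsWritten` counter are made up, and the step that would hand each file to the database is left out entirely.

```r
# Made-up data standing in for the cleaned origin/destination data frame.
df <- data.frame(
  ORIGIN = c("ORF", "ATL", "CLT", "ORF", "JFK", "ATL", "CLT"),
  DEST   = c("ATL", "ORF", "ATL", "CLT", "ORF", "JFK", "ORF"),
  stringsAsFactors = FALSE
)

chunkSize <- 3  # assumed value, chosen only to force several chunks

# Chunk boundaries: the first row of each chunk, plus one past the last row.
chunks <- unique(c(seq(1, nrow(df), by = chunkSize), nrow(df) + 1))

rowsWritten <- 0
for (i in 1:(length(chunks) - 1)) {
  rows <- chunks[i]:(chunks[i + 1] - 1)
  tempFile <- tempfile(fileext = ".csv")
  # Header line followed by one "src,dest" line per row in this chunk.
  write(x = c('"src","dest"', paste0(df$ORIGIN[rows], ",", df$DEST[rows])),
        file = tempFile)
  # Count the data lines just written (everything except the header).
  rowsWritten <- rowsWritten + length(readLines(tempFile)) - 1
  unlink(tempFile)
}
print(rowsWritten)
```

Working in chunks keeps each intermediate file small, so a single oversized load does not have to fit in memory or in one transaction.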

13/19



Ways to modify the Airport program.

A Cypher¹ statement or R² can be used to query or modify the database, and R can be used for the numeric heavy lifting.

Use distance between airports as a metric to find the “diameter” of the graph.

Find the connectedness (degree) distribution of the airports.

Use an airport’s connections (degree) to identify the “most important” airport (it may not be the one with the highest degree).

Find the path between “interesting” airports, and then remove an airport along the path. Is there another path from the source to the destination?

Update the missing location information.

¹ https://neo4j.com/docs/developer-manual/current/cypher/

² ls("package:RNeo4j")
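Two of the suggested modifications, the degree distribution and the "remove an airport and look for another path" question, can be prototyped in base R before touching the database. This is a minimal sketch on a made-up edge list; `hasPath()` is a hypothetical helper (a plain breadth-first search), not a function from RNeo4j or the attached script.

```r
# Made-up directed service edges between airports.
edges <- data.frame(
  src  = c("ORF", "ORF", "ATL", "CLT", "JFK"),
  dest = c("ATL", "CLT", "JFK", "JFK", "ORF"),
  stringsAsFactors = FALSE
)

# Degree of each airport: number of edges touching it (inbound plus outbound).
degree <- table(c(edges$src, edges$dest))

# Degree distribution: how many airports have each degree.
degreeDistribution <- table(as.vector(degree))

# Airports with the highest degree.
mostConnected <- names(degree)[degree == max(degree)]

# Breadth-first search: is `to` reachable from `from` once the airports
# listed in `removed` (and their edges) are taken out of the graph?
hasPath <- function(edges, from, to, removed = character(0)) {
  edges <- edges[!(edges$src %in% removed | edges$dest %in% removed), ]
  visited <- from
  frontier <- from
  while (length(frontier) > 0) {
    nextStops <- unique(edges$dest[edges$src %in% frontier])
    frontier <- setdiff(nextStops, visited)
    visited <- c(visited, frontier)
  }
  to %in% visited
}

print(degreeDistribution)
print(hasPath(edges, "ORF", "JFK", removed = "ATL"))
print(hasPath(edges, "ORF", "JFK", removed = c("ATL", "CLT")))
```

On the real data the edge list would come from a query against the graph database, with R doing the counting and searching.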

14/19


Summary

Good and not so good

Strengths:

A graph database: typeless, schemaless, unstructured relationships

Large capacity (~34.4 billion nodes and relationships)

ReSTful interfaces, which means support from lots of different languages

Weaknesses:

Graph terminology is not consistent: node vs. vertex, arc vs. edge, etc.

Sharding is not supported

Licensing may be an issue for production applications


15/19


Summary

Good for, and not so good for

Good fit:

Anything that can be represented as a “social graph”

Any “link rich” domain

Routing, dispatch, and location-based services (getting from A to B)

Recommendation engines (“also bought” statements)

Not so good fit:

When updating “all” items in a DB (requires total graph traversal)


16/19


Q & A time.

Q: How many marketing people does it take to change a light bulb?

A: I’ll have to get back to you on that.


17/19


What have we covered?

Talked about different types of No-SQL database technologies and what they are good for

“Played” with the airport service data as a graph database

Asked and answered some questions geared towards graph database technology

Next: Looking at crime data


18/19


References (1 of 1)

[1] Marko A. Rodriguez, Problem-Solving using Graph Traversals, https://www.slideshare.net/slidarko/problemsolving-using-graph-traversals-searching-scoring-ranking-and-recommendation/88-Searching_Friends_SQLMySQL_vs_GremlinNeo4jWhat, 2010.

19/19


Files of interest

1 Neo4J Airport connection script

2 R library script file

3 Polyglot persistence (a PDF presentation)

4 Making spinnable globes with airport data

5 Code snippets


rm(list=ls())

## http://nick.readthedocs.io/en/latest/Big_Data/neo4j_examples/

## https://neo4j.com/docs/developer-manual/current/cypher/

## https://neo4j.com/docs/operations-manual/current/configuration/file-locations/

options(java.parameters = "-Xmx8192m")

library(RNeo4j)
library(sp)
library(rworldmap)
library(rworldxtra)

source("library.R")

source("iataParsing.R")

source("chapter-06-library.R")

main

1/37


CS-695 NoSQL Database

Polyglot Persistence; Or, The Many Ways We Store Data

Dr. Chuck Cartledge

27 Aug. 2015

2/37


Table of contents I

1 A little history

2 A change in the air

3 Database layouts

4 CRUDy stuff

5 Databases that I/we use

6 Conclusion

7 References

3/37


Hammer and nails . . .

“. . . it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

Abraham H. Maslow [8]

4/37


Miscellanea

Origin of “polyglot . . . ”

Popularized by Neal Ford [4]:

Talked about software development

How things are evolving (SQL, XML, .NET, etc.)

How multi-threading is hard (concurrency, coordination, etc.)

Promoted the idea of enterprise development via Java and .NET

Take away: choose the right tool for the job.

Different languages will continue to exist because each is good at something and all are necessary.

5/37


Miscellanea

The world BC (Before Codd).

Databases existed before Edgar Codd.

Hierarchical approach – alive and well in our file system

Network approach – currently underpinning ideas for graph databases

These suffered because people had to know lots of details about how the database was implemented.

6/37


Miscellanea

The world after Codd.

Separate representation from implementation

Changes in the database for optimization needn’t affect data queries

User interactions aren’t cluttered by “construction noise” (including indexing and sorting)

Codd’s relational data bank hides all implementation information.

Relational database management systems (RDBMS) hide information about how data is stored. The data language is independent of how data is stored [3].

7/37


Miscellanea

The world according to RDBMS.

Everything is neat and tidy

Everything can be defined in a set of tables that have relationships between them

If you make the database large enough, you can store anything and ask any question

Image from [10].

RDBMS reigned supreme for 30 to 40 years (starting in 1970). And then reality and Big Data started to hit.

8/37


How we turned and started to get to now.

And then things started changing.

Can’t point a finger at a specific incident, might be a critical mass.

The Internet made it easier to collect data.

A new generation of people thought about things in a different way.

The new data had three attributes: velocity, volume, variety [7].

New ways of looking at data encouraged new questions.

People wanted answers faster.

Many of these items couldn’t be supported by an RDBMS.

9/37


Make things faster.

Simple and complex ways

How to get more processing power to answer database questions?? Basically:

Scale up – buy a faster CPU and more RAM

Scale out – buy more CPUs and get them to work in parallel

Scaling up with custom CPUs gets expensive very, very quickly.

Image from [9].

Commodity CPUs are almost a dime a dozen. Leading to clusters, network services, distributed applications, etc.

10/37


Make things faster.

Amdahl’s Law [1]

Division and measurement of serial and parallel operations appears time and again. (Shades of Mandelbrot.)

“Make the common fast.”

“Make the fast common.”

Understand what parts have to be done serially.

Understand what parts can be done in parallel.

Need to factor in “overhead” costs when computing speed up.

11/37


Make things faster.

Amdahl’s Law (A summary)

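Time for serial execution $\overset{\text{def.}}{=} T(1)$

Portion that can NOT be parallelized $\overset{\text{def.}}{=} B \in (0, 1]$

Number of parallel resources $\overset{\text{def.}}{=} n$

$$T(n) = T(1)\left(B + \frac{1}{n}(1 - B)\right)$$

Speed-up $\overset{\text{def.}}{=} S(n)$

$$S(n) = \frac{T(1)}{T(n)} = \frac{1}{B + \frac{1}{n}(1 - B)}$$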

Dr. Gene Amdahl (circa 1960)
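A quick numeric check of the speed-up formula, written as a small R function. The serial fraction of 10% and the resource counts are assumed values picked for illustration, and `amdahlSpeedup()` is a hypothetical helper name.

```r
# Speed-up S(n) = 1 / (B + (1 - B) / n), where B is the fraction of the
# work that must run serially and n is the number of parallel resources.
amdahlSpeedup <- function(B, n) {
  stopifnot(B > 0, B <= 1, n >= 1)
  1 / (B + (1 - B) / n)
}

# Assumed example: 10% of the work is serial.
B <- 0.10
n <- c(1, 2, 10, 100, 1000)
speedup <- amdahlSpeedup(B, n)
print(round(speedup, 2))

# However many resources are added, the speed-up stays below 1 / B.
print(1 / B)
```

The ceiling of 1 / B is the practical message: past some point, adding hardware buys very little unless the serial portion shrinks. The formula also ignores the “overhead” costs mentioned earlier, which would lower the real speed-up further.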

12/37


The questions changed.

We knew that we didn’t know.

Our questions and our data changed. RDBMS had limitations:

Supported ad hoc questions on predefined data

Didn’t support undefined or unstructured data

Could scale up, not out, so database size was practically limited

SQL predicate calculus made logic awkward

RDBMS are very, very good at some things, but user needs were changing.

13/37



What happens when we ask a different question??

When the RDBMS database was designed, we thought we knew what we wanted to know. That was then.

Now if we want to look at family relationships (parent, child, sibling, extended family, etc.)

We can add a column to the table for up/down relationships

We can add a column for side-to-side relationships

We can add a column for extended family relationships

The database doesn’t look like how we think about the problem.

When the data representation doesn’t match how we think, then something has to change.

14/37


A collection of different database layouts.

A RDBMS

Can add well-formed data easily

Difficult to add new data fields or types

Each row is expected to have the same data

Supports unknown (ad hoc) queries well

Scales up, not out

Popular RDBMS: Oracle, MySQL, MS SQL Server, PostgreSQL

The “King of the World” for a very long time. (A version lives in your phone.)

15/37



A columnar database

Takes the idea of a row-oriented database and turns it on its side.

Can add new columns easily

Each row can use different columns

Scales up and out

Popular column-oriented databases: IBM DB2, Sybase IQ, Teradata

Image from [2].
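The difference between the two layouts can be mimicked with plain R lists. This is only a conceptual sketch with made-up records; it says nothing about how any of the products named above store data internally.

```r
# The same three made-up records stored two ways.

# Row-oriented: one complete record per list element.
rowStore <- list(
  list(id = 1, city = "Norfolk", sales = 10),
  list(id = 2, city = "Atlanta", sales = 25),
  list(id = 3, city = "Norfolk", sales = 5)
)

# Column-oriented: one vector per column.
columnStore <- list(
  id    = c(1, 2, 3),
  city  = c("Norfolk", "Atlanta", "Norfolk"),
  sales = c(10, 25, 5)
)

# Totalling one column: the row store has to visit every record ...
totalFromRows <- sum(vapply(rowStore, function(r) r$sales, numeric(1)))

# ... while the column store reads a single vector.
totalFromColumns <- sum(columnStore$sales)

# Adding a new column is a single assignment in the column store.
columnStore$region <- c("East", "South", "East")

print(c(totalFromRows, totalFromColumns))
```

Queries that aggregate a few columns over many rows tend to favor the column layout, while fetching or inserting whole records favors the row layout.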

16/37



A Key-Value design

A number (called the key) locates all other data (the value[s]).

Use math on some data (may be more than one piece)

The math (hash function) returns one value (the key)

Use the key to find the rest of the data

Locating data can be fast

Hash function should return unique values

Popular Key-Value DBMS: Redis, Memcached, Amazon DynamoDB, Riak

Key-value databases are fast when using the hash function. Not so fast if you aren’t.
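The hashing idea can be shown with a deliberately naive hash function in R. Everything below is an assumption made for illustration: `toyHash()`, `put()`, and `get_value()` are hypothetical helpers, the airport names are sample values, and real key-value stores use far better hash functions. The slide calls the hash result the key; the sketch calls it a bucket number so it is not confused with the lookup string.

```r
# Toy hash function (not a production hash): add up the character codes
# of the lookup string and fold the sum into a bucket number.
toyHash <- function(key, nBuckets = 8) {
  sum(utf8ToInt(key)) %% nBuckets + 1
}

buckets <- vector("list", 8)

# Store a value: hash the string, then file the value in that bucket.
put <- function(buckets, key, value) {
  b <- toyHash(key, length(buckets))
  if (is.null(buckets[[b]])) buckets[[b]] <- list()
  buckets[[b]][[key]] <- value
  buckets
}

# Retrieve a value: hash the string again and look only in that one bucket.
get_value <- function(buckets, key) {
  buckets[[toyHash(key, length(buckets))]][[key]]
}

buckets <- put(buckets, "ORF", "Norfolk International")
buckets <- put(buckets, "ATL", "Hartsfield-Jackson Atlanta International")
print(get_value(buckets, "ORF"))

# "ab" and "ba" collide: different strings, same bucket number.
print(toyHash("ab") == toyHash("ba"))
```

The collision at the end is why the hash function should return unique values: every collision forces extra searching inside a bucket, and searching on anything other than the hashed field means scanning all buckets.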

17/37



An Online Analytical Processing (OLAP) design

A way to visualize and analyze data using a “data cube” and basic functions.

Basic functions:

1 Consolidation (roll-up) of the multi-dimensional data

2 Drill-down into the data

3 Slicing and dicing

Fast execution time

Incorporates aspects of navigational, hierarchical, and relational databases

Popular OLAP databases: Hyperion Solutions, Cognos, MicroStrategy, Applix

Image from [15].

Target users are business analysts and business process management.
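The basic cube functions map fairly directly onto operations on a multi-dimensional R array. The cube below is a made-up example (sales by region, product, and year, filled with the numbers 1 to 12), intended only to show what each operation does.

```r
# A tiny made-up data cube: sales by region, product, and year.
cube <- array(
  1:12,
  dim = c(2, 3, 2),
  dimnames = list(
    region  = c("East", "West"),
    product = c("A", "B", "C"),
    year    = c("2018", "2019")
  )
)

# Consolidation (roll-up): sum over products, keeping region (1) and year (3).
byRegionYear <- apply(cube, c(1, 3), sum)

# Drill-down: look inside one roll-up cell at the product level.
eastIn2019 <- cube["East", , "2019"]

# Slice: fix one dimension (year) and keep the other two.
slice2019 <- cube[, , "2019"]

# Dice: keep a sub-cube by restricting several dimensions at once.
dice <- cube[, c("A", "B"), "2019", drop = FALSE]

print(byRegionYear)
print(eastIn2019)
```

The drill-down values add back up to the corresponding roll-up cell, which is the consistency an analyst relies on when moving between levels of detail.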

18/37



A Graph design

A very different way to think about data.

Consists of two parts:

1 Nodes (something that exists as an entity in the database)

2 Arcs (something that describes a relationship between nodes)

You can have nodes without arcs. You cannot have arcs without nodes. Arcs can be unidirectional.

Popular graph databases: Neo4j, OrientDB, Titan, Giraph

Image from [6].

Questions are driven by the relationships between nodes rather than by the nodes themselves.

19/37



A document design

Document-oriented databases can be “viewed,” and can have internal document databases (recursively).

Database is organized based on “tags”

A tag’s meaning is instance dependent

Tags can be nested (recursively)

Database structure may be XML-based and represented in different ways

Popular document databases: MongoDB, CouchDB, Couchbase, MarkLogic

Sometimes document databases show up in unexpected places.
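Nested, tag-based documents are easy to imitate with nested R lists. The document below is entirely made up, and `collectTags()` is a hypothetical helper that walks the nesting recursively; it is meant only to illustrate the "tags within tags" idea, not any particular document database.

```r
# A made-up "document": tags (names) with values, nested to any depth.
doc <- list(
  title  = "Polyglot Persistence",
  author = list(name = "Example Author", affiliation = "Example University"),
  tags   = list("nosql", "databases"),
  sections = list(
    list(heading = "History", pages = 4),
    list(heading = "Layouts", pages = 6, notes = "only this section has notes")
  )
)

# Walk the document recursively and collect every tag name that appears.
collectTags <- function(x) {
  if (!is.list(x)) return(character(0))
  nested <- unlist(lapply(x, collectTags), use.names = FALSE)
  unique(c(names(x), nested))
}

allTags <- collectTags(doc)
print(allTags)
```

Note that the two sections do not share the same set of tags, something a fixed table layout would not allow without adding a mostly empty column.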

20/37


Which design to use?

If I had a hammer, . . .

Questions to ask:

1 How much data will be in the database??

2 Will I be reading mostly??

3 Will I be writing mostly??

4 How accurate must the data be??

5 How many simultaneous readers and writers??

6 How robust/resilient must the database be??

7 How will the database be accessed??

8 What about ACID vs. BASE??

So many choices.

21/37



ACID vs. BASE

One is a design principle, the other is counter marketing.

ACID [5]

1 A – Atomicity - all or nothing

2 C – Consistency - database is always valid

3 I – Isolation - concurrent operations give the same result as serial operations

4 D – Durability - the database is written to disk

A database action will complete completely.

BASE [12]

1 BA – Basically Available

2 S – Soft state - user guarantees consistency

3 E – Eventually consistent

A database action will probably complete eventually.

ACID comes with SQL. BASE comes with NoSQL.

22/37



Consistency, Availability, Partition tolerance (CAP) Theorem

Sharing data in distributed systems is hard.

Data can be consistent across the system

Data can be available across the system

The system can continue to function if partitioned/split

You only get to choose two.

Image from [17].

RDBMS on a single machine means partition is undefined. Distributed systems only get two.

23/37


Create — darkness was on the face of the deep.

Ex nihilo nihil fit (out of nothing, nothing comes).

The CRUD approach doesn’t say what happened before the C.

RDBMS

CREATE DATABASE db_name;

CREATE TABLE table_name (column_name1 data_type(size), column_name2 data_type(size), . . . );

Columnar

CREATE DATABASE

CREATE table_name, column_name1, column_name2, ...;

Key-Value, Graph, Document

CREATE DATABASE

CREATE table_name

Graph, Document

CREATE DATABASE

Image from [11].

Implementation agnostic.

24/37



Create an entry

RDBMS
INSERT INTO table_name VALUES (value1, value2, value3, ...);

Columnar
PUT table_name, row_name, column_name1:, “value”;

Key-Value
ADD table_name, key_value, value;

Graph
CREATE relationship_name, vertex_name1, vertex_name2

Document
INSERT table_name (GML/XML/JSON “marked up” data)

25/37


Report — databases aren’t much good if you can’t get stuff out.

Report/retrieve an entry from the database

RDBMS
SELECT column_name, column_name FROM table_name;

Columnar
GET table_name, row_name1:, column_name:;

Key-Value
GET table_name, key_value;

Graph (pipe operations)
GET VERTEX|EDGE FILTER(expression) (. . . )

Document
FIND document_id

26/37


Update — things change.

Update an entry

RDBMS

UPDATE table_name SET column1=value1, column2=value2, ... WHERE some_column=some_value;

Columnar

DELETE FROM table_name WHERE [expression];

PUT table_name, row_name, column_name1:, “value”;

Key-Value

SET table_name, key_value, value;

Graph

GET VERTEX|EDGE FILTER(expression) (. . . ) REMOVE property ADD property

Document

UPDATE document_id value (same format as CREATE)

27/37


Delete — to remove that which once was.

Delete an entry

RDBMS
DELETE FROM table_name WHERE some_column=some_value;

Columnar
DELETE FROM table_name WHERE [expression];

Key-Value
DROP table_name, key_value;

Graph
GET VERTEX|EDGE FILTER(expression) (. . . ) REMOVE

Document
REMOVE document_id value
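The four CRUD steps can be walked through end to end with an R environment acting as a stand-in for a key-value store. This is an assumption made for illustration (no database server is involved), and the session key and cart contents are made up.

```r
# An R environment used as a stand-in for a key-value store.
store <- new.env(hash = TRUE)

# Create: add an entry under a key.
assign("session-42", list(cart = c("book")), envir = store)

# Report: fetch the entry by its key.
entry <- get("session-42", envir = store)

# Update: overwrite the value stored under the same key.
entry$cart <- c(entry$cart, "lamp")
assign("session-42", entry, envir = store)
cartAfterUpdate <- get("session-42", envir = store)$cart

# Delete: remove the key and its value.
rm(list = "session-42", envir = store)
stillThere <- exists("session-42", envir = store, inherits = FALSE)

print(cartAfterUpdate)
print(stillThere)
```

As in the key-value pseudo-syntax, an update is just another write to the same key; there is no notion of changing one field in place.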

28/37


Lots and they are hidden.

Shopping as an example

Firefox – SQLite for browser history

Shopping cart – Key-Value based on session ID

Recommended purchases – graph database

Credit card payment – SQL database

Excel record purchase – document

Save Excel file – hierarchical database

29/37


A continuum.

Things from a 50,000 foot perspective

[Figure: database types placed along two axes. Data axis: messy to neat and tidy. Queries axis: rigid to ad-hoc. Types shown: free text, K-V, Doc., OLAP, Col., RDBMS.]

30/37


A continuum.

Notional strengths and weaknesses

[Table: database types (RDBMS, K-V, Col., Doc., Graph) compared on ACID, BASE, ad-hoc queries, ∆ hardware, and hardware failure. Legend: supported; not supported by data model; no statement.]

31/37


Where can I get these things??

Popular open source databases

RDBMS – MySQL, PostgreSQL, SQLite

Key-Value – Redis, Memcached, Riak

Columnar – HBase, Accumulo, Hypertable

Document – MongoDB, CouchDB, Couchbase

Graph – Neo4j, OrientDB, Titan

Image from [16].

Open source does not mean free; your time costs money.

32/37


In summary . . .

What can we say??

1 Each type of database design fills a specific need/niche.

2 Each type could do the work of the others

1 Each type has a data model tailored to its problem domain

2 Performance is tied to the hardware (CPU and I/O)

RDBMS has been the King for a long time. Expect it to remain so due to inertia.

33/37


NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence

by Sadalage and Fowler [14].

Book to be used and referred to during the course, ISBN 9780321826626.

34/37


Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement

by Redmond and Wilson [13].

A very nice and graspable tour of various NoSQL database types. Examples of each type are presented with exercises that can be completed in a weekend.

Book to be used and referred to during the course, ISBN 9781934356920.

35/37


References I

[1] Gene M Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the Spring Joint Computer Conference, ACM, 1967, pp. 483–485.

[2] Dale Anderson, Column oriented database technologies, http://www.dbbest.com/blog/column-oriented-database-technologies/, 2012.

[3] Edgar F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13 (1970), no. 6, 377–387.

[4] Neal Ford, Polyglot programming, http://memeagora.blogspot.com/2006/12/polyglot-programming.html, 2006.

[5] Jim Gray, The transaction concept: Virtues and limitations, Very Large Databases, vol. 81, 1981, pp. 144–154.

36/37


References II

[6] Andy Hogg, Whiteboard it the power of graph databases, http://www.computerweekly.com/feature/Whiteboard-it-the-power-of-graph-databases, 2013.

[7] Doug Laney, 3d data management: Controlling data volume, velocity and variety, META Group Research Note 6 (2001).

[8] Abraham H. Maslow, The psychology of science, Henry Regency, 1966.

[9] Andrea Mauro, Storage scale-up vs. scale-out, http://vinfrastructure.it/2014/06/scale-out-vs-scale-in/, 2014.

[10] David Mertz, Xml matters: Putting xml in context with hierarchical, relational, and object-oriented models, http://www.ibm.com/developerworks/library/x-matters8/, 2001.

37/37


References III

[11] Brian Panulla, If libraries were like relational databases, http://ghostednotes.com/2010/12/31/if-libraries-were-like-relational-databases, 2010.

[12] Dan Pritchett, Base: An acid alternative, Queue 6 (2008), no. 3, 48–55.

[13] Eric Redmond and Jim R Wilson, Seven databases in seven weeks, Pragmatic Bookshelf, 2012.

[14] Pramod J Sadalage and Martin Fowler, Nosql distilled, Pearson Education, 2012.

[15] DatabaseJournal Staff, Examples of sql server implementations, Database Journal (2010).

[16] Wikipedia Staff, Database, https://en.wikipedia.org/wiki/Database, 2015.

[17] Saeid Zebardast, Said experts, http://blog.zebardast.ir/, 2015.


## https://www.r-bloggers.com/how-to-draw-connecting-routes-on-map-with-r-and-great-circles/

rm(list=ls())

library(tidyverse)
library(maps)
library(geosphere)
library(rgl)
library(png)

library(RNeo4j)
library(igraph)

source("library.R")
source("chapter-06-library.R")

plot_my_connection =0), ...) lines(subset(inter, lon
