Augmenting Big Data Analytics with Nirvana
TRANSCRIPT
1
Take advantage of ALL of your data
Augmenting Big Data Analytics
with Nirvana
Sept 2016 Igor Sfiligoi
2
• Nirvana® is a metadata, data placement
and data management solution optimized for
managing distributed unstructured data
• It supports many modes of operation
– In this talk we explore only how it
fits in a Big Data Analytics context
– All the other capabilities can be used alongside, but will not be discussed
• Nirvana is a commercial software product,
developed by General Atomics
• More information at:
– http://www.ga.com/nirvana
– https://en.wikipedia.org/wiki/Nirvana_(software)
What is Nirvana?
3
• Big Data Analytics is
– The process of examining
large and diverse data sets to uncover
hidden patterns and previously unknown correlations
– Extensively used both in enterprise
and in science circles
• No single tool can do the whole job
– Custom data extraction needed
to accommodate all the possible data formats
– Efficient filtering and processing frameworks needed
due to the large data volumes
What is Big Data Analytics?
4
• Structured data
– Well defined schema
– e.g. rows in a database
• Unstructured data
– Usually describes something in great detail
– Requires custom code to extract actionable information
– e.g. images -> walls with cracks, or
raw instrument readouts -> phase change coordinates
• Semi-structured data
– No fixed schema, but still easily parsable
– Several variants:
• Subset of schema fixed, others optional
• Tree like structures, where each level is well defined, depth variable
• Self describing structures
– e.g. JSON documents
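For instance, two JSON documents can share a fixed core subset of fields while the rest stays optional and variably nested. A minimal sketch (the field names and values are invented for illustration):

```python
import json

# Two hypothetical instrument records: the core subset of the schema
# ("id", "time") is fixed, the remaining fields are optional and the
# nesting depth varies.
doc_a = json.loads('{"id": 1, "time": "2016-09-01T10:00:00Z", "temp_c": 21.5}')
doc_b = json.loads('{"id": 2, "time": "2016-09-01T10:05:00Z",'
                   ' "readout": {"phase": {"x": 0.4, "y": 1.7}}}')

# Both parse with no schema declared up front; the code simply
# probes for whichever fields exist.
for doc in (doc_a, doc_b):
    print(doc["id"], doc.get("temp_c", "n/a"))
```

This is what "self describing" means in practice: the structure travels with the data, and consumers handle missing fields explicitly.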
Types of data
5
Most data comes as unstructured data
Final analysis must be done on structured data
Data bridging
How do we bridge the gap?
6
• Most data comes as unstructured data
• Final analysis must be done on structured data
• How do we bridge the gap?
– The final structured data is refined
from the original unstructured (raw) data
– The structured data is often called metadata
• Two extremes to get from raw data to metadata
– Extract metadata during ingest, drop raw data
– Keep raw data, extract metadata during analysis
Data refinement
7
• Extracting data at ingest time
– Makes analysis very fast
– But very rigid,
can only answer a fixed number of questions
• Sometimes called ETL (Extract, Transform, Load)
• This is where traditional (SQL) databases shine
– Example single node DBs: PostgreSQL, MariaDB, …
– Example large scale DBs: Teradata, Oracle, …
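As a toy illustration of ingest-time refinement (the table, field names and raw format are all made up), raw records are reduced to a fixed schema at load time; afterwards only that schema is queryable, which is exactly the rigidity described above:

```python
import sqlite3

# Hypothetical fixed schema decided at ingest time: whatever is not
# extracted here is lost to later queries (the "rigid" trade-off).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, max_temp REAL)")

raw_records = [
    "sensor=a1 temps=20.1,22.4,21.9",
    "sensor=b7 temps=18.0,19.5",
]
for rec in raw_records:
    sensor, temps = rec.split()
    name = sensor.split("=")[1]
    values = [float(v) for v in temps.split("=")[1].split(",")]
    # Extract + Transform: keep only the per-sensor maximum
    db.execute("INSERT INTO readings VALUES (?, ?)", (name, max(values)))

# Load done; analysis is now a fast SQL query over the reduced data
rows = db.execute("SELECT sensor, max_temp FROM readings ORDER BY sensor").fetchall()
print(rows)  # -> [('a1', 22.4), ('b7', 19.5)]
```

Asking for, say, the *minimum* temperature is now impossible: that information was dropped at ingest.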
Refinement at ingest
8
• Refining data at analysis time
– Extremely flexible, can answer any question
– Extremely (computationally) expensive
• Recent Big Data frameworks were developed to
tackle this at scale
– e.g. Hadoop’s MapReduce
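The same question can instead be answered at analysis time by rescanning the raw records with a map pass and a reduce pass. This is a plain-Python sketch of the MapReduce idea, not Hadoop itself, using the same invented record format as before:

```python
from collections import defaultdict

raw_records = [
    "sensor=a1 temps=20.1,22.4,21.9",
    "sensor=b7 temps=18.0,19.5",
    "sensor=a1 temps=23.0",
]

# Map: emit (key, value) pairs from each raw record
def map_record(rec):
    sensor, temps = rec.split()
    name = sensor.split("=")[1]
    for v in temps.split("=")[1].split(","):
        yield name, float(v)

# Shuffle: group emitted values by key
groups = defaultdict(list)
for rec in raw_records:
    for key, value in map_record(rec):
        groups[key].append(value)

# Reduce: one answer per key
result = {key: max(values) for key, values in groups.items()}
print(result)  # -> {'a1': 23.0, 'b7': 19.5}
```

Because the raw data is kept, swapping `max` for `min` (or any other reduction) answers a new question; the price is rereading everything each time.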
Refinement during analysis
9
• In practice, everyone wants it both ways
– Fast, and
– Flexible
Two basic approaches:
• Semi-structured data
– Keep much more metadata, with flexible schema
– Makes it relatively cheap to further refine
• Tiered systems
– Extract some metadata at ingest time
– Keep the original raw files
– Link the two together
The middle road
Metadata could be semi-structured
10
• Using the semi-structured approach enables
– Much more flexibility
– Can use some of the optimization techniques
used with truly structured data
• However
– Still cannot answer all the questions
(we lost a large fraction of the original information)
– Still not as fast as truly structured data
(flexibility has its price)
• Popularized by recent NoSQL databases
– e.g. MongoDB
– Most “traditional” (SQL) databases have added these
capabilities over the past few years, too
The semi-structured approach
11
• A tiered approach uses the best tool for the job
– A database for the metadata (possibly SQL, but not required)
– A Big Data framework for raw data processing
– A metadata-aware data management system for linking the two
• The best tool is used as appropriate
– Use the database whenever possible
(i.e. if it fits in the domain of existing metadata)
– Else
• Use the data management system to get the subset of
raw data objects to analyze (as much as possible)
• Use the Big Data framework on the subset to get the desired answers
– If appropriate, feed the new metadata into the database
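The tiered flow above can be sketched as a small decision procedure. The classes and names below are purely illustrative stand-ins for a metadata database, a raw store and a processing framework, not any actual API:

```python
# Illustrative tiered-analytics flow (all names invented).
class MetadataDB:
    def __init__(self):
        self.meta = {}  # filename -> extracted metadata

    def lookup(self, key):
        return {f: m for f, m in self.meta.items() if key in m}

class RawStore:
    def __init__(self, files):
        self.files = files  # filename -> raw contents

def answer(query_key, db, store, extract):
    hits = db.lookup(query_key)
    if hits:                                  # pre-digested tier: fast
        return hits
    for name, raw in store.files.items():     # power tier: scan raw data
        db.meta.setdefault(name, {}).update(extract(raw))
    return db.lookup(query_key)               # metadata fed back for next time

store = RawStore({"r1.txt": "cracks=4", "r2.txt": "cracks=0"})
db = MetadataDB()
extract = lambda raw: {"cracks": int(raw.split("=")[1])}

first = answer("cracks", db, store, extract)   # falls through to raw scan
second = answer("cracks", db, store, extract)  # served from metadata alone
print(first == second)  # -> True
```

The second call never touches the raw store: feeding extracted metadata back into the database is what makes repeated queries cheap.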
The tiered approach
12
Tiered Analytics in a picture
[Diagram] A query first mines the available metadata in the Database.
If it can be solved with the existing metadata (yes), the Database
returns the Answer directly: this is the Pre-digested Tier, fast but
limited. If not (no), a Database query identifies the relevant raw
files on physical storage and the Big Data framework processes the
raw files in parallel: this is the Power Tier, flexible but slower.
Optionally, the extracted metadata is saved back into the Database,
so the next queries run faster.
13
Winning strategy – stage one
Composed of three layers
• Database
• Big Data Framework
• Metadata-aware
data management system
A tiered approach to
Big Data Analytics
provides the
best competitive advantage
14
• Nirvana is the metadata-aware data management system
– Provides the means for linking metadata
with raw data objects
• Three fundamental roles
– Provides standardized schema
– Manages registration of files in the database
(plus updates, renames and deletions, autonomously)
– Bridges database and storage security domains
(user identity and permissions)
• Additionally, automated extraction of metadata from files
– Triggered on creation and update
– Extraction rules defined by system administrators
– But users can add additional metadata anytime (if authorized)
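The pattern of admin-defined extraction rules firing on file creation or update can be sketched generically; to be clear, this is not Nirvana's actual API, just an assumed illustration of the mechanism:

```python
import re

# Hypothetical admin-defined rules: a filename pattern mapped to an
# extractor over the file contents. NOT Nirvana's real rule syntax.
RULES = {
    r".*\.log$": lambda text: {"errors": text.count("ERROR")},
    r".*\.fits$": lambda text: {"kind": "image"},
}

def on_create_or_update(name, text):
    """Run every matching rule; the result would be registered
    in the metadata database alongside the file."""
    meta = {}
    for pattern, extractor in RULES.items():
        if re.match(pattern, name):
            meta.update(extractor(text))
    return meta

meta = on_create_or_update("run42.log", "ok\nERROR disk\nERROR net\n")
print(meta)  # -> {'errors': 2}
```

User-supplied metadata would simply merge additional keys into the same per-file record, subject to authorization.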
Nirvana’s role
15
• But what about Big Data SQL databases?
– e.g. Hive, Presto
• Tools like Hive are just a cost saving solution
– They do not provide capabilities not present in
high-end “traditional” SQL databases, like Teradata
– But they do provide a better value per TByte stored
(at a moderate cost in query performance)
• They should be used as an additional tier
– Hot metadata in a “traditional” database
– Rarely used metadata in a “low cost” database
– Possibly with transparent gluing between them
(e.g. Teradata QueryGrid)
Wait a minute…
16
The slides so far were assuming
a homogeneous environment
• Not a very realistic scenario
these days
A typical enterprise will have
several storage and
compute technologies deployed
• Organized into a Data Lake
Big Data Analytics in a Data Lake
17
• A single logical repository for
all data handled by an enterprise
– As opposed to having
different data in different data silos
• Logically integrated
storage and compute infrastructure
– Since data analytics requires both
• See also
http://www.slideshare.net/igor_sfiligoi/creating-a-real-data-lake-with-nirvana
What is a Data Lake?
18
• All the infrastructure is logically related, but
– Different technical solutions
are optimized for different factors
• e.g. speed vs reliability vs cost
– Not every compute platform will work
with every storage solution
• During Big Data Analysis, data must often
be migrated between repositories
– Often just to maximize efficiency
– Sometimes there simply is no other option
Data Lake Analytics challenges
19
• Moving data around manually is not an option
• A flexible data management system essential
– Global namespace
– Transparent, fully automated
data movement and replication
– Able to interface with
solutions from multiple vendors
• And it also must be metadata-aware
– Tiered Big Data analytics needs metadata-file pairing
– These pairs must be preserved across file moves/replicas
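The pairing requirement can be made concrete with a toy global-namespace sketch (paths and fields invented): metadata is keyed by the logical name, so a migration only rewrites the physical mapping underneath it.

```python
# Illustrative global namespace: logical name -> physical location.
namespace = {"/lake/exp1/run42.dat": "hdfs://clusterA/raw/run42.dat"}

# Metadata is keyed by the LOGICAL name, not the physical one.
metadata = {"/lake/exp1/run42.dat": {"cracks": 4}}

def migrate(logical, new_physical):
    # Only the physical mapping changes; the metadata-file pairing
    # survives the move untouched.
    namespace[logical] = new_physical

migrate("/lake/exp1/run42.dat", "s3://archive/run42.dat")

print(namespace["/lake/exp1/run42.dat"])  # -> s3://archive/run42.dat
print(metadata["/lake/exp1/run42.dat"])   # -> {'cracks': 4}
```

Had metadata been keyed by the physical path instead, every move or replication would have orphaned it.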
Truly integrated infrastructure
20
Real Big Data Analytics in a picture
[Diagram] As before, a query first mines the available metadata in the
Database. If it can be solved with the existing metadata (yes), the
Database returns the Answer directly: the Pre-digested Tier, fast but
limited. If not (no), the Database identifies the relevant raw files,
and the Data Management Layer performs the logical-to-physical file
mapping, locates the files across Archival, Cloud, Compatible and
Interactive storage, and handles data movement if needed. The
appropriate Big Data framework then processes the raw files: the Data
Lake Tier, flexible but slower. Optionally, the extracted metadata is
saved back into the Database, so the next queries run faster.
21
Winning strategy – stage two
Big Data Analytics
over a
truly integrated Data Lake
provides the
best competitive advantage
Composed of three layers
• Database
• Data Lake
• Flexible, metadata-aware
data management system
22
• Nirvana is the flexible,
metadata-aware data management system
– Metadata capabilities described in previous slides
• Supports multiple storage technologies,
from multiple vendors
– Creates a logical, global namespace
• Fully integrated data movement
and replication capabilities
– Can be API driven
– Plus, a fully automated policy engine, too
Nirvana’s role
See also: http://www.slideshare.net/igor_sfiligoi/building-a-global-namespace-with-nirvana