Augmenting Big Data Analytics with Nirvana
TRANSCRIPT
1
Take advantage of ALL of your data
Augmenting Big Data Analytics
with Nirvana
Sept 2016 Igor Sfiligoi
2
• Nirvana® is a metadata, data placement
and data management solution optimized for
managing distributed unstructured data
• It supports many modes of operation
– In this talk we explore only how it
fits in a Big Data Analytics context
– All the other capabilities can be used alongside, but will not be discussed
• Nirvana is a commercial software product,
developed by General Atomics
• More information at:
– http://www.ga.com/nirvana
– https://en.wikipedia.org/wiki/Nirvana_(software)
What is Nirvana?
3
• Big Data Analytics is
– The process of examining
large and diverse data sets to uncover
hidden patterns and previously unknown correlations
– Extensively used both in enterprise
and in science circles
• No single tool can do the whole job
– Custom data extraction needed
to accommodate all the possible data formats
– Efficient filtering and processing frameworks needed
due to the large data volumes
What is Big Data Analytics?
4
• Structured data
– Well defined schema
– e.g. rows in a database
• Unstructured data
– Usually describes something in great detail
– Requires custom code to extract actionable information
– e.g. images -> walls with cracks, or
raw instrument readouts -> phase change coordinates
• Semi-structured data
– No fixed schema, but still easily parsable
– Several variants:
• Subset of schema fixed, others optional
• Tree like structures, where each level is well defined, depth variable
• Self describing structures
– e.g. JSON documents
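For instance, two JSON documents can share a fixed core subset of fields while the rest stays optional and variably nested. A minimal sketch (the field names and values are invented for illustration):

```python
import json

# Two hypothetical instrument records: the core subset of the schema
# ("id", "time") is fixed, the remaining fields are optional and the
# nesting depth varies.
doc_a = json.loads('{"id": 1, "time": "2016-09-01T10:00:00Z", "temp_c": 21.5}')
doc_b = json.loads('{"id": 2, "time": "2016-09-01T10:05:00Z",'
                   ' "readout": {"phase": {"x": 0.4, "y": 1.7}}}')

# Both parse with no schema declared up front; the code simply
# probes for whichever fields exist.
for doc in (doc_a, doc_b):
    print(doc["id"], doc.get("temp_c", "n/a"))
```

This is what "self describing" means in practice: the structure travels with the data, and consumers handle missing fields explicitly.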
Types of data
5
Most data comes as unstructured data
Final analysis must be done on structured data
Data bridging
How do we bridge the gap?
6
• Most data comes as unstructured data
• Final analysis must be done on structured data
• How do we bridge the gap?
– The final structured data is refined
from the original unstructured (raw) data
– The structured data is often called metadata
• Two extremes to get from raw data to metadata
– Extract metadata during ingest, drop raw data
– Keep raw data, extract metadata during analysis
Data refinement
7
• Extracting data at ingest time
– Makes analysis very fast
– But very rigid,
can only answer a fixed number of questions
• Sometimes called ETL (Extract, Transform, Load)
• This is where traditional (SQL) databases shine
– Example single node DBs: PostgreSQL, MariaDB, …
– Example large scale DBs: Teradata, Oracle, …
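As a toy illustration of ingest-time refinement (the table, field names and raw format are all made up), raw records are reduced to a fixed schema at load time; afterwards only that schema is queryable, which is exactly the rigidity described above:

```python
import sqlite3

# Hypothetical fixed schema decided at ingest time: whatever is not
# extracted here is lost to later queries (the "rigid" trade-off).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, max_temp REAL)")

raw_records = [
    "sensor=a1 temps=20.1,22.4,21.9",
    "sensor=b7 temps=18.0,19.5",
]
for rec in raw_records:
    sensor, temps = rec.split()
    name = sensor.split("=")[1]
    values = [float(v) for v in temps.split("=")[1].split(",")]
    # Extract + Transform: keep only the per-sensor maximum
    db.execute("INSERT INTO readings VALUES (?, ?)", (name, max(values)))

# Load done; analysis is now a fast SQL query over the reduced data
rows = db.execute("SELECT sensor, max_temp FROM readings ORDER BY sensor").fetchall()
print(rows)  # -> [('a1', 22.4), ('b7', 19.5)]
```

Asking for, say, the *minimum* temperature is now impossible: that information was dropped at ingest.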
Refinement at ingest
8
• Refining data at analysis time
– Extremely flexible, can answer any question
– Extremely (computationally) expensive
• Recent Big Data frameworks were developed to
tackle this at scale
– e.g. Hadoop’s MapReduce
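The same question can instead be answered at analysis time by rescanning the raw records with a map pass and a reduce pass. This is a plain-Python sketch of the MapReduce idea, not Hadoop itself, using the same invented record format as before:

```python
from collections import defaultdict

raw_records = [
    "sensor=a1 temps=20.1,22.4,21.9",
    "sensor=b7 temps=18.0,19.5",
    "sensor=a1 temps=23.0",
]

# Map: emit (key, value) pairs from each raw record
def map_record(rec):
    sensor, temps = rec.split()
    name = sensor.split("=")[1]
    for v in temps.split("=")[1].split(","):
        yield name, float(v)

# Shuffle: group emitted values by key
groups = defaultdict(list)
for rec in raw_records:
    for key, value in map_record(rec):
        groups[key].append(value)

# Reduce: one answer per key
result = {key: max(values) for key, values in groups.items()}
print(result)  # -> {'a1': 23.0, 'b7': 19.5}
```

Because the raw data is kept, swapping `max` for `min` (or any other reduction) answers a new question; the price is rereading everything each time.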
Refinement during analysis
9
• In practice, everyone wants it both ways
– Fast, and
– Flexible
Two basic approaches:
• Semi-structured data
– Keep much more metadata, with flexible schema
– Makes it relatively cheap to further refine
• Tiered systems
– Extract some metadata at ingest time
– Keep the original raw files
– Link the two together
The middle road
Metadata could be semi-structured
10
• Using the semi-structured approach enables
– Much more flexibility
– Can use some of the optimization techniques
used with truly structured data
• However
– Still cannot answer all the questions
(we lost a large fraction of the original information)
– Still not as fast as truly structured data
(flexibility has its price)
• Popularized by recent NoSQL databases
– e.g. MongoDB
– Most “traditional” (SQL) databases have added these
capabilities over the past few years, too
The semi-structured approach
11
• A tiered approach uses the best tool for the job
– A database for the metadata (possibly SQL, but not required)
– A Big Data framework for raw data processing
– A metadata-aware data management system for linking the two
• The best tool is used as appropriate
– Use the database whenever possible
(i.e. if it fits in the domain of existing metadata)
– Else
• Use the data management system to get the subset of
raw data objects to analyze (as much as possible)
• Use the Big Data framework on the subset to get the desired answers
– If appropriate, feed the new metadata into the database
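The tiered flow above can be sketched as a small decision procedure. The classes and names below are purely illustrative stand-ins for a metadata database, a raw store and a processing framework, not any actual API:

```python
# Illustrative tiered-analytics flow (all names invented).
class MetadataDB:
    def __init__(self):
        self.meta = {}  # filename -> extracted metadata

    def lookup(self, key):
        return {f: m for f, m in self.meta.items() if key in m}

class RawStore:
    def __init__(self, files):
        self.files = files  # filename -> raw contents

def answer(query_key, db, store, extract):
    hits = db.lookup(query_key)
    if hits:                                  # pre-digested tier: fast
        return hits
    for name, raw in store.files.items():     # power tier: scan raw data
        db.meta.setdefault(name, {}).update(extract(raw))
    return db.lookup(query_key)               # metadata fed back for next time

store = RawStore({"r1.txt": "cracks=4", "r2.txt": "cracks=0"})
db = MetadataDB()
extract = lambda raw: {"cracks": int(raw.split("=")[1])}

first = answer("cracks", db, store, extract)   # falls through to raw scan
second = answer("cracks", db, store, extract)  # served from metadata alone
print(first == second)  # -> True
```

The second call never touches the raw store: feeding extracted metadata back into the database is what makes repeated queries cheap.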
The tiered approach
12
Tiered Analytics in a picture
[Diagram] A query first mines the available metadata in the Database.
If it can be solved with the existing metadata (yes), the Database
returns the Answer directly: this is the Pre-digested Tier, fast but
limited. If not (no), a Database query identifies the relevant raw
files on physical storage and the Big Data framework processes the
raw files in parallel: this is the Power Tier, flexible but slower.
Optionally, the extracted metadata is saved back into the Database,
so the next queries run faster.
13
Winning strategy – stage one
Composed of three layers
• Database
• Big Data Framework
• Metadata-aware
data management system
A tiered approach to
Big Data Analytics
provides the
best competitive advantage
14
• Nirvana is the metadata-aware data management system
– Provides the means for linking metadata
with raw data objects
• Three fundamental roles
– Provides standardized schema
– Manages registration of files in the database
(plus updates, renames and deletions, autonomously)
– Bridges database and storage security domains
(user identity and permissions)
• Additionally, automated extraction of metadata from files
– Triggered on creation and update
– Extraction rules defined by system administrators
– But users can add additional metadata anytime (if authorized)
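The pattern of admin-defined extraction rules firing on file creation or update can be sketched generically; to be clear, this is not Nirvana's actual API, just an assumed illustration of the mechanism:

```python
import re

# Hypothetical admin-defined rules: a filename pattern mapped to an
# extractor over the file contents. NOT Nirvana's real rule syntax.
RULES = {
    r".*\.log$": lambda text: {"errors": text.count("ERROR")},
    r".*\.fits$": lambda text: {"kind": "image"},
}

def on_create_or_update(name, text):
    """Run every matching rule; the result would be registered
    in the metadata database alongside the file."""
    meta = {}
    for pattern, extractor in RULES.items():
        if re.match(pattern, name):
            meta.update(extractor(text))
    return meta

meta = on_create_or_update("run42.log", "ok\nERROR disk\nERROR net\n")
print(meta)  # -> {'errors': 2}
```

User-supplied metadata would simply merge additional keys into the same per-file record, subject to authorization.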
Nirvana’s role
15
• But what about Big Data SQL databases?
– e.g. Hive, Presto
• Tools like Hive are just a cost saving solution
– They do not provide capabilities not present in
high-end “traditional” SQL databases, like Teradata
– But they do provide a better value per TByte stored
(at a moderate cost in query performance)
• They should be used as an additional tier
– Hot metadata in a “traditional” database
– Rarely used metadata in a “low cost” database
– Possibly with transparent gluing between them
(e.g. Teradata QueryGrid)
Wait a minute…
16
The slides so far were assuming
a homogeneous environment
• Not a very realistic scenario
these days
A typical enterprise will have
several storage and
compute technologies deployed
• Organized into a Data Lake
Big Data Analytics in a Data Lake
17
• A single logical repository for
all data handled by an enterprise
– As opposed to having
different data in different data silos
• Logically integrated
storage and compute infrastructure
– Since data analytics requires both
• See also
http://www.slideshare.net/igor_sfiligoi/creating-a-real-data-lake-with-nirvana
What is a Data Lake?
18
• All the infrastructure is logically related, but
– Different technical solutions
are optimized for different factors
• e.g. speed vs reliability vs cost
– Not every compute platform will work
with every storage solution
• During Big Data Analysis, data must often
be migrated between repositories
– Often just to maximize efficiency
– Sometimes there simply is no other option
Data Lake Analytics challenges
19
• Moving data around manually is not an option
• A flexible data management system essential
– Global namespace
– Transparent, fully automated
data movement and replication
– Able to interface with
solutions from multiple vendors
• And it also must be metadata-aware
– Tiered Big Data analytics needs metadata-file pairing
– These pairs must be preserved across file moves/replicas
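The pairing requirement can be made concrete with a toy global-namespace sketch (paths and fields invented): metadata is keyed by the logical name, so a migration only rewrites the physical mapping underneath it.

```python
# Illustrative global namespace: logical name -> physical location.
namespace = {"/lake/exp1/run42.dat": "hdfs://clusterA/raw/run42.dat"}

# Metadata is keyed by the LOGICAL name, not the physical one.
metadata = {"/lake/exp1/run42.dat": {"cracks": 4}}

def migrate(logical, new_physical):
    # Only the physical mapping changes; the metadata-file pairing
    # survives the move untouched.
    namespace[logical] = new_physical

migrate("/lake/exp1/run42.dat", "s3://archive/run42.dat")

print(namespace["/lake/exp1/run42.dat"])  # -> s3://archive/run42.dat
print(metadata["/lake/exp1/run42.dat"])   # -> {'cracks': 4}
```

Had metadata been keyed by the physical path instead, every move or replication would have orphaned it.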
Truly integrated infrastructure
20
Real Big Data Analytics in a picture
[Diagram] As before, a query first mines the available metadata in the
Database. If it can be solved with the existing metadata (yes), the
Database returns the Answer directly: the Pre-digested Tier, fast but
limited. If not (no), the Database identifies the relevant raw files,
and the Data Management Layer performs the logical-to-physical file
mapping, locates the files across Archival, Cloud, Compatible and
Interactive storage, and handles data movement if needed. The
appropriate Big Data framework then processes the raw files: the Data
Lake Tier, flexible but slower. Optionally, the extracted metadata is
saved back into the Database, so the next queries run faster.
21
Winning strategy – stage two
Big Data Analytics
over a
truly integrated Data Lake
provides the
best competitive advantage
Composed of three layers
• Database
• Data Lake
• Flexible, metadata-aware
data management system
22
• Nirvana is the flexible,
metadata-aware data management system
– Metadata capabilities described in previous slides
• Supports multiple storage technologies,
from multiple vendors
– Creates a logical, global namespace
• Fully integrated data movement
and replication capabilities
– Can be API driven
– Plus, a fully automated policy engine, too
Nirvana’s role
See also: http://www.slideshare.net/igor_sfiligoi/building-a-global-namespace-with-nirvana