iota architecture: data virtualization and processing … 2017 iota... · ready for direct querying...

26
IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK

Upload: others

Post on 18-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

IOTA ARCHITECTURE: DATA VIRTUALIZATION

AND PROCESSING MEDIUM

DR. KONSTANTIN BOUDNIKDR. ALEXANDRE BOUDNIK

Page 2: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

DR. KONSTANTIN BOUDNIK

• Over 20+ years of expertise in distributed systems, big- and fast-data platforms

• Apache Ignite Incubator Champion • Author of 17 US patents in distributed

computing• A veteran Apache Hadoop developer• Co-author of Apache Bigtop, used by Amazon

EMR, Google Cloud Dataproc, and other major Hadoop vendors

• Co-author of the book "Professional Hadoop”

EPAM SYSTEMSCHIEF TECHNOLOGIST BIGDATA,

OPEN SOURCE FELLOW

DR.KONSTANTIN BOUDNIK

Page 3: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

DR. ALEXANDRE BOUDNIK

• Over 25 years of expertise in compilers, query engine for MPP development, computer security, distributed systems, Big Data and Fast Data

• Architect and Visionary at EPAM’s BigData CC• Focusing is on scalable, fault tolerant

distributed share-nothing clusters• Led projects for financial and banking

industries with intensive distributed in-memory calculations

EPAM SYSTEMSLEAD SOLUTION ARCHITECT

BIG& FAST DATA

DR.ALEXANDRE BOUDNIK

Page 4: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Modern data-processing architecturesIn-memory Data FabricIota in action: virtual data platformUse cases

AGENDA

Page 5: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Don’t separate batch and stream data processingCompute should be co-located with dataData mutations have to be trackedData concurrency is annoying

That’s it: you can go now

EVERYTHING IS IN ONE SLIDETHE REST IS MERE DETAILS

Page 6: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

• Lambda (λ): an anonymous function (closure)def greeting = { it -> "Hello, $it!" } assert greeting('SEC 2017') == 'Hello, SEC 2017!'

• PaaS server-less architecture (AWS Lambda and alike)exports.handler = function (event, context) { context.succeed('Hello, SEC 2017!');};

NOT ALL LAMBDAs ARE EQUALGreek alphabet needs more letters

Page 7: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

• Consists of three main layers1. High-latency layer for historical2. Speed layer for recent/stream

data3. Smart reconciliation layer

• PropertiesImmutable, one-way data ingest

• Drawbacks• Data accuracy is an issue• High operational complexity

LAMBDA: QUICK OVERVIEW

1

3

2

Page 8: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Simplified to1. Streaming source2. Streaming processing3. Stream-only serving DB

PropertiesHistorical processing is a streamReprocessing is just a stream job

Drawbacks• (Re)streaming of the historical data on replay• Moderate operational complexity

SOME LAMBDAs ARE KAPPAs

132

Page 9: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

• Processing (Lambda) architecture for slow and fast data

NEXT TO EACH OTHER

Batch (slow): ’Hello, ’

Events Stream (fast): ’I’,’M’,’C’,’S’,’ ’,’2’,’0’,’1’,’7’,’!’Serving DB

(to reconcile)

• Some Lambdas are really Kappas

Events

Stream Processor: ’Hello’, ’I’,’M’,’C’,’S’,’ ’,’2’,’0’,’1’,’7’,’!’ Serving DB

(up-to-date)Code change: repocessing Catch-up

Code change: repocessing

Page 10: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

IN-MEMORY DATA FABRICPICTURE OR IT NEVER HAPPEND

• Separation of concerns• Sources• Consumers• Abstraction and

processing

Page 11: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Data Fabric is a unified view of data in multiple systemsA layer for data access

Low redundancy; few data movementsWrite-through caching (might violate legacy app data integrity)

Affinity sensitive compute mediumHighly-available and fault tolerantVariety of APIs and integration with BigData

IN-MEMORY DATA FABRICIN A NUTSHELL

Page 12: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

NEXT STEP: IOTABIGMEMORY

In-Memory Data Fabric

Events

RDBMSCloud

storage DFS

Cache

Batch

Real-time

Page 13: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Don’t separate batch and stream data processingCompute should be co-located with dataData mutations have to be tracked (watched and versioned)Data concurrency is annoying

A STEP TOWARDS THE DATA

Page 14: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Data state, persistency and immutabilityMisperception of data primacy – what is the main copy?Versioning of data, data structures, code and metadataUniform data access, Multi-structured dataGranular data access rights and securityETL/ELT & Data Marts, Data lifecycle

ISSUES OF DATA STORING & PROCESSING

Page 15: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

TWO BREEDS OF DATAWAREHOUSES

Provides higher performanceIntegrates Data from heterogeneous sourcesSimplifies analyses: Data are ready for direct queryingExtra storage for copied dataComplex CDC for each data source

Update-Driven

Builds wrappers/mediators on top of heterogeneous databasesTranslates query to data-source specificSingle-Source-of-Truth practiceComplex information filteringMassive data pull from data sources

Heterogeneous Query-Driven

Page 16: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Query-Driven Warehouse borrowed from BigData:On demand extraction from schema-on-read dataAvoids complex ETLs

BigData addresses high query costs of Query-Driven Warehouse:

Read less data: partitioningLesser shuffle: share nothing, collocation, local filtering (pushdown)

Requires sophisticated extendable metadata

BIGDATA & QUERY-DRIVEN WAREHOUSE

Page 17: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Primary Data are nondeterministic, non-reproducible and UNIQUEpersistent and immutable

Derived Data are deterministic and reproducible EXACTLYephemeral and immutable

Versioned metadata are Primary by its naturepersistent and immutable

Versioned Code is Primary by its naturepersistent and immutable

All abovementioned are immutable and therefor, STATELESS!

TWO BREEDS OF DATAPRIMARY & DERIVED

Page 18: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

No data concurrency issuesMajority of transactions are RAMP

Leveraging functional programming paradigm (lambda again!)Read-through & memoizationHigher re-use of the code

Avoiding complex ETLsOn-demand extraction from schema-on-read data

BENEFITS OF STATELESSNESS

Page 19: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Persistent WORM stores (Write Once Read Many)

Primary dataMetadata & Code

Transient Cache storesDerived data

Compute EngineReads WORM & CacheProduces resultsPuts results to Cache

MOVING PARTS

Page 20: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

PARTITIONING VS PATCHWORKHOW TO READ LESS

• Partitions: statically defined in DDL• Patchworks: arbitrary structure of

dynamically built patches

Page 21: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Data Blocks:Describe a quantum of dataA set of semantically similar objects, limited by some dimensionsA URI: ftp, web, files, a parametrized SQL SELECT

Data Catalog:A part of versioned metadataOrganizes Data Blocks into a PatchworkIs a functional equivalent of RDBMS catalog

PATCHWORKDATA BLOCKS & DATA CATALOG

Page 22: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Cache is transparent and transient by its nature:Holds function results, instead of actual callsMight hold Data Blocks

Cache Entry includes Key, Value, and Statistics:last time value was accessed and how often (frequency)dependency depthresources spent, like CPU and IOs

Retention & Eviction:Is based on Cache Entry statisticsThe dependency graph’ Data Blocks are evicted with root entry

CACHE

Page 23: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

• Dependency graph is built from data access’ history:• Could be replaced by a reference to Data Block (compacted)

• Invalidation & Lineage is driven by dependency graph• Functions: follow memoization pattern• Scalability – just put more boxes there, if:

• WORM uses distributed Key-Value storage• Cache & Calculation engine use In-Memory Data Fabric

MISCELLANEOUS ASPECTS

Page 24: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Better data lakes: bi-directional data movementsMinimal networking, Memory-centric, Integration with legacy

Real-time personalizationBetter shopping with mobile devices, Location-based marketingNear real-time promotions, Advanced analyticsSimplified ML-driven CEP

Fraud detectionDiscovery of complex fraud patterns, based on historical dataReal-time detection of abnormal behaviorSimplified ML-driven CEP

USE CASES

Page 25: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

• Avoiding multiple copies of the data, instant consistency• In-memory caching with read-ahead/write-behind support• Batch, streaming, CEP, and (near) real-time processing• Speeding up a traditionally slow, batch oriented frameworks• Variety of data processing: read-only, read-write, transactional• Lower inter-component impedance

IOTA BENEFITS

Page 26: IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING … 2017 Iota... · ready for direct querying Extra storage for copied data Complex CDC for each data source Update-Driven

Q & A