Architecture styles and deployment on Hadoop

Architectural Patterns and Best Practices: #BigData #Hadoop
Srividhya Balasubramaniam @ Data and Information Management

Ice Breaker
120 Sec Shhhhh!

Agenda
• Why are enterprises rethinking their data strategy
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices

Analytics Architecture

Application Architecture

Platform Architecture

“Because we have been doing stuff this way for ages!” is not the norm
Re-Think!

Drivers of Change
What Has Not Changed

DATA QUALITY AND GOVERNANCE

INFORMATION SECURITY

METADATA MANAGEMENT

DATA SOURCES

DATA STORE

DATA ACCESS

ORCHESTRATION AND SCHEDULING

Challenges? Velocity, Variety and Volume

What is the Right Tool?

How should I use the tool?

Reference Architecture?

What Language and tool should I learn?

Why? Why? Why? Why?

What is data modelling like in Hadoop?

Buy or build?

Core Design Principles
• What Business Problem is being Solved?
• Define Tool Selection Criteria
• Decouple processing, store and systems
• Hybrid Architecture
• Leverage Batch and Stream
• Scalable, Reliable, Fit for Purpose, Secure, Available, Very low Admin Cost
• Supportable and Operations Monitoring
• Best Design is cheap
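One way to make "Define Tool Selection Criteria" concrete is a weighted scoring sheet. The sketch below is only an illustration: the criteria names, weights, scores and tool names (`tool_a`, `tool_b`) are all hypothetical and should be replaced with your own assessment.

```python
# Hypothetical weights per criterion (they sum to 1.0) and 1-5 scores per tool.
WEIGHTS = {"scalable": 0.3, "reliable": 0.3, "secure": 0.2, "low_admin_cost": 0.2}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted sum of per-criterion scores (missing criteria count as 0)."""
    return sum(weight * scores.get(name, 0) for name, weight in weights.items())


def rank_tools(candidates, weights=WEIGHTS):
    """Return candidate tool names sorted by weighted score, best first."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Illustrative candidates only; these are not real product assessments.
candidates = {
    "tool_a": {"scalable": 5, "reliable": 4, "secure": 3, "low_admin_cost": 2},
    "tool_b": {"scalable": 3, "reliable": 4, "secure": 4, "low_admin_cost": 5},
}
ranking = rank_tools(candidates)
```

Writing the weights down before scoring any tool keeps the business problem, not the tool, in charge of the decision.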

Typical Data Pipeline

Data Source Ingest
• RDBMS
• SEARCH
• FILES/API
• MESSAGING
• IOT/STREAM

Store Raw
• DATABASE
• SEARCH DOCUMENTS
• DIST FILE STORAGE
• QUEUE
• STREAM STORE

Process for Analysis
• BATCH
• INTERACTIVE
• STREAMING
• MESSAGING
• MACHINE LEARNING

Store
• Key Value
• Graph
• Document
• Queue
• MPP

Insights
• Analytical Models
• Visualization
• Self Service BI
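The pipeline stages above can be sketched as a chain of small functions. This is a toy, in-memory illustration under stated assumptions: the record shape (`device`, `temp`), the function names and the use of a list and a dict as "stores" are all hypothetical stand-ins for real systems.

```python
import json
from collections import Counter


def ingest(lines):
    """Ingest: parse raw JSON lines arriving from a source (file, API, queue)."""
    return [json.loads(line) for line in lines]


def store_raw(records, raw_store):
    """Store raw: keep an untouched copy before any processing."""
    raw_store.extend(records)
    return raw_store


def process(raw_store):
    """Process for analysis: a tiny batch aggregation (events per device)."""
    return Counter(record["device"] for record in raw_store)


def store_serving(aggregates, kv_store):
    """Store: write results to a key-value serving store (a dict here)."""
    kv_store.update(aggregates)
    return kv_store


def insights(kv_store, top_n=1):
    """Insights: expose the busiest devices for a report or dashboard."""
    return sorted(kv_store.items(), key=lambda item: item[1], reverse=True)[:top_n]


source = [
    '{"device": "d1", "temp": 20}',
    '{"device": "d2", "temp": 22}',
    '{"device": "d1", "temp": 21}',
]
raw = store_raw(ingest(source), [])
serving = store_serving(process(raw), {})
top = insights(serving)
```

Because each stage only depends on the output of the previous one, any stage can be swapped for a different tool without touching the others, which is the point of decoupling processing from storage.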

Storage of Messaging and Streaming: Criteria
1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10. Stream Map Reduce
11. Cost

Eg: Apache Kafka
• Guaranteed Ordering (within a partition), Parallel Clients and Stream MR
• Configurable Data Retention, Availability, Object Size
• Low cost but more admin
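Kafka's ordering guarantee applies within a partition, so messages that must stay in order need to share a key. The sketch below is a pure-Python simulation of that idea, not Kafka client code: `partition_for`, `publish` and `values_for` are hypothetical helpers, and CRC32 merely stands in for the hash a real client would use.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition; CRC32 is a stand-in for the client's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(messages, num_partitions):
    """Append (key, value) messages to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in messages:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


def values_for(key, logs, num_partitions):
    """Read back one key's values from its partition, in log order."""
    return [v for k, v in logs[partition_for(key, num_partitions)] if k == key]


messages = [
    ("sensor-a", 1),
    ("sensor-b", 1),
    ("sensor-a", 2),
    ("sensor-b", 2),
    ("sensor-a", 3),
]
logs = publish(messages, num_partitions=3)
```

Each key's values come back in the order they were published, while nothing is promised about the relative order of different partitions.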






Databases: What DB Export to Choose
1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and specific connectors for Distribution

Adaptors and GoldenGate etc.
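A delta transfer can be sketched with a simple watermark query. This is a minimal illustration using an in-memory SQLite database; the `customers` table, its `updated_at` column and the `extract_delta` helper are assumptions for the example. It shows timestamp-based delta extraction, which is simpler than the log-based CDC that dedicated replication tools perform.

```python
import sqlite3


def extract_delta(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "2024-01-01"), (2, "Bob", "2024-01-02")],
)

# An empty watermark sorts before every date string, so this is the initial full load.
first_batch, watermark = extract_delta(conn, "")

# A later change in the source; only this row should be transferred next time.
conn.execute("UPDATE customers SET name = 'Bobby', updated_at = '2024-01-03' WHERE id = 2")
second_batch, watermark = extract_delta(conn, watermark)
```

A watermark query cannot see deleted rows, which is one reason log-based CDC is often preferred for critical sources.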






Data Storage – Distributed Files: Criteria
1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB / timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source

Enterprise Distributions Selection
Cloudera, Hortonworks, MapR






Data Storage Selection Criteria
• Data Structure: Fixed, Key Value, JSON
• Access Patterns: Hierarchical, Structured, Search, Publish etc.
• Data Temperature: Hot, Warm, Cold
• TCO: Low

Elastic Cache
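Data temperature can be turned into a simple placement rule. The sketch below is illustrative only: the day thresholds and the tier names in `TIER_FOR` are hypothetical and depend entirely on your own access patterns and cost targets.

```python
from datetime import date

# Hypothetical thresholds; tune them to your own access patterns.
HOT_DAYS = 7
WARM_DAYS = 90

# Hypothetical mapping from temperature to a class of store.
TIER_FOR = {
    "hot": "cache",
    "warm": "nosql_or_sql",
    "cold": "distributed_file_storage",
}


def temperature(last_accessed, today):
    """Classify data as hot, warm or cold by days since last access."""
    age_days = (today - last_accessed).days
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    return "cold"


def storage_tier(last_accessed, today):
    """Pick a store class for a dataset based on its temperature."""
    return TIER_FOR[temperature(last_accessed, today)]


today = date(2024, 1, 10)
tier = storage_tier(date(2024, 1, 9), today)
```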






Data Storage Selection Criteria

Compare Cache, NoSQL, SQL and Search stores on:
1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)






BATCH INTERACTIVE STREAMING MESSAGING

Machine Learning
Spark ML, EMR etc.

Criteria
1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL?

Temperature of Data






Buy Vs Build ETL Decision?






Create Analytical Application

Make Insights Available Via API

Analysis and Visualization

Zeppelin, HUE etc.

Publish to Queue

Data Modelling in Hadoop & Architectural Patterns

Not only ER and Dimension Models (NoERDM)

Data Storage Format
• Text
• Sequence
• Avro
• Parquet
• RC/ORC

Know the strengths and weaknesses of each format in terms of:
• Supporting Distributions
• Processing requirements – Write, partial read, full read
• Schema Evolution
• Extract Requirements
• Storage Requirements – How big are your files?
• How important is file splittability?
• Does block compression matter?
• Does the file format support indexing?
• How easy is it to parse?
• Does it support column stats?
• Failure behavior for various file formats
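The "partial read" and "column stats" questions are easiest to see by comparing a row layout with a columnar one. The sketch below uses plain Python lists rather than a real file format; the sample records and the helpers `to_columnar` and `column_stats` are hypothetical. It only illustrates why columnar formats such as Parquet and ORC suit queries that touch a few columns.

```python
# Row layout: each record is stored together (as in text or Avro files).
rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "IN", "amount": 4.5},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_stats(values):
    """Min/max for a column chunk, the kind of stats a reader can use to skip data."""
    return {"min": min(values), "max": max(values)}


columns = to_columnar(rows)

# Partial read: summing one column touches only that column's values.
total = sum(columns["amount"])
stats = column_stats(columns["amount"])
```

With the row layout, the same sum has to read every field of every record, which is why full-record writes favor row formats and analytical scans favor columnar ones.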

Not only ER and Dimension Models (NoERDM): Compression

Codecs
• ZLIB
• LZO
• LZF
• Snappy
• Gzip
• Bzip

Considerations
• How much the size reduces
• How fast it can compress and decompress
• How can I split my compressed files? File splittability to make use of parallelism

Compression types
• Uncompressed
• Record compressed
• Block compressed

We trade I/O Loads for CPU Loads
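The I/O versus CPU trade-off can be measured directly. The sketch below compares only the codecs available in the Python standard library (zlib, gzip, bz2); LZO, LZF and Snappy need third-party packages and are left out. The sample data and the `measure` helper are assumptions for the example, and timings will vary by machine.

```python
import bz2
import gzip
import time
import zlib

# Standard-library codecs only: name -> (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def measure(data):
    """Return compression ratio and compress time for each codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: the codec must be lossless.
        assert decompress(packed) == data
        results[name] = {"ratio": len(packed) / len(data), "seconds": elapsed}
    return results


# Repetitive CSV-like sample; real data will compress less well.
sample = b"2024-01-01,sensor-a,21.5\n" * 10_000
report = measure(sample)
```

A lower ratio means less data to read from disk or network, paid for with the CPU time spent compressing and decompressing.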

Other Practices
1. Structure and Organize your repository
   a. Standard directory structure
   b. Access quota controls
   c. Stage area conventions
2. Location of HDFS files
   a. Directory structure should simplify the assignment of permissions to be granted.
   b. Eg /user, /etl, /tmp, /data, /app, /metadata
3. Partitioning, Bucketing and denormalization.
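Partitioning and bucketing can be sketched as a path convention plus a hash. The example below builds Hive-style `key=value` partition directories under `/data` and assigns a bucket by hashing a key. The table name `events`, the helper names and the use of CRC32 are assumptions for illustration; Hive's own bucketing hash differs.

```python
import zlib
from datetime import date


def partition_path(base, table, event_date):
    """Build a date-partitioned directory path in key=value style."""
    return (
        f"{base}/{table}/year={event_date.year}"
        f"/month={event_date.month:02d}/day={event_date.day:02d}"
    )


def bucket_for(key, num_buckets):
    """Assign a key to a bucket; CRC32 is a stand-in for the engine's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


path = partition_path("/data", "events", date(2024, 3, 5))
bucket = bucket_for("customer-42", num_buckets=8)
```

Partitioning lets a query skip whole directories, while bucketing spreads rows of one partition evenly and keeps equal keys in the same file.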

Data Lake / Reservoir / Refinery

Exploratory Data Analysis

Application Level Analytics

Batch and Stream Analytics – Lambda Architecture

Enterprise Data Pipeline

Architectural Patterns
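The Lambda Architecture pattern named above combines a batch view with a speed view at query time. The sketch below is a toy, in-memory version under stated assumptions: the page-view events and the function names are hypothetical, and real systems would use a batch engine, a stream processor and a serving store in place of these functions.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: count only events that arrived since the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving layer: merge both views at query time."""
    return batch[page] + speed[page]


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
recent = [{"page": "/home"}, {"page": "/checkout"}]
batch, speed = batch_view(master), speed_view(recent)
home_views = query("/home", batch, speed)
```

When the next batch run absorbs the recent events, the speed view is discarded and rebuilt, so errors in the fast path do not accumulate.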

Thank You!
Questions?
