Architecture styles and deployment on Hadoop
TRANSCRIPT
Architectural Patterns and Best Practices: #BigData #Hadoop
Srividhya Balasubramaniam @ Data and Information Management
Ice Breaker: 120 Sec. Shhhhh!
Agenda
• Why are enterprises re-thinking their data strategy
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices
Analytics Architecture
Application Architecture
Platform Architecture
“Because we have been doing stuff this way for ages!…” is not the norm. Re-Think!
Drivers of Change
What Has Not Changed
DATA QUALITY AND GOVERNANCE
INFORMATION SECURITY
METADATA MANAGEMENT
DATA SOURCES
DATA STORE
DATA ACCESS
ORCHESTRATION AND SCHEDULING
Challenges? Velocity, Variety and Volume
What is the Right Tool?
How should I use the tool?
Reference Architecture?
What Language and tool should I learn?
Why? Why? Why? Why?
What's data modelling like in Hadoop?
Buy or build?
Core Design Principles
• What Business Problem is being Solved?
• Define Tool Selection Criteria
• Decouple processing, store and systems
• Hybrid Architecture
• Leverage Batch and Stream
• Scalable, Reliable, Fit for Purpose, Secure, Available, Very low Admin Cost
• Supportable and Operations Monitoring
• Best Design is cheap
Typical Data Pipeline
Data Source Ingest
• RDBMS
• SEARCH
• FILES/API
• MESSAGING
• IOT/STREAM
Store Raw
• DATABASE
• SEARCH DOCUMENTS
• DIST FILE STORAGE
• QUEUE
• STREAM STORE
Process for Analysis
• BATCH
• INTERACTIVE
• STREAMING
• MESSAGING
• MACHINE LEARNING
Store
• Key Value
• Graph
• Document
• Queue
• MPP
Insights
• Analytical Models
• Visualization
• Self Service BI
Storage of Messaging and Streaming: Criteria
1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10. Stream Map Reduce
11. Cost
E.g. Apache Kafka
• Guaranteed Ordering, Parallel Clients and Stream MR
• Configurable Data Retention, Availability, Object Size
• Low cost but more admin
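The ordering and retention criteria can be illustrated with a toy in-memory log. This is a minimal sketch and not Kafka's API: the class `PartitionedLog`, its methods, the CRC32-based partitioner, and the record-count retention cap are all hypothetical names and simplifications chosen for illustration. It shows that ordering holds within a partition (records with the same key land in the same partition) and that retention is a configurable limit.

```python
import zlib
from collections import deque


class PartitionedLog:
    """Toy append-only log split into partitions (illustration only)."""

    def __init__(self, num_partitions=3, max_records_per_partition=1000):
        # Retention is modeled as a record-count cap per partition (assumption);
        # real brokers typically retain by time or size.
        self.partitions = [
            deque(maxlen=max_records_per_partition) for _ in range(num_partitions)
        ]
        self.next_offset = [0] * num_partitions

    def partition_for(self, key):
        # Deterministic hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        offset = self.next_offset[p]
        self.partitions[p].append((offset, key, value))
        self.next_offset[p] += 1
        return p, offset

    def read(self, partition, from_offset=0):
        # Records come back in append order within a partition.
        return [r for r in self.partitions[partition] if r[0] >= from_offset]


log = PartitionedLog(num_partitions=3, max_records_per_partition=2)
for i in range(4):
    log.append("sensor-a", f"reading-{i}")

p = log.partition_for("sensor-a")
retained = log.read(p)  # only the two newest records survive retention
```

Several consumers could each read a different partition in parallel, which is the "Parallel Clients" criterion; ordering across partitions is not guaranteed in this model.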
Databases: What DB Export to Choose
1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and specific connectors for Distribution Adaptors and Golden Gate etc.
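The "Delta Data Transfers" criterion can be sketched with a high-water-mark query. This is a minimal illustration under stated assumptions: the `customers` table, its `updated_at` column, and the helper `extract_delta` are hypothetical, and an in-memory SQLite database stands in for the source RDBMS. Each run transfers only rows changed since the previous run.

```python
import sqlite3


def extract_delta(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "alice", "2024-01-01T00:00:00"),
        (2, "bob", "2024-01-02T00:00:00"),
    ],
)

# First run: the empty watermark sorts before every timestamp, so this is a full load.
first_batch, watermark = extract_delta(conn, "")

# One source row changes; the next run transfers only that row.
conn.execute(
    "UPDATE customers SET name = 'bobby', updated_at = '2024-01-03T00:00:00' "
    "WHERE id = 2"
)
second_batch, watermark = extract_delta(conn, watermark)
```

A timestamp watermark like this cannot see deleted rows, which is one reason log-based CDC tools are on the criteria list.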
Data Storage – Distributed Files: Criteria
1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB / timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source
Enterprise Distributions Selection
Cloudera, Hortonworks, MapR
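One way to apply selection criteria like those above is a weighted scoring matrix. The sketch below is only an illustration: the function `weighted_score`, the weights, the candidate names, and every score are hypothetical placeholders, not measurements of any real product or distribution.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores (higher is better)."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


# Hypothetical weights: how much each criterion matters for one workload.
weights = {"latency": 3, "storage_cost": 2, "durability": 4, "tool_support": 1}

# Placeholder scores on a 1-5 scale; not measurements of any real product.
candidates = {
    "option_a": {"latency": 4, "storage_cost": 2, "durability": 5, "tool_support": 3},
    "option_b": {"latency": 2, "storage_cost": 5, "durability": 4, "tool_support": 4},
}

# Rank candidates from best to worst weighted score.
ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
```

Writing the weights down before scoring any candidate keeps the choice tied to the business problem rather than to a favourite tool.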
Data Storage Selection Criteria
Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish etc.
Data Temperature: Hot, Warm, Cold
TCO: Low
Elastic Cache
Data Storage Selection Criteria
Cache, NoSQL, SQL, Search
1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)
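Data temperature (hot, warm, cold) can be made concrete by classifying data on how recently it was accessed. This is a minimal sketch under loud assumptions: the function `temperature`, the 7-day and 90-day thresholds, and the `TIER` mapping to store types are all hypothetical and would need tuning per workload.

```python
from datetime import datetime, timedelta


def temperature(last_accessed, now, hot_days=7, warm_days=90):
    """Classify data as hot, warm, or cold from its last access time."""
    age = now - last_accessed
    if age <= timedelta(days=hot_days):
        return "hot"
    if age <= timedelta(days=warm_days):
        return "warm"
    return "cold"


# Hypothetical mapping from temperature to a class of store.
TIER = {
    "hot": "cache / key-value store",
    "warm": "SQL / NoSQL store",
    "cold": "distributed file storage",
}

now = datetime(2024, 6, 1)
labels = [
    temperature(datetime(2024, 5, 30), now),  # 2 days old
    temperature(datetime(2024, 4, 1), now),   # 61 days old
    temperature(datetime(2023, 1, 1), now),   # over a year old
]
```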
BATCH, INTERACTIVE, STREAMING, MESSAGING
Machine Learning
Spark ML, EMR etc.
Criteria
1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL?
Temperature of Data
Buy Vs Build ETL Decision?
Create Analytical Application
Make Insights Available Via API
Analysis and Visualization
Zeppelin, HUE etc.
Publish to Queue
Data Modelling in Hadoop & Architectural Patterns
Not only ER and Dimension Models (NoERDM)
Data Storage Format
• Text
• Sequence
• Avro
• Parquet
• RC/ORC
Know the strengths and weaknesses of each format in terms of:
• Supporting Distributions
• Processing requirements – Write, partial read, full read
• Schema Evolution
• Extract Requirements
• Storage Requirements – How big are your files?
• How important is file splittability?
• Does block compression matter?
• Does the file format support indexing?
• How easy is it to parse?
• Does it support column stats?
• Failure behavior for various file formats
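The "partial read" consideration is where row-oriented and column-oriented formats differ most. The sketch below uses plain Python structures rather than any real file format; `to_columnar` and the sample records are hypothetical. It counts how many values a query over one field has to touch in each layout.

```python
def to_columnar(rows):
    """Pivot a list of same-shaped row dicts into a dict of column lists."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row[name])
    return columns


rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "IN", "amount": 5.5},
]
columns = to_columnar(rows)

# Row layout: a partial read still walks every field of every record.
values_scanned_row = sum(len(row) for row in rows)

# Column layout: a partial read touches only the requested column.
values_scanned_col = len(columns["amount"])

total = sum(columns["amount"])
```

The gap widens with wider tables, which is why columnar formats tend to suit analytical scans while row formats tend to suit full-record writes and reads.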
Not only ER and Dimension Models (NoERDM): Compression
Codecs
• ZLIB
• LZO
• LZF
• Snappy
• Gzip
• Bzip2
Considerations
• How much the size reduces
• How fast it can compress and decompress
• How can I split my compressed files? File splittability to make use of parallelism
Compression types
• Uncompressed
• Record compressed
• Block compressed
We trade I/O Loads for CPU Loads
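The I/O versus CPU trade-off can be measured directly. This sketch compares only the codecs that ship with the Python standard library (zlib, gzip, bzip2); Snappy, LZO and LZF need third-party packages and are left out. The sample payload is an assumption, and real ratios and timings depend heavily on the data.

```python
import bz2
import gzip
import time
import zlib

# Repetitive sample payload (an assumption; real ratios depend on the data).
data = b"2024-01-01,store-42,SKU-1001,3,19.99\n" * 20000

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # round trip must be lossless
    # ratio < 1 means fewer bytes to store and move (less I/O);
    # seconds is the CPU time paid to get there.
    results[name] = {"ratio": len(packed) / len(data), "seconds": elapsed}

for name, r in results.items():
    print(f"{name}: ratio={r['ratio']:.4f} compress_time={r['seconds']:.4f}s")
```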
Other Practices
1. Structure and organize your repository
a. Standard directory structure
b. Access quota controls
c. Stage area conventions
2. Location of HDFS files
a. Directory structure should simplify the assignment of permissions to be granted.
b. E.g. /user, /etl, /tmp, /data, /app, /metadata
3. Partitioning, bucketing and denormalization
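Directory conventions, partitioning and bucketing can be combined in a small path helper. This is a sketch under assumptions: `partition_path` and `bucket_for` are hypothetical helpers, the `sales` dataset is made up, and CRC32 stands in for whatever hash function a given engine uses for bucketing. The `key=value` directory naming follows the Hive-style partition layout.

```python
import zlib
from datetime import date


def partition_path(root, dataset, day):
    """Build a key=value partition directory (Hive-style naming)."""
    return (
        f"{root}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    )


def bucket_for(key, num_buckets):
    # CRC32 is a stand-in hash (assumption); engines use their own hash functions.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


path = partition_path("/data", "sales", date(2024, 3, 7))
bucket = bucket_for("customer-123", 8)
```

Partitioning by date lets a query skip whole directories, while bucketing spreads one partition's rows across a fixed number of files by key.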
Data Lake / Reservoir / Refinery
Exploratory Data Analysis
Application Level Analytics
Batch and Stream Analytics – Lambda Architecture
Enterprise Data Pipeline
Architectural Patterns
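The Lambda Architecture pattern listed above pairs a batch layer that recomputes views from the full dataset with a speed layer that covers events arriving since the last batch run. The sketch below is a toy model: `batch_view`, `SpeedLayer`, `query`, and the page-view events are hypothetical names and data, with in-memory counters standing in for real batch and stream engines.

```python
from collections import Counter


def batch_view(master_events):
    """Recompute the view from the full, immutable master dataset."""
    return Counter(e["page"] for e in master_events)


class SpeedLayer:
    """Incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1


def query(page, batch, speed):
    # Serving layer: merge the batch view with the real-time view.
    return batch[page] + speed.view[page]


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch = batch_view(master)

speed = SpeedLayer()
speed.on_event({"page": "/home"})  # arrived after the batch run

home_views = query("/home", batch, speed)
cart_views = query("/cart", batch, speed)
```

When the next batch run absorbs the new events into the master dataset, the speed layer's view for that period can be discarded.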
Thank You!
Questions?