Performance Management in ‘Big Data’ Applications
It’s still about the Application
Michael Kopp, Technology Strategist
@mikopp
blog.dynatrace.com
Edward Capriolo
@edwardcapriolo
m6d.com/blog
High Volume/Low Latency DBs
(Diagram labels: Java, Web, Big Data)
Key Benefits
1) Fast Read/Write
2) Horizontal Scalability
3) Redundancy and High Availability
Key Challenges
1) Even Distribution
2) Correct Schema and Access patterns
3) Understanding Application Impact
Large Parallel Batch Processing
(Diagram: Hive Server; high-level map/reduce query and batch trigger; Big Data; one JOB with tasks 1, 2, 3 and one JOB with tasks 1, 2, 3, 4 ... 754)
Key Benefits
1) Massive Horizontal Batch Job
2) Split big Problems into smaller ones
Key Challenges
1) Optimal Distribution
2) Unwieldy Configuration
3) Can easily waste your resources
What is m6d?
Impressions look like…
Map Reduce Performance
Typical MapReduce Job at m6d
Hadoop at m6d
• Critical piece of infrastructure
• Long Term Data Storage
– Raw logs
– Aggregations
– Reports
– Generated data (feedback loops)
• Numerous ETL (Extract, Transform, Load) jobs
• Scheduled and ad hoc processes
• Used directly by Tech-Team, Ad Ops, Data Science
Hadoop at m6d
• Two deployments: 'production' and 'research'
– ~500 TB, 40+ Nodes
– ~350 TB, 20+ Nodes
• Thousands of jobs
– <5 minute jobs and 12 hour Job Flows
– Mostly Hive Jobs
– Some custom code and streaming jobs
Hadoop Design Tenets
• Linear scalability by adding more hardware
• HDFS Distributed file system
– User space file system
– Blocks are replicated across nodes
– Limited semantics
• MapReduce
– Paradigm that models computation as map and reduce steps
– Data Locality
– Split Job into Tasks by Data
– Retry on failure
Schema Design Challenges
• Partition data for good distribution
– By time interval (optionally a second level)
• Partition pruning with WHERE
– Clustering (aka bucketing)
• Optimized sampling and joins
– Columnar
• Column oriented
• Raw Data Growth
• Data features change (more distinct X)
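The effect of partitioning by time interval and pruning with WHERE can be illustrated with a small sketch. It is plain Python, not Hive: the `ROWS` data and the `partition_by_date` and `scan` helpers are hypothetical names made up for illustration, and the dictionary only stands in for one directory per day.

```python
from collections import defaultdict

# Hypothetical impression records: (event_date, user_id, campaign).
ROWS = [
    ("2012-06-01", "u1", "a"),
    ("2012-06-01", "u2", "b"),
    ("2012-06-02", "u1", "a"),
    ("2012-06-03", "u3", "c"),
]


def partition_by_date(rows):
    """Group rows into partitions keyed by date, like one directory per day."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return dict(partitions)


def scan(partitions, start, end):
    """Read only the partitions whose key falls in [start, end].

    Returns the matching rows and the number of partitions actually read.
    ISO dates compare correctly as strings, so no date parsing is needed.
    """
    selected = [key for key in sorted(partitions) if start <= key <= end]
    rows = [row for key in selected for row in partitions[key]]
    return rows, len(selected)


partitions = partition_by_date(ROWS)
rows, partitions_read = scan(partitions, "2012-06-02", "2012-06-03")
print(len(rows), "rows from", partitions_read, "of", len(partitions), "partitions")
```

A filter on the partition key touches two of the three partitions here; a filter on any other column would have to read all of them.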
Key Performance Challenges
• Intermediate I/O
– Compression codec
– Block size
– Splittable formats
• Contention between jobs
• Data and Map/Reduce Distribution
• Data Skew
• Non-uniform Computation (long running tasks)
• 'Cost' of a new feature: is this justified?
• Tuning variables (spills, buffers, etc.)
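Data skew can be made visible with a rough measurement of how keys spread over reducers. The sketch below is an illustration in plain Python, not Hadoop code: `reducer_loads` and `skew_ratio` are hypothetical helpers, and the CRC32-based partitioner is an assumption standing in for whatever partitioner a real job uses.

```python
from collections import Counter
import zlib


def reducer_loads(keys, num_reducers):
    """Count records per reducer under a simple hash partitioner."""
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        # crc32 is stable across runs (Python's built-in hash() is salted).
        loads[zlib.crc32(key.encode("utf-8")) % num_reducers] += 1
    return loads


def skew_ratio(loads):
    """Max load divided by mean load; 1.0 means perfectly even."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean if mean else 0.0


# One hot key dominates the input: 90 of 100 records share it.
keys = ["hot"] * 90 + ["k%d" % i for i in range(10)]
loads = reducer_loads(keys, num_reducers=4)
print(dict(loads), round(skew_ratio(loads), 2))
```

Because every record with the same key lands on the same reducer, adding reducers does not help with a hot key; the ratio stays high until the key itself is split or pre-aggregated.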
How to handle Performance Issues?
• Profile the Job / Query?
– Who should do this? (DBA, Dev, Ops, DevOps, NoOps, Big Data Guru)
– How should we do this?
• Look at job run times day over day?
• Look at code and micro-benchmark?
• Collect Job Counters?
• Upgrade often for latest performance features?
• Investigate/purchase newer, better hardware
– More cores? RAM? 10G Ethernet? SSD?
• Read blogs?
Test Data is not like Real Data
But how to optimize the job itself?
Understanding Map/Reduce Performance
Maximum Parallelism
Actual Mapping Parallelism
Also your own Code
Attention Data Volume!
Attention Potential Choke Point!
Maximum Reduce Parallelism
Actual Reduce Parallelism
Also your own Code
Millions of Executions!!!
Understanding Map/Reduce Performance
Map/Reduce Performance
Map/Reduce behind the scenes
Serialize
De-Serialize and Serialize again
Potentially Inefficient
Too Many Files, Same Key spread all over
Expensive Synchronous Combine
De-Serialize and Serialize again
Map/Reduce Combine and Spill Performance
1) Pre Combine in Mapping Step
2) Avoid many intermediate files and combines
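Pre-combining in the mapping step can be sketched as follows. This is a plain Python illustration of the idea rather than Hadoop code; `naive_map`, `combining_map`, and `reduce_counts` are hypothetical names, and a simple count per key is assumed as the job.

```python
from collections import defaultdict


def naive_map(records):
    """Emit one (key, 1) pair per input record."""
    return [(key, 1) for key in records]


def combining_map(records):
    """Pre-aggregate counts inside the map task, then emit one pair per key."""
    counts = defaultdict(int)
    for key in records:
        counts[key] += 1
    return sorted(counts.items())


def reduce_counts(pairs):
    """Sum the values for each key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


records = ["a", "b", "a", "a", "c", "b"]
naive = naive_map(records)
combined = combining_map(records)
print(len(naive), "pairs without pre-combine,", len(combined), "with")
```

Both variants reduce to the same totals, but the pre-combined mapper hands fewer pairs to serialization, spill, and shuffle. The trade-off is the memory held by the in-mapper dictionary.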
Map/Reduce “Map” Performance
Focus on Big Hotspots
Avoid Brute Force
Save a lot of Hardware
Then Optimize Hadoop
Map/Reduce to the Max!
• Ensure Data Locality
• Optimize Map/Reduce Hotspots
• Reduce Intermediate Data and “Overhead”
• Ensure optimal Data and Compute Distribution
• Tune Hadoop Environment
Cassandra and Application Performance
A High Level look at RTB
1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions.
4. On behalf of the marketer, m6d bids on the impressions via the auction house. If m6d wins, we display our ad to the browser.
Cassandra at m6d for Real Time Bidding
• RTB: limited data is provided by the exchange
• System to store information on users
– Frequency Capping
– Visit History
– Segments (product service affinity)
• Low latency Requirements
– Less than 100 ms
– Requires fast read/write on discrete data
Cassandra design
Key Cassandra Design Tenets
• Swap/paging not possible
• Mostly schema-less
• Writes do not read
– Read-then-write is an anti-pattern
• Optimize around put and get
– Not for scan and query
• De-Normalize data
– Attempt to get all data in a single read*
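The de-normalization tenet can be illustrated by counting reads under two layouts. The sketch uses an in-memory dictionary as a stand-in for the store, so it says nothing about real Cassandra APIs; `CountingStore`, both loader functions, and the field names (`freq`, `visits`, `segments`) are hypothetical and chosen only for illustration.

```python
class CountingStore:
    """In-memory stand-in for a key/value store that counts get() calls."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        self.reads += 1
        return self.data.get(key)


FIELDS = ("freq", "visits", "segments")


def load_profile_normalized(store, user_id):
    """One get per attribute: three round trips for one user."""
    return {field: store.get((field, user_id)) for field in FIELDS}


def load_profile_denormalized(store, user_id):
    """All attributes stored under the user key: a single get."""
    return store.get(user_id)


profile = {"freq": 3, "visits": ["site-a", "site-b"], "segments": ["shoes"]}

normalized = CountingStore()
for field in FIELDS:
    normalized.put((field, "u1"), profile[field])

denormalized = CountingStore()
denormalized.put("u1", dict(profile))

a = load_profile_normalized(normalized, "u1")
b = load_profile_denormalized(denormalized, "u1")
print(normalized.reads, "reads vs", denormalized.reads)
```

Both layouts return the same profile, but the de-normalized one pays for it at write time: every update has to rewrite or extend the combined row.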
Cassandra Design Challenges
• De-normalize
– Store data to optimize reads
– Composite (multi-column) keys
• Multi-column family and Multi-tenant scenarios
• Compression settings
– Disk and cache savings
– CPU and JVM costs
• Data/Compaction settings
– Size-tiered vs. leveled (LevelDB-style) compaction
• Caching, Memtable and other tuning
How to handle performance issues?
• Monitor standard vitals (CPU, disk)?
• Read blogs and documentation?
• Use Cassandra JMX to track req/sec
• Use Cassandra JMX to track size of Column Families, rows and columns
• Upgrade often to get latest performance enhancements? *
What about the Application?
APM for Cassandra
NoSQL APM is not so different after all…
(Diagram labels: Java, Web, Database)
Key APM Problems Identified
1) Response Time Contribution
2) Data access patterns
3) Transaction to query relationship (transaction flow)
Response Time Contribution
Access Pattern (three callouts)
Contribution to Business Transaction
Connection Pool
Statement Analysis
Contribution to Business Transaction
Executions per Transaction and Total
Average and Total Execution Time
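The statement-level numbers named above (executions per transaction and total, average and total execution time) can be derived from a flat call log. The following is a minimal sketch under assumptions: the `(transaction_id, statement, ms)` tuple layout, the `analyze` function, and the sample statements are all hypothetical and not taken from any APM product.

```python
from collections import defaultdict


def analyze(calls):
    """Aggregate per-statement stats from (transaction_id, statement, ms) tuples."""
    transactions = {tx for tx, _, _ in calls}
    stats = defaultdict(lambda: {"executions": 0, "total_ms": 0.0})
    for _, statement, ms in calls:
        stats[statement]["executions"] += 1
        stats[statement]["total_ms"] += ms
    report = {}
    for statement, s in stats.items():
        report[statement] = {
            "executions": s["executions"],
            # Averaged over all transactions seen, not only those using the statement.
            "per_transaction": s["executions"] / len(transactions),
            "avg_ms": s["total_ms"] / s["executions"],
            "total_ms": s["total_ms"],
        }
    return report


calls = [
    ("tx1", "get_user", 4.0),
    ("tx1", "get_user", 6.0),
    ("tx1", "get_segments", 10.0),
    ("tx2", "get_user", 5.0),
]
report = analyze(calls)
for statement, row in sorted(report.items()):
    print(statement, row)
```

A statement with a low average time can still dominate a transaction if it is executed many times, which is why the per-transaction count is reported next to the timings.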
Where, Why, How and which Transaction…
Where and why in my Transaction
Single Statement Performance
Which Web Service
Which Business Transaction
How does this apply to NoSQL Databases?
Key APM Problems Identified
1) Response Time Contribution
2) Data access patterns
3) Transaction to query relationship (transaction flow)

1) Data Access Distribution
2) End-to-End Monitoring
3) Storage (I/O, GC) Bottlenecks
4) Consistency Level
(Diagram labels: Java, Web)
Real End-to-End Application Performance
Third Party
Services
External
End User
Our Application
End User Response Time Contribution
Understanding Cassandra’s Contribution
Which statements did the Transaction execute?
Which node were they executed against?
Which Consistency Level was used?
Contribution of each Statement
Too many calls? Data Access patterns
Understand Response Time Contribution
4 Calls, ~15 ms Contribution
5 Calls, ~50-80 ms Contribution?
Access and Data Distribution
Why and how was a statement executed?
45ms latency? 60ms waiting on the server?
Any Hotspots on the Cassandra Nodes?
Much more load on Node3?
Which Transactions are responsible?
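Both questions can be answered from a request log that records the node and the business transaction for each call. The sketch below is an assumption-laden illustration: the `(node, transaction)` tuple layout, the `hot_nodes` and `transactions_on` helpers, the "twice the median" threshold, and the sample numbers are all hypothetical.

```python
from collections import Counter
from statistics import median


def hot_nodes(requests, factor=2.0):
    """Return nodes whose request count exceeds factor times the median node load."""
    per_node = Counter(node for node, _ in requests)
    if not per_node:
        return []
    typical = median(per_node.values())
    return sorted(n for n, count in per_node.items() if count > factor * typical)


def transactions_on(requests, node):
    """Count requests per business transaction for one node, busiest first."""
    return Counter(tx for n, tx in requests if n == node).most_common()


# Made-up log: Node3 receives far more requests than its peers.
requests = (
    [("Node1", "bid")] * 10
    + [("Node2", "bid")] * 12
    + [("Node3", "bid")] * 30
    + [("Node3", "sync")] * 10
)
for node in hot_nodes(requests):
    print(node, transactions_on(requests, node))
```

Linking the node-level hotspot back to the transactions that cause it is the step that node metrics alone cannot provide.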
Specific Cassandra Health Metrics
General Health of Cassandra
Too many GC Suspensions?
Memory Issues?
Too many requests?
Conclusion
Extend Performance Focus on Application
(Diagram labels: Java, Web)
A Fast Database doesn’t make a fast Application
Intelligent MapReduce APM
(Diagram: Hive Server and master node; high-level map/reduce query and batch trigger; one JOB with tasks 1, 2, 3 and one JOB with tasks 1, 2, 3, 4 ... 754; three data/task nodes)
Simple Optimizations with big impact
Big Data is about solving Application Problems
APM is about Application Performance and Efficiency
THANK YOU