Hadoop in SIGMOD 2011
2011/5/20
Papers
◦ LCI: a social channel analysis platform for live customer intelligence
◦ Bistro data feed management system
◦ Apache Hadoop goes realtime at Facebook
◦ Nova: continuous Pig/Hadoop workflows
◦ A Hadoop based distributed loading approach to parallel data warehouses
◦ A batch of PNUTS: experiences connecting cloud batch and serving systems
Papers (Continued)
◦ Turbocharging DBMS buffer pool using SSDs
◦ Online reorganization in read optimized MMDBS
◦ Automated partitioning design in parallel database systems
◦ Oracle database filesystem
◦ Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse
◦ Efficient processing of data warehousing queries in a split execution environment
◦ SQL server column store indexes
◦ An analytic data engine for visualization in Tableau
Apache Hadoop Goes Realtime at Facebook
Workload Types
Facebook Messaging
High Write Throughput
Large Tables
Data Migration
Facebook Insights
Realtime Analytics
High Throughput Increments
Facebook Metrics System (ODS)
Automatic Sharding
Fast Reads of Recent Data and Table Scans
Why Hadoop & HBase
Elasticity
High write throughput
Efficient and low-latency strong consistency semantics within a data center
Efficient random reads from disk
High Availability and Disaster Recovery
Fault Isolation
Atomic read-modify-write primitives (see the counter sketch after this list)
Range Scans
Tolerance of network partitions within a single data center
Zero Downtime in case of individual data center failure
Active-active serving capability across different data centers
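The atomic read-modify-write requirement maps directly to HBase's server-side counter increments, which is what high-throughput counters like Facebook Insights rely on. Below is a minimal sketch using the HBase 0.90-era client API; the table, row, family, and qualifier names are hypothetical, not from the paper. Each incrementColumnValue call performs the read-increment-write atomically on the RegionServer under the row lock.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterIncrement {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "insights");   // hypothetical table
        long views = table.incrementColumnValue(
                Bytes.toBytes("page#42"),              // hypothetical row key
                Bytes.toBytes("stats"),                // column family
                Bytes.toBytes("views"),                // qualifier
                1L);                                   // delta, applied atomically server-side
        System.out.println("views = " + views);
        table.close();
    }
}
```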
Realtime HDFS
High Availability – AvatarNode
Hot Standby AvatarNode
Enhancements to HDFS transaction logging
Transparent Failover: DAFS(client enhancement+ZooKeeper)
Hadoop RPC compatibility
Block Availability: Placement Policy
A pluggable block placement policy
Realtime HDFS (Cont.)
Performance Improvements for a Realtime Workload
RPC Timeout
Recover File Lease
recoverLease API (lighter-weight than triggering HDFS-append recovery)
Reads from Local Replicas
New Features
HDFS sync (see the sketch after this list)
Concurrent Readers (last chunk of data)
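A minimal sketch of the sync and lease-recovery features from the two relevant points of view; the log path and class name here are hypothetical. hflush() is the later name for the 0.20-append era sync() call, and recoverLease() lets a recovering process revoke a crashed writer's lease instead of waiting for the lease timeout.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SyncAndLeaseSketch {
    // Writer role: push log edits to the DataNode pipeline so that
    // concurrent readers can see them before the file is closed.
    static void writeAndSync(FileSystem fs, Path log) throws IOException {
        try (FSDataOutputStream out = fs.create(log)) {
            out.writeBytes("edit-1\n");
            out.hflush();   // data visible to new readers; not yet forced to disk
        }
    }

    // Recovery role: run by a different process after the writer crashes,
    // before replaying the log (the recoverLease enhancement).
    static void recoverAbandonedFile(FileSystem fs, Path log) throws IOException {
        if (fs instanceof DistributedFileSystem) {
            boolean closed = ((DistributedFileSystem) fs).recoverLease(log);
            System.out.println("file closed after lease recovery: " + closed);
        }
    }
}
```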
Production HBase
ACID Compliance (RWCC: Read-Write Consistency Control; see the sketch after this list)
Atomicity (WALEdit)
Consistency
Availability Improvements
HBase Master Rewrite
Region assignment in memory -> ZooKeeper
Online Upgrades
Distributed Log Splitting
Performance Improvements
Compaction
Read Optimizations
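A toy illustration of the RWCC idea (illustrative Java, not HBase's actual classes): writes are stamped with sequence numbers, and a read only observes writes at or below the read point snapshotted when it starts, so readers never see a partially applied write.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy read-write consistency control: versions are keyed by sequence number;
// readers snapshot the read point once and ignore later writes. Simplified:
// a real implementation advances the read point only after all writes with
// smaller sequence numbers have completed.
class RwccCell {
    private final ConcurrentSkipListMap<Long, String> versions = new ConcurrentSkipListMap<>();
    private final AtomicLong nextSeq = new AtomicLong();
    private volatile long readPoint;

    void write(String value) {
        long seq = nextSeq.incrementAndGet();
        versions.put(seq, value);
        readPoint = seq;                    // "commit": make the write visible
    }

    String read() {
        long rp = readPoint;                // snapshot once per read
        Map.Entry<Long, String> e = versions.floorEntry(rp);
        return e == null ? null : e.getValue();
    }
}
```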
Deployment and Operational Experiences
Testing
Auto Testing Tool
HBase Verify
Monitoring and Tools
HBCK
More metrics
Manual versus Automatic Splitting
Add new RegionServers, not region splitting
Dark Launch (gradual rollout)
Dashboards/ODS integration
Backups at the Application layer
Schema Changes
Importing Data
LZO & zip
Reducing Network IO
Major compaction
Nova: Continuous Pig/Hadoop Workflows
Nova Overview
Scenarios
Ingesting and analyzing user behavior logs
Building and updating a search index from a stream of crawled web pages
Processing semi-structured data feeds
Two-layer programming model (Nova over Pig)
Continuous processing
Independent scheduling
Cross-module optimization
Manageability features
Abstract Workflow Model
Workflow
Two kinds of vertices: tasks (processing steps) and channels (data containers)
Edges connect tasks to channels and channels to tasks
Edge annotations (all, new, B and Δ)
Four common patterns of processing
Non-incremental (template detection)
Stateless incremental (shingling)
Stateless incremental with lookup table (template tagging)
Stateful incremental (de-duping)
Abstract Workflow Model (Cont.)
Data and Update Model (see the sketch after this list)
Blocks: base blocks and delta blocks
Channel functions: merge, chain and diff
Task/Data Interface
Consumption mode: all or new
Production mode: B or Δ
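A minimal sketch of the block/channel model (illustrative Java over keyed records, not Nova's actual API): a channel keeps one base block plus ordered delta blocks; "all" consumption merges base and deltas, "new" consumption reads only the deltas appended since the consumer's last run, and compaction folds the deltas into a fresh base block.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative channel with one base block plus ordered delta blocks,
// using upsert semantics as the merge channel function.
class Channel {
    private Map<String, String> base = new HashMap<>();                 // base block B
    private final List<Map<String, String>> deltas = new ArrayList<>(); // delta blocks

    void appendDelta(Map<String, String> delta) { deltas.add(delta); }

    // "all" consumption: merge(B, delta_1, ..., delta_n); later deltas win per key.
    Map<String, String> readAll() {
        Map<String, String> merged = new HashMap<>(base);
        for (Map<String, String> d : deltas) merged.putAll(d);
        return merged;
    }

    // "new" consumption: only the deltas appended since the consumer's last run.
    List<Map<String, String>> readNew(int lastConsumed) {
        return deltas.subList(lastConsumed, deltas.size());
    }

    // Compaction: fold all deltas into a new base block and discard them.
    void compact() {
        base = readAll();
        deltas.clear();
    }
}
```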
Workflow Programming and Scheduling
Data Compaction and Garbage Collection
Nova System Architecture
Efficient Processing of Data Warehousing Queries in a Split Execution Environment
Introduction
Two approaches
Starting with a parallel database system and adding some MapReduce features
Starting with MapReduce and adding database system technology
HadoopDB follows the second approach
Two heuristics for HadoopDB optimizations
Database systems can process data at a faster rate than Hadoop.
Minimize the number of MapReduce jobs in the SQL execution plan.
HadoopDB
HadoopDB Architecture
Database Connector
Data Loader
Catalog
Query Interface
VectorWise/X100 Database (SIMD) vs. PostgreSQL
HadoopDB Query Execution
Selection, projection, and partial aggregation (the Map and Combine phases) pushed into the node-local database systems
co-partitioned tables
MR for redistributing data
SideDB (a "database task done on the side").
Split Query Execution
Referential Partitioning
Join inside the database engine
Local join
Tables co-partitioned on foreign keys (referential partitioning)
Split MR/DB Joins
Directed join: one of the tables is already partitioned by the join key.
Broadcast join: the small table is shipped to every node.
Adding specialized joins to the MR framework: map-side join.
Tradeoff: a temporary table is needed for the join.
Another type of join: MR redistributes the data, then a directed join runs.
Split MR/DB Semijoin: like 'foreignKey IN (listOfValues)'
Can be split into two MapReduce jobs
SideDB to eliminate the first MapReduce job
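A sketch of the SideDB-style rewrite (the class, connection string, and table/column names below are illustrative, not HadoopDB's actual code): the small dimension-side query runs once as a "database task done on the side", and its result list is inlined into the selection pushed to every node-local database, so the first MapReduce job disappears.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class SemijoinRewrite {
    public static void main(String[] args) throws Exception {
        // "Side" query against a small dimension table (hypothetical DSN/schema).
        List<String> keys = new ArrayList<>();
        try (Connection side = DriverManager.getConnection("jdbc:postgresql://host/dim");
             Statement st = side.createStatement();
             ResultSet rs = st.executeQuery("SELECT d_key FROM nation WHERE d_region = 'ASIA'")) {
            while (rs.next()) keys.add(rs.getString(1));
        }
        // Rewritten 'foreignKey IN (...)' selection, shipped to each node-local database.
        String inList = keys.stream().map(k -> "'" + k + "'").collect(Collectors.joining(","));
        String nodeQuery = "SELECT * FROM orders WHERE nation_key IN (" + inList + ")";
        System.out.println(nodeQuery);
    }
}
```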
Split Query Execution (Cont.)
Post-join Aggregation
Two MapReduce jobs
Hash-based partial aggregation saves significant I/O
A similar technique is applied to TOP N selections
Pre-join Aggregation
For MR-based joins
Applied when the cardinality of the group-by and join-key columns is smaller than that of the entire table
A Query Plan in HadoopDB
Performance
No hash partition feature in Hive
Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
DB2 and Hadoop/Jaql Interactions
A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
Introduction
Why Hadoop for Teradata EDW
More disk space, and space can be easily added
HDFS as storage
MapReduce
Distributed
HDFS-blocks-to-Teradata-EDW-nodes assignment problem
Parameters: n blocks, k copies, m nodes
Goal: to assign HDFS blocks to nodes evenly and minimize network traffic
Block Assignment Problem
HDFS file F on a cluster of P nodes (each node is uniquely identified with an integer i where 1 ≤ i ≤ P)
The problem is defined by: assignment(X, Y, n, m, k, r)
X is the set of n blocks (X = {1, . . . , n}) of F
Y is the set of m nodes running PDBMS (called PDBMS nodes) (Y ⊆ {1, . . . , P})
k is the number of copies of each block; m is the number of PDBMS nodes
r is the mapping recording the replica locations of each block: r(i) returns the set of nodes that hold a copy of block i.
An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, . . . , n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j.
Block Assignment Problem (Cont.)
The problem is defined by: assignment(X, Y, n, m, k, r)
An even assignment g is an assignment such that ∀i ∈ Y ∀j ∈ Y: | |{x | 1 ≤ x ≤ n && g(x) = i}| − |{y | 1 ≤ y ≤ n && g(y) = j}| | ≤ 1.
The cost of an assignment g is defined to be cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, which is the number of blocks assigned to remote nodes.
We use |g| to denote the number of blocks assigned to local nodes by g. We have |g| = n - cost(g).
The optimal assignment problem is to find an even assignment with the smallest cost.
OBA Algorithm
Example instance: assignment(X, Y, n, m, k, r) = ({1, 2, 3}, {1, 2}, 3, 2, 1, {1 → {1}, 2 → {1}, 3 → {2}})
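A greedy sketch of even assignment under these definitions (illustrative only; the paper's OBA algorithm guarantees an optimal even assignment, while a greedy pass does not in general): each node may take at most ⌈n/m⌉ blocks; a block goes to a replica holder that still has quota, otherwise to the least-loaded node at cost 1.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EvenAssignment {
    public static void main(String[] args) {
        int n = 3;                                   // blocks X = {1, 2, 3}
        List<Integer> nodes = List.of(1, 2);         // PDBMS nodes Y, m = 2
        Map<Integer, Set<Integer>> r = Map.of(       // replica locations, k = 1
                1, Set.of(1), 2, Set.of(1), 3, Set.of(2));

        int quota = (int) Math.ceil((double) n / nodes.size()); // evenness bound
        Map<Integer, Integer> load = new HashMap<>();
        Map<Integer, Integer> g = new LinkedHashMap<>();
        int cost = 0;

        for (int block = 1; block <= n; block++) {
            // Prefer a PDBMS node that already holds a replica and has quota left.
            Integer target = r.get(block).stream()
                    .filter(nodes::contains)
                    .filter(j -> load.getOrDefault(j, 0) < quota)
                    .findFirst().orElse(null);
            if (target == null) {                    // remote assignment, cost 1
                target = nodes.stream()
                        .min(Comparator.comparingInt(j -> load.getOrDefault(j, 0)))
                        .orElseThrow();
                cost++;
            }
            g.put(block, target);
            load.merge(target, 1, Integer::sum);
        }
        System.out.println("g = " + g + ", cost = " + cost);
        // On the instance above: g = {1=1, 2=1, 3=2}, cost = 0,
        // which is even (loads 2 and 1) and therefore optimal.
    }
}
```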