Review of Calculation Paradigm and its Components
Namuk Park
Nov 18, 2014
Hadoop File System
Hadoop: MapReduce
Hadoop 2.0
• improve scalability
• support non-MapReduce jobs
• support heterogeneous machines
• address a common cause of low cluster utilization: map slots might be full while reduce slots are empty, and vice versa
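The last bullet can be made concrete with a little arithmetic. The sketch below is a toy model, not Hadoop code: the class name, method names, and slot counts are all made up for illustration. It compares a node with fixed map and reduce slots against the same node offering generic containers.

```java
// Toy model, not Hadoop code: fixed map/reduce slots versus generic containers.
final class SlotUtilization {

    // Fixed slots: map tasks may only use map slots, reduce tasks only reduce slots.
    static int runningWithFixedSlots(int mapSlots, int reduceSlots,
                                     int pendingMaps, int pendingReduces) {
        return Math.min(mapSlots, pendingMaps) + Math.min(reduceSlots, pendingReduces);
    }

    // Generic containers: any pending task may use any free container.
    static int runningWithContainers(int containers, int pendingMaps, int pendingReduces) {
        return Math.min(containers, pendingMaps + pendingReduces);
    }

    public static void main(String[] args) {
        int mapSlots = 8, reduceSlots = 4;          // hypothetical node capacity
        int pendingMaps = 12, pendingReduces = 0;   // map-heavy phase of a job
        int capacity = mapSlots + reduceSlots;

        int fixed = runningWithFixedSlots(mapSlots, reduceSlots, pendingMaps, pendingReduces);
        int generic = runningWithContainers(capacity, pendingMaps, pendingReduces);

        System.out.println("fixed slots: " + fixed + " of " + capacity + " busy");
        System.out.println("containers:  " + generic + " of " + capacity + " busy");
    }
}
```

In the map-heavy phase the four reduce slots sit idle under the fixed scheme, while the container scheme can hand them to waiting map tasks.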
Hadoop 2.0: Service Layers
YARN
• splits the two functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons
• a global ResourceManager (RM) and a per-application ApplicationMaster (AM)
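The division of labor can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the classes below are hypothetical stand-ins, not the YARN API, and containers are reduced to a plain counter. The point is that the RM only hands out capacity, while each AM decides how to run its own tasks.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the RM/AM split; names are hypothetical, not the YARN API.
final class YarnSplitSketch {

    // Global ResourceManager: tracks and hands out cluster capacity, nothing else.
    static final class ResourceManager {
        private int freeContainers;

        ResourceManager(int totalContainers) { this.freeContainers = totalContainers; }

        // Grants as many containers as are free, up to the requested count.
        synchronized int allocate(int requested) {
            int granted = Math.min(requested, freeContainers);
            freeContainers -= granted;
            return granted;
        }

        synchronized void release(int count) { freeContainers += count; }

        synchronized int free() { return freeContainers; }
    }

    // Per-application ApplicationMaster: schedules and tracks one job's tasks.
    static final class ApplicationMaster {
        private final String appName;
        private final ResourceManager rm;
        private final List<String> finished = new ArrayList<>();

        ApplicationMaster(String appName, ResourceManager rm) {
            this.appName = appName;
            this.rm = rm;
        }

        // Runs the tasks in waves; each wave is as large as the RM's grant.
        List<String> run(int taskCount) {
            int next = 0;
            while (next < taskCount) {
                int granted = rm.allocate(taskCount - next);
                if (granted == 0) throw new IllegalStateException("no capacity");
                for (int i = 0; i < granted; i++) {
                    finished.add(appName + "-task-" + next);
                    next++;
                }
                rm.release(granted);   // containers go back to the shared pool
            }
            return finished;
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager(4);
        ApplicationMaster am = new ApplicationMaster("wordcount", rm);
        System.out.println(am.run(10).size() + " tasks finished, "
                + rm.free() + " containers free");
    }
}
```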
YARN: MapReduce
Storm
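// Package names assume Storm 1.0 or later; 0.9.x releases used backtype.storm.* instead.
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    {……}

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout emitting sentences, parallelism hint 5
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        // Splits sentences into words; tuples are distributed randomly
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        // Counts words; tuples with the same "word" go to the same task
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            // args[0] is the topology name
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
        }
    }
}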
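The fieldsGrouping call is what makes the word count correct: every tuple with the same "word" value must reach the same counting task. A minimal sketch of that idea follows, assuming a plain hash-modulo routing rule. It uses no Storm classes, the names are hypothetical, and Storm's actual routing may differ in detail.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of fields grouping as hash partitioning; not Storm code.
final class FieldsGroupingSketch {

    // Picks a task index from the grouping field, so equal words share a task.
    static int taskFor(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    // Routes each word to its task and keeps one count table per task.
    static Map<Integer, Map<String, Integer>> countByTask(List<String> words, int numTasks) {
        Map<Integer, Map<String, Integer>> perTask = new TreeMap<>();
        for (String word : words) {
            perTask.computeIfAbsent(taskFor(word, numTasks), t -> new TreeMap<>())
                   .merge(word, 1, Integer::sum);
        }
        return perTask;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "cow", "jumped", "over", "the", "moon");
        // Each word appears in exactly one task's table, so its count is complete there.
        System.out.println(countByTask(words, 12));
    }
}
```

With shuffle grouping instead, occurrences of "the" could land on different tasks and each task would hold only a partial count.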
Storm Architecture
Lambda Architecture
query = function(all data)
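One way to read this equation: recomputing over all data is too slow to do per query, so the architecture precomputes a batch view and adds a small realtime view for data that arrived since the last batch run. The sketch below is a simplified illustration with hypothetical names. It uses event counting as the query function and in-memory maps in place of real batch, speed, and serving layers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of batch view + realtime view; not a real system.
final class LambdaSketch {

    // Builds a count-per-key view over a list of events.
    static Map<String, Long> view(List<String> events) {
        Map<String, Long> counts = new HashMap<>();
        for (String event : events) {
            counts.merge(event, 1L, Long::sum);
        }
        return counts;
    }

    // Answers a query by merging the batch view with the realtime view.
    static long query(String key, Map<String, Long> batchView, Map<String, Long> realtimeView) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        List<String> masterDataset = Arrays.asList("click", "view", "click"); // already batch-processed
        List<String> recentEvents = Arrays.asList("click");                   // arrived since the last batch run

        Map<String, Long> batchView = view(masterDataset);
        Map<String, Long> realtimeView = view(recentEvents);

        System.out.println("clicks = " + query("click", batchView, realtimeView));
    }
}
```

The merged answer should match what the same function would return if run over all data at once, which is the property the equation expresses.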
Tez
Low-level DAG framework
• executes a complex DAG of tasks
• built on YARN, a more general-purpose resource management framework
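To show what executing a DAG of tasks involves, here is a minimal sketch that orders vertices so each one runs only after all of its inputs. It uses Kahn's topological sort on a hand-built graph. The class and vertex names are hypothetical and this is not the Tez API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal DAG scheduler sketch (Kahn's algorithm); not the Tez API.
final class DagSketch {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addVertex(String v) {
        edges.putIfAbsent(v, new ArrayList<>());
        inDegree.putIfAbsent(v, 0);
    }

    void addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        edges.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
    }

    // Returns an order in which every vertex comes after all of its inputs.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            order.add(v);
            for (String next : edges.get(v)) {
                // A vertex becomes ready once its last input has finished.
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != edges.size()) throw new IllegalStateException("graph has a cycle");
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch();
        dag.addEdge("load_a", "join");
        dag.addEdge("load_b", "join");
        dag.addEdge("join", "group");
        dag.addEdge("group", "store");
        System.out.println(dag.executionOrder());
    }
}
```

A MapReduce job is the special case of a two-vertex DAG (map, then reduce); a DAG framework removes that restriction.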
Tez: Runtime API
Pig: Concepts
Non-blocking operators
• LOAD / STORE
• FOREACH __ GENERATE __
• FILTER __ BY __
Blocking operators
• GROUP __ BY __
• ORDER __ BY __
• JOIN __ BY __
Each blocking operator is translated to a MapReduce shuffle
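The difference between the two operator classes can be sketched in plain Java. This is an illustration only, with hypothetical names and no Pig code: a filter can pass each record downstream the moment it sees it, while a group-by cannot emit any group until it has consumed its whole input, which is why it forces a shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustration of non-blocking versus blocking operators; not Pig code.
final class OperatorSketch {

    // Non-blocking (like FILTER): each record is emitted as soon as it is read.
    static void filter(Iterator<String> input, Predicate<String> keep, Consumer<String> output) {
        while (input.hasNext()) {
            String record = input.next();
            if (keep.test(record)) output.accept(record);
        }
    }

    // Blocking (like GROUP BY): no group is complete until the input is exhausted.
    static Map<String, List<String>> groupBy(Iterator<String> input, Function<String, String> key) {
        Map<String, List<String>> groups = new TreeMap<>();
        while (input.hasNext()) {
            String record = input.next();
            groups.computeIfAbsent(key.apply(record), k -> new ArrayList<>()).add(record);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("apple", "banana", "avocado");

        // Records flow through one at a time.
        filter(records.iterator(), r -> r.startsWith("a"), System.out::println);

        // The result is available only after all records were read.
        System.out.println(groupBy(records.iterator(), r -> r.substring(0, 1)));
    }
}
```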
Pig: Problems
Restrictions imposed by MapReduce
• Extra intermediate output on HDFS
• Artificial synchronization barriers
• Inefficient use of resources
• Multi-query optimization
Pig: on Tez
Pig: Tez DAG
Pig: Strategies
• AM/Container Reuse
• Broadcast Edge, Object Cache
• Vertex Group
• Slow Start, Pre-launch
Pig: Performance
Complex Event Processing: Problems
• fungible data
• EDA (event-driven architecture): event-driven SOA
• EDA requires non-pipeline complex
Complex Event Processing: Paradigm
[Figure, two panels. "pipeline": Job Tracker, four Task Trackers, data. "independent": Job Tracker, four Task Trackers, data, Message Coordinator.]
References
• Hadoop YARN: The Architectural Center of Enterprise Hadoop
• Lambda Architecture
• Developing a Tez Execution Engine for Apache Pig (Apache Pig를 위한 Tez 연산 엔진 개발하기)