Big Data Technology
Roadmap
● Map Reduce
● Hadoop
    ● Concepts
    ● HDFS
    ● Architecture
● Hadoop Ecosystem
● Lambda Architecture
● New trends
M/R: What is it?
● A different programming paradigm
● Based on a Google paper (2004)
● Automatic parallelization and distribution
● I/O scheduling
● Fault tolerance
● Status and monitoring
M/R: The paradigm
● Input & output: sets of key/value pairs
● Groups and sorts large amounts of data
● Job = two phases = Mapper & Reducer
    ● Map(in_key, in_value) → list(interm_key, interm_value)
    ● Reduce(interm_key, list(interm_value)) → list(out_key, out_value)
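The two signatures above can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch, not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are hypothetical names chosen for illustration, and the in-memory grouping stands in for the distributed group & sort step.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(in_key: str, in_value: str) -> Iterator[Tuple[str, int]]:
    """Map: (document name, text) -> list of (word, 1)."""
    for word in in_value.lower().split():
        yield word, 1


def reduce_fn(interm_key: str, interm_values: List[int]) -> Iterator[Tuple[str, int]]:
    """Reduce: (word, [1, 1, ...]) -> list of (word, total)."""
    yield interm_key, sum(interm_values)


def run_job(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Iterable[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    # Map phase: every input pair produces zero or more intermediate pairs.
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in records:
        for interm_key, interm_value in mapper(in_key, in_value):
            # Group: collect all values that share an intermediate key.
            groups[interm_key].append(interm_value)

    # Sort by key, then run the reduce phase once per key.
    output: List[Tuple[str, int]] = []
    for interm_key in sorted(groups):
        output.extend(reducer(interm_key, groups[interm_key]))
    return output


docs = [("doc1", "big data big cluster"), ("doc2", "data node")]
result = run_job(docs, map_fn, reduce_fn)
print(result)  # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

In a real cluster the map calls run in parallel on different nodes, and the grouped keys are spread over several reducers.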
Hadoop: What is it?
● Framework based on Google MapReduce (GMR) and the Google File System (GFS)
● Apache project
● Developed in Java
● Multiple applications
● Used by many companies
● De facto standard in the community
Hadoop: HDFS concepts
● Distributed file system, layered on top of a native file system (ext3, xfs...)
● Works best with very large files
● Redundancy: replication factor of 3 by default
● Poor random seeks, no append!
● Scales well at rack level, not at data-center level
● Files are divided into blocks of 128 MB to 256 MB
● Computation is sent to the data!
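The block size and replication figures above translate into simple arithmetic. The sketch below is illustrative only: `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API, and it assumes a 128 MB block size, a replication factor of 3, and that the last block only occupies the bytes it actually holds.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes: int,
                   block_size_bytes: int = 128 * MB,
                   replication: int = 3):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    # Ceiling division: a partial last block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # Every block is stored `replication` times across the cluster.
    block_replicas = num_blocks * replication
    # Assumption: the last block is not padded, so raw usage is size x replication.
    raw_bytes = file_size_bytes * replication
    return num_blocks, block_replicas, raw_bytes


# A 1 GB file: 8 blocks, 24 block replicas, 3 GB of raw storage.
print(hdfs_footprint(1024 * MB))

# A 300 MB file: two full blocks plus a 44 MB block, so 3 blocks and 9 replicas.
print(hdfs_footprint(300 * MB))
```

The same arithmetic shows why many small files are a poor fit: each file costs at least one block entry and three replicas, however little data it holds.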
Hadoop: Advanced
● Distributed caches
● Partitioner
● Sort comparator
● Group comparator
● Combiner
● Input format & Record reader
● MultiInput
● MultiOutput
● Compression
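Two items from this list, the partitioner and the combiner, are easy to sketch outside Hadoop. The code below is a plain-Python illustration under stated assumptions: `partition` and `combine` are hypothetical names, and CRC32 stands in for whatever hash function a real partitioner would use. The idea is that a combiner pre-aggregates map output on the mapper side, and a partitioner decides which reducer receives each key.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Hash partitioning: the same key always goes to the same reducer."""
    # CRC32 is used because it is stable across runs (an illustrative choice).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Combiner: sum counts locally before anything crosses the network."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Output of one mapper: four pairs shrink to two after combining.
map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
combined = combine(map_output)
print(combined)  # [('big', 3), ('data', 1)]

# Route each combined pair to one of 4 reducers.
routed: Dict[int, List[Tuple[str, int]]] = {}
for key, value in combined:
    routed.setdefault(partition(key, 4), []).append((key, value))
```

A combiner is only safe when the reduce operation is associative and commutative, as summing is here.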
Hadoop: Conclusions
● Simplifies large-scale computation
● Hides parallel programming issues
● Easy to get into & develop with (extensive documentation)
● Widely used & maintained by the community
● Possibility of dropping the RDBMS where it is the bottleneck!
Lambda Architecture: Motivation
● Real time use cases
● Business analytics
● Batch processing vs Real Time
● Problem!
● Low latency read & update
● Scalable & fault tolerant
● Something else needed!
Lambda Architecture: Lambdoop
● Unified technology stack
● High level programming environment
● Management tools
New trends: Spark
● Next-generation MapReduce
● Integrated with, but not dependent on, Hadoop
● Fast, memory-optimized execution engine
● Avoids many Hadoop problems
    ● Overhead
    ● High latency
    ● Many disk writes
● In-memory cache
● Flexible execution graph
● Much faster than MapReduce (up to 100x)
● Shark (SQL)
● Supports streaming (beta)
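The effect of the in-memory cache on iterative jobs can be shown with a toy model. This is not Spark code: `DiskSource` and `run_iterations` are hypothetical names, and a read counter stands in for expensive disk I/O. In Spark itself, caching is requested on a dataset, for example with `cache()` or `persist()` on an RDD.

```python
from typing import List, Optional


class DiskSource:
    """Toy stand-in for a dataset stored on disk; counts how often it is read."""

    def __init__(self, records: List[float]) -> None:
        self._records = list(records)
        self.reads = 0

    def read(self) -> List[float]:
        self.reads += 1  # each call models one full pass over the disk
        return list(self._records)


def run_iterations(source: DiskSource, iterations: int, cache: bool) -> float:
    """Sum the dataset once per iteration, with or without an in-memory copy."""
    cached: Optional[List[float]] = source.read() if cache else None
    total = 0.0
    for _ in range(iterations):
        # Without a cache, every iteration goes back to disk.
        data = cached if cached is not None else source.read()
        total += sum(data)
    return total


uncached = DiskSource([1.0, 2.0, 3.0])
cached = DiskSource([1.0, 2.0, 3.0])
print(run_iterations(uncached, 5, cache=False), uncached.reads)  # 30.0 5
print(run_iterations(cached, 5, cache=True), cached.reads)       # 30.0 1
```

Both runs compute the same result; only the number of passes over the stored data differs, which is where chained MapReduce jobs lose time.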
BIG DATA TECHNOLOGY
● Juanjo Mostazo
● [email protected]
● http://www.slideshare.net/juanjmostazo/mr-hadoop-cbase