Small intro to Big Data
Outline
1. What is Big Data?
2. Storing
3. Batch & Streams processing
4. Resource Managers
5. Machine Learning
6. Analysis & Visualization
7. Other
CAP theorem (Brewer’s theorem)
In a distributed system you can have
at most two of these three guarantees:
● Consistency
● Availability
● Partition Tolerance
Relational scaling (horizontal)
Example limitations:
● Max 48 nodes
● Read-only nodes
● Cross-shard joins…
● Auto-increments
● Distributed transactions,
possible, but…
It can work!
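Why the limitations above bite can be sketched with a hypothetical hash-sharded setup (the class name and shard count are illustrative, not tied to any particular database):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical hash-based shard routing for a horizontally scaled
// relational database; names and the shard count are illustrative.
class ShardRouter {
    static final int SHARD_COUNT = 4;

    // Route a primary key to one of the shards by hashing it.
    static int shardFor(String key) {
        return Math.abs(key.hashCode() % SHARD_COUNT);
    }

    // A cross-shard join must touch every shard holding a joined row,
    // which is why cross-shard joins show up in the limitation list.
    static Set<Integer> shardsTouched(List<String> keys) {
        Set<Integer> shards = new HashSet<>();
        for (String key : keys) {
            shards.add(shardFor(key));
        }
        return shards;
    }
}
```

Auto-increments suffer for the same reason: a globally sequential counter would need coordination across all shards on every insert.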
NoSQL (Not only SQL)
● Key-value (Redis, Dynamo, ...)
● Column (Cassandra, HBase, ...)
● Document (MongoDB, … )
● Graph (Neo4j, …)
● Multi-model (OrientDB, …)
Apple - 115k Cassandra nodes with
over 10PB of data!
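The key-value family above is the simplest to sketch: opaque keys map to values, with no schema and no joins. This in-memory class is illustrative only, not the Redis or Dynamo API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the key-value data model: a store maps opaque
// keys to values; lookups are by key only.
class KVStore<K, V> {
    private final Map<K, V> data = new HashMap<>();

    void put(K key, V value) { data.put(key, value); }

    // Returns null if the key is absent; no queries over values.
    V get(K key) { return data.get(key); }

    void delete(K key) { data.remove(key); }
}
```

Querying by value needs secondary indexes, which is where the column, document and graph families differentiate themselves.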
Batch Processing
● Processes data from one or more
sources over a longer period of
time (e.g. a day, a month)
● Source: db, Apache Parquet, ...
● Not real-time
● Can take hours or more
Apache Hadoop
● Based on Google’s MapReduce paper
● First release in 2006!
● Map -> (shuffle) -> Reduce
● Was the beginning of many
projects
MapReduce
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits a (word, 1) pair for every token in the input line.
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Hadoop Wordcount - part I
Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Sums the counts collected for each word and writes (word, total).
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Hadoop Wordcount - part II
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  // The reducer doubles as a combiner: summing is associative.
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hadoop Wordcount - part III
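The Map -> (shuffle) -> Reduce flow of the WordCount job above can be sketched with plain Java collections, no Hadoop required (class and method names here are ours, chosen to mirror the phases):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of WordCount with the shuffle made explicit:
// map emits (word, 1) pairs, the shuffle groups them by key,
// and reduce sums each group.
class LocalMapReduce {

    // Map phase: one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapPhase(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for each key.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(mapPhase(line));
        }
        return reducePhase(shuffle(pairs));
    }
}
```

The real framework distributes the map and reduce calls across nodes and performs the shuffle over the network; the data flow is the same.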
Apache Spark
RDD (Resilient Distributed Dataset)
DAG (Directed Acyclic Graph)
● RDD - map, filter, count etc.
● Spark SQL
● MLlib
● GraphX
● Spark Streaming
● API: Scala, Java, Python, R*
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))  // split lines into words
  .map(word => (word, 1))            // pair each word with a count of 1
  .reduceByKey(_ + _)                // sum the counts per word
counts.saveAsTextFile("hdfs://...")
Spark Wordcount
Source: http://spark.apache.org/examples.html
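What that chain computes can be checked on a local collection with Java streams; `reduceByKey` has no direct collection counterpart, so `groupingBy` with a count of 1 per word stands in for the `map` + `reduceByKey` pair (the class name is ours):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Local mimic of the Spark word count: flatMap splits lines into
// words; groupingBy + summingInt(1) plays the role of
// map(word -> (word, 1)).reduceByKey(_ + _).
class LocalSparkStyle {
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(
                        word -> word,
                        Collectors.summingInt(word -> 1)));
    }
}
```

The difference in Spark is that the RDD is partitioned across the cluster and the per-key reduction happens before the shuffle, on each partition.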
Apache Flink
DAGs with iterations
● Batch & native streaming
● FlinkML
● Table API & SQL (Beta)
● Gelly - graph analytics
● FlinkCEP - detect patterns in
data streams
● Compatible with Apache
Hadoop and Apache Storm APIs
● API: Scala, Java, Python*
Stream processing
● Near real-time/real time
● Processing (usually) does not
end
● Source: files, Apache Kafka,
Socket, Akka Actors, Twitter,
RabbitMQ etc
● Event time vs processing time
● Windows - fixed, sliding, session
● Watermarks
● State
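The event-time concepts above can be sketched in a few lines; the types and names below are illustrative, not any framework's API:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// An event carries its own event time, distinct from the (possibly
// much later) processing time at which the engine sees it.
class Event {
    final String key;
    final long eventTimeMs;
    Event(String key, long eventTimeMs) {
        this.key = key;
        this.eventTimeMs = eventTimeMs;
    }
}

// Fixed (tumbling) event-time windows plus a simple watermark.
class EventTimeWindows {
    static final long WINDOW_SIZE_MS = 60_000L; // 1-minute windows

    // Start of the fixed window the event's *event time* falls into.
    static long windowStart(Event e) {
        return e.eventTimeMs - (e.eventTimeMs % WINDOW_SIZE_MS);
    }

    // Group a batch of events into their event-time windows.
    static Map<Long, List<Event>> byWindow(List<Event> events) {
        return events.stream()
                .collect(Collectors.groupingBy(EventTimeWindows::windowStart));
    }

    // Watermark heuristic: assume events arrive at most maxDelayMs
    // late, so anything older than (max event time seen - maxDelayMs)
    // can be considered complete and its windows fired.
    static long watermark(List<Event> seen, long maxDelayMs) {
        long maxSeen = Long.MIN_VALUE;
        for (Event e : seen) {
            maxSeen = Math.max(maxSeen, e.eventTimeMs);
        }
        return maxSeen - maxDelayMs;
    }
}
```

Sliding and session windows generalize `windowStart`: a sliding window assigns each event to several overlapping windows, a session window closes after a gap of inactivity.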
Stream processing - Differences
● Native/micro-batch
● Latency
● Throughput
● Delivery guarantees
● Resource managers
● API - compositional/declarative
● Maturity
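The native/micro-batch distinction is the easiest to make concrete: a native engine handles records one at a time, while a micro-batch engine chops the stream into small batches and processes each batch at once, trading latency for throughput. A sketch of the chopping (names are ours):

```java
import java.util.ArrayList;
import java.util.List;

// Split an incoming record stream into fixed-size micro-batches;
// each batch is then processed as one small batch job.
class MicroBatcher {
    static <A> List<List<A>> microBatches(List<A> stream, int batchSize) {
        List<List<A>> batches = new ArrayList<>();
        for (int i = 0; i < stream.size(); i += batchSize) {
            batches.add(stream.subList(i, Math.min(i + batchSize, stream.size())));
        }
        return batches;
    }
}
```

No record in a batch can be emitted before the batch closes, so the batch interval puts a floor under end-to-end latency.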
Stream processing
● Apache Storm
● Apache Storm Trident
● Alibaba JStorm
● Twitter Heron
● Apache Spark Streaming
● Apache Flink
● Apache Beam
● Apache Kafka Streams
● Apache Samza
● Apache Gearpump
● Apache Apex
● Apache Ignite Streaming
● Apache S4
● ...
Resource management
● Apache YARN
● Apache Mesos (1.0.1!)
● Apache Slider - deploy existing
apps on YARN
● Apache Myriad - YARN on
Mesos
● DC/OS
Source: https://docs.mesosphere.com/wp-content/uploads/2016/04/[email protected]
Analysis
SQL Engines & Querying
● Apache Hive
● Apache Pig
● Apache HAWQ
● Apache Impala
● Apache Phoenix
● Apache Spark SQL
● Apache Drill
● Facebook Presto
● ...
Hadoop-related
● Apache Sqoop
● Apache Flume
● Apache Oozie
● Hue
● Apache HDFS
● Apache Ambari
● Apache Knox
● Apache ZooKeeper
Conclusions
● There is a lot of it!
● https://pixelastic.github.io/pokemonorbigdata/
● If you want to learn, start with
the SMACK stack (Spark, Mesos,
Akka, Cassandra, Kafka)
Articles & references
● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
● https://dcos.io/
● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
● http://spark.apache.org/
● https://flink.apache.org/
● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-big-data-analytics-frameworks
● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing
● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Thank you, Q&A?
@mmatloka
http://www.slideshare.net/softwaremill
https://softwaremill.com/blog/