apache spark - yandex
TRANSCRIPT
![Page 1: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/1.jpg)
1
![Page 2: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/2.jpg)
2
Apache Spark
Егор Пахомов Я.Субботник в Минске, 30.08.2014
![Page 3: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/3.jpg)
3
О себе
! Участвовал в создании Apache Solr based search engine для крупного западного e-commerce
! Разрабатывал Яндекс.Острова последний год
! Spark Contributor
![Page 4: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/4.jpg)
4
Сложности больших данных
! Файлы слишком большие, чтобы разместить на одной машине
! Суперкомпьютеры дорогие и трубуют специальной экспертизы
! В случае поломки одной машины → перепрогонять весь алгоритм
![Page 5: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/5.jpg)
5
Hadoop придет на помощь!
![Page 6: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/6.jpg)
6
HDFS как хранилище файлов
HDSF Client NameNode Secondary
NameNode
DataNode
DataNode
DataNode
DataNode
DataNode
![Page 7: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/7.jpg)
7
Модель map-reduce
![Page 8: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/8.jpg)
8
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }
![Page 9: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/9.jpg)
9
Что сделал Hadoop
! Fault tolerance ! Data locality
! Новая вычислительная модель
![Page 10: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/10.jpg)
10
Проблемы
! Тяжело реализовать алгоритм в map-reduce моделе ! Очень простой алгоритм требует как минимум 3 класса
! Частая запись на диск
![Page 11: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/11.jpg)
11
RDD
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012.
![Page 12: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/12.jpg)
12
Word count
val wordCounts = spark .textFile("/user/usr/in") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _)
![Page 13: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/13.jpg)
13
public class HadoopAnalyzer { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { JobConf conf = new JobConf(HadoopAnalyzer.class); conf.setJobName("wordcount"); System.out.println("Executing job."); Job job = new Job(conf, "job"); job.setInputFormatClass(InputFormat.class); job.setOutputFormatClass(OutputFormat.class); job.setJarByClass(HadoopAnalyzer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); TextInputFormat.addInputPath(job, new Path("/user/usr/in")); TextOutputFormat.setOutputPath(job, new Path("/user/usr/out")); job.setMapperClass(Mapper.class); job.setReducerClass(Reducer.class); job.waitForCompletion(true); System.out.println("Done."); } }
Whole program comparison
![Page 14: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/14.jpg)
14
package ru.yandex.spark.examples import org.apache.spark.{SparkConf, SparkContext} object WordCount { def main(args: Array[String]) { val conf = new SparkConf() conf.setAppName("serp-api") conf.setMaster("yarn-client") val spark = new SparkContext(conf) spark.addJar(SparkContext.jarOfClass(this.getClass).get) val wordCounts = spark .textFile("/user/usr/in") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) wordCounts.saveAsTextFile("/user/usr/out") spark.stop() } }
Whole program comparison
![Page 15: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/15.jpg)
15
Actions
def collect(): Array[T] def saveAsTextFile(path: String) def toLocalIterator: Iterator[T]
Transformations
def filter(f: T => Boolean): RDD[T] def map[U: ClassTag](f: T => U): RDD[U] def foreachPartition(f: Iterator[T] => Unit) def zipWithIndex(): RDD[(T, Long)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] def reduceByKey(func: (V,V) => V,numPartitions: Int):RDD[(K, V)]
In one partition(narrow)
Cross-partition(wide)
![Page 16: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/16.jpg)
16
Скорость
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
Hadoop
Spark
Количество итераций
Время
(с)
![Page 17: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/17.jpg)
17
Языки
Python file = spark.textFile("hdfs://...") counts = file.flatMap(lambda line: line.split(" ")) \ .map(lambda word: (word, 1)) \ .reduceByKey(lambda a, b: a + b) counts.saveAsTextFile("hdfs://...")
Scala val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
![Page 18: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/18.jpg)
18
Языки
Java-8 JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" "))); JavaPairRDD<String, Integer> counts = words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) .reduceByKey((x, y) -> x + y); counts.saveAsTextFile("hdfs://counts.txt");
![Page 19: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/19.jpg)
19
Языки
Java-7 JavaRDD<String> file = spark.textFile("hdfs://..."); JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } }); JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } }); JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer>() { public Integer call(Integer a, Integer b) { return a + b; } }); counts.saveAsTextFile("hdfs://...");
![Page 20: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/20.jpg)
20
SQL
Spark SQL
Spark Hadoop
![Page 21: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/21.jpg)
21
![Page 22: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/22.jpg)
22
Hive начинает использовать Spark в качестве execution engine
![Page 23: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/23.jpg)
23
Машинное обучение Spark Hadoop
SparkML
~ 13 algorithms ~ 24 algorithms
![Page 24: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/24.jpg)
24
Легко писать import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors // Load and parse the data val data = sc.textFile("data/kmeans_data.txt") val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))) // Cluster the data into two classes using KMeans val numClusters = 2 val numIterations = 20 val clusters = KMeans.train(parsedData, numClusters, numIterations) // Evaluate clustering by computing Within Set Sum of Squared Errors val WSSSE = clusters.computeCost(parsedData) println("Within Set Sum of Squared Errors = " + WSSSE)
![Page 25: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/25.jpg)
25
Алгоритмы
! Classification and regression – linear support vector machine (SVM) – logistic regression – linear least squares, Lasso, and ridge regression – decision tree – naive Bayes ! Collaborative filtering
– alternating least squares (ALS)
![Page 26: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/26.jpg)
26
Алгоритмы
! Clustering – k-means ! Dimensionality reduction
– singular value decomposition (SVD) – principal component analysis (PCA) ! Optimization
– stochastic gradient descent – limited-memory BFGS (L-BFGS)
![Page 27: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/27.jpg)
27
И Mahout начинает использовать Spark
![Page 28: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/28.jpg)
28
Graph Spark Hadoop
![Page 29: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/29.jpg)
29
Streaming batch processing Spark Hadoop
![Page 30: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/30.jpg)
30
Streaming batch processing
Spark streaming
Spark engine
Input data stream
Batches of input data
Batches of processed data
![Page 31: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/31.jpg)
31
![Page 32: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/32.jpg)
32
![Page 33: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/33.jpg)
33
Единая платформа для приложений Big Data
Batch Interac+ve Streaming
Hadoop Cassandra Mesos Cloud Providers
…
…
Единый API для выполнения различных задач в различных типах хранилищ и средах выполнения
![Page 34: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/34.jpg)
34
Контрибьютеры
200
400
600
800
1000
1200
1400
Сравнение с другими проектами
Spa
rk
Sto
rm
HD
FS
YAR
N
Map
Red
uce
50000
Spa
rk
Sto
rm
HD
FS
YAR
N
Map
Red
uce
100000
150000
200000
250000
300000
350000
Активность за последние 6 месяцев Коммиты Измененные строчки кода
![Page 35: Apache Spark - Yandex](https://reader030.vdocuments.mx/reader030/viewer/2022012505/618059339e4bf009fe0ad965/html5/thumbnails/35.jpg)
35
Спасибо за внимание!