Installation and Setup Spark

DIPENDRA KUSI
https://www.linkedin.com/in/er-dipendra-kusi-b3674193
2/11/17 SPARK SETUP
Posted 21-Feb-2017

Step 1: First set up Cloudera

Step 2: Open a terminal in Cloudera and start Spark:

/usr/bin/spark-shell

Step 3: After Spark starts, we can write Scala commands that execute in Spark through the Spark context.

Now read the input file from HDFS:

val dt = sc.textFile("/user/cloudera/project_data/input")

We can put a file into HDFS beforehand using:

hadoop fs -put file0 /user/cloudera/project_data/input

Step 4: Now we will split the text content on whitespace and then count each word:

val wordcount = dt.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((a, b) => a + b)

Step 5: Now print the result:

for(value <- wordcount) {println(value)}
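The word count above can be sanity-checked without Spark: the same flatMap / map / reduceByKey pipeline has a direct analogue on plain Scala collections, with groupBy plus a sum standing in for reduceByKey. A minimal sketch (the sample lines and the wordCount helper name are made up for illustration):

```scala
// Word count over a plain Scala collection, mirroring the RDD pipeline:
// flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(_ + _)
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" "))    // tokenize each line on spaces
    .map(word => (word, 1))   // pair each word with a count of 1
    .groupBy(_._1)            // group pairs by word (reduceByKey analogue)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

val counts = wordCount(Seq("spark setup spark", "setup done"))
counts.foreach(println)
```

The groupBy-then-sum step does in one pass on the driver what reduceByKey does in parallel across partitions.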

Integrate Spark with Eclipse:

Step 1: First install the Scala plugin in Eclipse.

Go to Help -> Eclipse Marketplace

Step 2: Now search for the Scala plugin and install it

Click Install

Click Confirm

Then accept the license and install

Step 3: Now check whether the Scala plugin is installed in Eclipse

Go to New -> Other... -> type "scala"

If "Scala App" appears in the list, the Scala plugin is installed.

Step 4: Now create a Maven project

Go to New -> Other... -> type "maven project" -> Next -> Next -> Next

Step 5: Now give the project coordinates:

Group Id: edu.sparkproject

Artifact Id: WordCount

Click Finish

Step 6: Now go to the pom.xml file and edit the dependencies to add Spark

Step 7: Now copy the code below and paste it into pom.xml

Link: http://pastebin.com/V5n0hM5P

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.scalaproject</groupId>
  <artifactId>scalaproject</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <repositories>
    <repository>
      <id>pele.farmbio.uu.se</id>
      <url>http://pele.farmbio.uu.se/artifactory/libs-snapshot</url>
    </repository>
  </repositories>
  <dependencies>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- mixed scala/java compile -->
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <id>compile</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <phase>compile</phase>
          </execution>
          <execution>
            <id>test-compile</id>
            <goals>
              <goal>testCompile</goal>
            </goals>
            <phase>test-compile</phase>
          </execution>
          <execution>
            <phase>process-resources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <!-- for fatjar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>

        <executions>
          <execution>
            <id>assemble-all</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>fully.qualified.MainClass</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <!-- This plugin's configuration is used to store Eclipse m2e settings only.
             It has no influence on the Maven build itself. -->
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <versionRange>[2.15.2,)</versionRange>
                    <goals>
                      <goal>compile</goal>
                      <goal>testCompile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <execute></execute>
                  </action>
                </pluginExecution>
              </pluginExecutions>

            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

Now save it. Maven will download all the dependencies.

Step 8: Now convert the project into a Scala project

First delete the src/test/java folder

Then fix the remaining error by clicking Quick Fix and OK; the error will disappear.

Step 9: Now convert the project to Scala Nature

Step 10: Right-click on the project -> Properties

Step 11: Now go to Scala Compiler -> tick Use Project Settings -> select Fixed Scala Installation: 2.10.6 -> Apply -> OK

(Spark 1.6 only supports Scala 2.10, so the project's Scala version must match the one Spark runs on.)

Step 12: Then go to Java Build Path -> remove the Scala Library Container

(spark-core already brings in the Scala library, so there is no need to have it here.)

Now rename the package to scala

Step 13: Now add the Scala Object File

Give the Scala object the name: Count

Step 14: Now copy the code from the link below and paste it into the Word.scala file

Link: http://pastebin.com/XNpbcJ2z

package com.scalaproject.scalaproject

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.nio.file.{Paths, Files}
import java.io._
import org.apache.commons.io.FileUtils
import org.apache.commons.io.filefilter.WildcardFileFilter
import scala.collection.immutable

object WordCount {
  def main(args: Array[String]) = {
    // Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val test = sc.textFile("input.txt")
    test.flatMap(x => x.split("\\s+"))
      .map(x => (x, 1))
      .reduceByKey((a, b) => a + b)
      .saveAsTextFile("output")

    // Stop the Spark context
    sc.stop
  }

  def splitting(v: String): Array[String] = {
    v.split(" ")
  }
}
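One detail worth noting in the code above: main tokenizes with split("\\s+") while the splitting helper uses split(" "). These behave differently when words are separated by runs of whitespace. A quick plain-Scala illustration (the sample string is made up):

```scala
// split(" ") breaks only on single spaces, so a run of spaces yields empty tokens;
// split("\\s+") treats any whitespace run (spaces, tabs) as one separator.
val bySpace = "spark  setup".split(" ")      // Array("spark", "", "setup")
val byRegex = "spark  setup".split("\\s+")   // Array("spark", "setup")
println(bySpace.mkString("|"))
println(byRegex.mkString("|"))
```

With the regex form, the empty tokens never reach the (word, 1) pairing, which is why main uses "\\s+".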

Step 15: Now add the input.txt file as the input file to be processed.

Add some text to the input.txt file so that we can process it.

Step 16: Now run the code

Step 17: Refresh the project.

You will see an output folder in the project; inside it there will be part-00000 files that contain the output.
