extracting twitter data using apache flume

Extracting Twitter Data using Apache Flume, by Bharat Khanna, Talend ETL Developer

Upload: bharat-khanna

Post on 23-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Extracting twitter data using apache flume

Extracting Twitter Data using Apache Flume

By Bharat Khanna

Talend ETL Developer

Page 2: Extracting twitter data using apache flume

What you need

• Hortonworks Hadoop cluster: HDP 1.3
• Oracle VirtualBox
• PuTTY
• WinSCP
• Maven (for creating flume-snapshot.jar)

Page 3: Extracting twitter data using apache flume

What is Flume?

• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

Page 4: Extracting twitter data using apache flume

Network Settings at Oracle Virtual Box

Page 5: Extracting twitter data using apache flume

Network Settings at Oracle Virtual Box Contd..

Page 6: Extracting twitter data using apache flume

Getting Started

• Run your Hadoop cluster in VirtualBox. Once it has started, make sure you can connect to HDFS from your Windows host machine by opening an address such as http://192.168.56.101:8000.

• You can get this IP address by running the ifconfig command in your Hadoop cluster once it has started.

Page 7: Extracting twitter data using apache flume

File Browser using HUE

• Your HDFS interface, as seen from the host machine, may look like the one below:

Page 8: Extracting twitter data using apache flume

Setting your bash_profile in PuTTY

• It is important to set environment variables by editing bash_profile in your home directory. You can edit it with the command “vi .bash_profile” (you need the dot before bash_profile because the file is hidden by default). Exclude Maven_Home below for now.
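
For reference, the entries in .bash_profile could look like the sketch below. The Hadoop and Flume locations are the ones used elsewhere in these slides. The JAVA_HOME value is only a placeholder, and the variable name FLUME_HOME is an assumption, so adjust both to your VM. The Maven entry is commented out because Maven is installed in a later step.

```shell
# ~/.bash_profile (sketch; adjust the paths to your VM)

# Placeholder, not from the slides: point this at your JDK 1.6 install.
export JAVA_HOME="/path/to/your/jdk"

# Locations used elsewhere in these slides for the HDP 1.3 VM.
export HADOOP_HOME="/usr/lib/hadoop"
export FLUME_HOME="/etc/flume/apache-flume-1.6.0-bin"   # variable name is an assumption

# Maven is added after it has been installed; left out for now.
# export M2_HOME="$HOME/maven"

# Put the Java, Hadoop and Flume commands on the PATH.
export PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$FLUME_HOME/bin:$PATH"
```

After saving the file, log in again (or run “source ~/.bash_profile”) so the variables take effect.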

Page 9: Extracting twitter data using apache flume

Creating Flume Snapshot.jar

• This jar contains the libraries Flume needs for this example to work. It can either be downloaded from the web or built ourselves; building it ourselves is best.

• You need Maven for this. If your Java version is 1.6, as it is in Hortonworks HDP 1.3, download the archived Maven version 3.0.5 from http://archive.apache.org/dist/maven/maven-3/; otherwise use any recent version.

Page 10: Extracting twitter data using apache flume

Creating Flume Snapshot.jar Contd..

• Once downloaded, unzip the folder in Windows and transfer it to your Hortonworks cluster using WinSCP.

• Create a link to the folder with the command “ln -s apache-maven-3.0.5 maven” in your home directory.

• Set the path of this link in your bash_profile as shown in slide 8.

• After saving your bash_profile, log off and log in again to the Unix session so that the changes take effect. Run the command “mvn -version” to check that Maven is working.

Page 11: Extracting twitter data using apache flume

Creating Flume Snapshot.jar Contd..

• Download Cloudera’s Twitter code zip file from https://github.com/cloudera/cdh-twitter-example.

• Unzip it and transfer it to your home directory in the Hortonworks cluster using WinSCP.

• Go to the flume-sources folder under cdh-twitter-example-master and run the command “mvn package” to build the flume snapshot.jar file. The file can be found under the target folder in the same directory.
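
The build step can be wrapped in a small shell function, sketched below. The function name and its argument are illustrative and not part of the Cloudera project. The sketch assumes that mvn is already on your PATH and that the zip was extracted to the directory you pass in.

```shell
# build_flume_sources DIR
# DIR is the extracted cdh-twitter-example-master folder.
build_flume_sources() {
    project_dir="$1"
    if [ ! -d "$project_dir/flume-sources" ]; then
        echo "flume-sources not found under $project_dir" >&2
        return 1
    fi
    # Run the Maven build in a subshell so the caller's directory is unchanged.
    ( cd "$project_dir/flume-sources" && mvn package ) || return 1
    # The snapshot jar is written to the target folder.
    ls "$project_dir"/flume-sources/target/*.jar
}

# Example: build_flume_sources "$HOME/cdh-twitter-example-master"
```

Checking for the folder first gives a clearer message than a failed cd if the zip was unpacked somewhere else.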

Page 12: Extracting twitter data using apache flume

Configuring Flume

• Transfer flume-sources-1.0-SNAPSHOT.jar to Flume's lib directory, which is /etc/flume/apache-flume-1.6.0-bin/lib on the Hortonworks 1.3 VM.

• Flume’s configuration directory can be found at /etc/flume/apache-flume-1.6.0-bin/conf.

• Open the flume-env.sh.template file in the vi editor, set JAVA_HOME to the path defined in the bash_profile, and set FLUME_CLASSPATH to the path of flume-snapshot.jar in double quotes.

• Rename flume-env.sh.template to flume-env.sh using the mv command.
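
As a rough illustration, the two settings in flume-env.sh might end up looking like this. The JAVA_HOME value is a placeholder (an assumption) and should match whatever your bash_profile uses. The jar path assumes the snapshot jar was copied into the lib directory named above.

```shell
# flume-env.sh (sketch)

# Same value as JAVA_HOME in .bash_profile; placeholder path, not from the slides.
export JAVA_HOME="/path/to/your/jdk"

# Path of the snapshot jar copied into Flume's lib directory, in double quotes.
export FLUME_CLASSPATH="/etc/flume/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
```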

Page 13: Extracting twitter data using apache flume

Configuring Flume contd..

• You also need to transfer the following jar files to Flume's lib folder.

Jar                                    From directory
hadoop-core.jar                        HADOOP_HOME i.e. /usr/lib/hadoop
hadoop-client-1.2.0.1.3.0.0-107.jar    HADOOP_HOME i.e. /usr/lib/hadoop
jets3t-0.6.1.jar                       /usr/lib/hadoop/lib
commons-httpclient-3.0.1.jar           /usr/lib/hadoop/lib
commons-configuration-1.6.jar          /usr/lib/hadoop/lib
commons-codec-1.4.jar                  /usr/lib/hadoop/lib
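
The jars listed above can be copied in one go with a small helper like the following. This is only a sketch: the function name is made up for illustration, and it copies rather than moves the files so that the Hadoop installation is left intact. Jars that are not found are reported and skipped.

```shell
# copy_hadoop_jars FLUME_LIB HADOOP_HOME
# Copies the jars listed above from HADOOP_HOME into Flume's lib folder.
copy_hadoop_jars() {
    flume_lib="$1"
    hadoop_home="$2"
    for jar in \
        "$hadoop_home/hadoop-core.jar" \
        "$hadoop_home/hadoop-client-1.2.0.1.3.0.0-107.jar" \
        "$hadoop_home/lib/jets3t-0.6.1.jar" \
        "$hadoop_home/lib/commons-httpclient-3.0.1.jar" \
        "$hadoop_home/lib/commons-configuration-1.6.jar" \
        "$hadoop_home/lib/commons-codec-1.4.jar"
    do
        if [ -f "$jar" ]; then
            cp "$jar" "$flume_lib/"
        else
            echo "missing: $jar" >&2
        fi
    done
}

# Example: copy_hadoop_jars /etc/flume/apache-flume-1.6.0-bin/lib /usr/lib/hadoop
```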

Page 14: Extracting twitter data using apache flume

Creating Twitter App

• Go to dev.twitter.com and click on create a new app.

• Give your name and description; the website can be something like http://yourdomain.com.

• After creating the app, go to Keys and Access Tokens and create your consumer key, consumer secret, access token and access token secret.

• Make a note of them, as you need them in subsequent steps.

Page 15: Extracting twitter data using apache flume

Creating conf file

• Go to the folder /etc/flume/apache-flume-1.6.0-bin/conf and open a new file named twitter.conf.

• A sample image of it is shown in the next slide. You need to insert the consumer key, consumer secret, access token and access token secret that you got in the previous step.

• Then you need to enter the keywords for which you want to analyze the data.

• Finally, you need to give your HDFS path, which you can get from fs.default.name in the core-site.xml file under Hadoop_Home/conf, i.e. /usr/lib/hadoop/conf.
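
In text form, a conf file along these lines could be written with the snippet below. It is a sketch modelled on the sample configuration in Cloudera's cdh-twitter-example project, not a copy of the slide image. The agent name TwitterAgent matches the run command used later. The key values, keywords, HDFS host, port and the /user/flume/tweets directory are all placeholders or assumptions to replace with your own.

```shell
# Write a sample agent configuration; CONF_FILE defaults to ./twitter.conf.
CONF_FILE="${CONF_FILE:-./twitter.conf}"

cat > "$CONF_FILE" <<'EOF'
# Agent layout: one source, one channel, one sink.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source from the flume-sources snapshot jar.
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
# Example keywords only; list the terms you want to analyze.
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# HDFS sink; take host and port from fs.default.name in core-site.xml.
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://NAMENODE_HOST:PORT/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# In-memory channel between the source and the sink.
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
EOF

echo "wrote $CONF_FILE"
```

To write straight into Flume's conf directory, set CONF_FILE to the full path of the file before running the snippet.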

Page 16: Extracting twitter data using apache flume
Page 17: Extracting twitter data using apache flume

Checks before running Flume - Setting Timezone

• Make sure that the time shown in your VM matches the time on your local machine. If it does not, you need to reset the time as shown below. You can check the time in your VM with the “date” command.

• If your timezone already matches, you can skip the next 2 steps.

• The time zone is controlled by the /etc/localtime file. You can check the list of available timezones under the /usr/share/zoneinfo/ directory.

• cd /etc

• ln -sf /usr/share/zoneinfo/US/Eastern localtime (the -f option replaces an existing localtime file; without it the command fails if the file is already there)

Page 18: Extracting twitter data using apache flume

Checks before running Flume - Setting Oracle VirtualBox Properties

• You need to make sure that the time you set in the VM in the previous step stays correct. For that, you need to set the following properties in VirtualBox.

• In Windows, open a command prompt in the VirtualBox folder: go to C:\Program Files\Oracle, click the VirtualBox folder to select it, then hold the left Shift key, right-click, and select "Open command window here". A command prompt should now be running.

Page 19: Extracting twitter data using apache flume

Checks before running Flume - Setting Oracle VirtualBox Properties Contd..

• Run the following commands in the command prompt, replacing ${VMNAME} with the name of your VM.

VBoxManage setextradata ${VMNAME} "VBoxInternal/Devices/VMMDev/0/Config/GetHostTimeDisabled" 1
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-interval" 10000
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-min-adjust" 100
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-on-restore" 1
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 1000

Page 20: Extracting twitter data using apache flume

Running Flume

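• Go to Flume's bin directory and run the Flume agent using the following command. The -c option takes the configuration directory, so the full path is given here:

• flume-ng agent -n TwitterAgent -c /etc/flume/apache-flume-1.6.0-bin/conf -f /etc/flume/apache-flume-1.6.0-bin/conf/twitter.conf

• After some time, you should start getting files like the ones below under the directory specified in the conf file.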

Page 21: Extracting twitter data using apache flume

Error Catalog

• You may face the following frequently occurring errors while running Flume.

• Apache Flume error :- java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;

Fix :- This happens because FilterQuery.class occurs in two different jars (one of which will be flume-snapshot.jar). You can search for the clashing jars by running the command “find . -name "*.jar" | xargs grep FilterQuery.class” in Flume's lib directory. Rename the other jar by adding the suffix .org to its name.
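
The search for clashing jars can also be written as a small function that prints only the matching file names. This is a sketch with a made-up function name. It relies on the fact that a jar stores the names of its entries as plain text, so grep can find them without unpacking the archive.

```shell
# jars_containing DIR CLASS_PATH
# Prints every .jar under DIR whose contents mention CLASS_PATH.
jars_containing() {
    # -l prints only file names, -F treats the pattern as a fixed string.
    find "$1" -name '*.jar' -exec grep -lF -- "$2" {} +
}

# Example, run from Flume's lib directory:
#   jars_containing . twitter4j/FilterQuery.class
# Then rename the jar that is not the snapshot jar (file name is hypothetical):
#   mv clashing.jar clashing.jar.org
```

Searching for twitter4j/FilterQuery.class rather than just FilterQuery.class narrows the match to the twitter4j package.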

Page 22: Extracting twitter data using apache flume

Error Catalog Contd..

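• Apache Flume error :- java.io.IOException: Callable timed out after 10000 ms on file:

Fix :- This can happen when there are too many connections to Twitter from your account. Just wait for some time and try again. The 10000 ms in the message corresponds to the default of the HDFS sink's hdfs.callTimeout setting, so if the error persists, raising that value in the conf file may also help.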