Extracting Twitter Data using Apache Flume

By Bharat Khanna

Talend ETL Developer

What you need

• Hortonworks Hadoop cluster: HDP 1.3
• Oracle VirtualBox
• PuTTY
• WinSCP
• Maven (for creating flume-snapshot.jar)

What is Flume?

• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

Network Settings at Oracle Virtual Box

Getting Started

• Start your Hadoop cluster in VirtualBox. Once it is running, make sure you can reach HDFS from your Windows host machine at an address such as http://192.168.56.101:8000.

• You can find this IP address by running the ifconfig command on the Hadoop cluster VM once it has started.

File Browser using Hue

• From the host machine, you reach your HDFS interface through the Hue File Browser.

Setting your .bash_profile in PuTTY

• It is important to set your environment variables in .bash_profile, which you can edit with the command “vi .bash_profile” in your home directory. (The leading dot is required, because the file is hidden by default.) Leave out Maven_Home for now; it is added once Maven is installed.
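The screenshot of the profile is not reproduced here, so the following is only a sketch of what such a .bash_profile might contain. The JAVA_HOME value is a placeholder assumption (use the real JDK directory on your VM), and the variable name FLUME_HOME is also an assumption; the Hadoop and Flume locations are the ones mentioned elsewhere in this deck.

```shell
# ~/.bash_profile (sketch; adjust every path to your own VM)

# Assumption: placeholder JDK location, not taken from the slides.
export JAVA_HOME="/path/to/your/jdk"

# Locations used elsewhere in this deck.
export HADOOP_HOME="/usr/lib/hadoop"
export FLUME_HOME="/etc/flume/apache-flume-1.6.0-bin"

# Append the tool directories so their commands can be run from any folder.
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$FLUME_HOME/bin"
```

After saving the file, “source ~/.bash_profile” (or a fresh login) applies the settings to the current session.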

Creating Flume Snapshot.jar

• This jar contains libraries that Flume needs for this setup. You can download a prebuilt copy, but it is better to build it yourself.

• You need Maven for this. If your Java version is 1.6, as it is in Hortonworks HDP 1.3, download the archived Maven 3.0.5 from http://archive.apache.org/dist/maven/maven-3/; otherwise use any recent version.

Creating Flume Snapshot.jar Contd..

• Once downloaded, unzip the archive in Windows and transfer the folder to your Hortonworks cluster using WinSCP.

• In your home directory, create a link to the folder with the command “ln -s apache-maven-3.0.5 maven”.

• Add the path of this link to your .bash_profile, as described in the earlier slide on setting your .bash_profile.

• Save your .bash_profile, then log off and log in again to your Unix session so the changes take effect. Run the command “mvn -version” to check that Maven is working.
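As a rough sketch, the Maven lines added to .bash_profile could look like the following. The variable name MAVEN_HOME and the assumption that the “maven” link sits directly in your home directory are illustrative; adjust them to your own layout.

```shell
# Lines to append to ~/.bash_profile (sketch).
# Assumption: the "maven" link created with ln -s lives in your home directory.
export MAVEN_HOME="$HOME/maven"

# Make the mvn command available from any folder.
export PATH="$PATH:$MAVEN_HOME/bin"
```

Once you have logged in again, “mvn -version” should report the Maven version if the path is correct.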

Creating Flume Snapshot.jar Contd..

• Download the zip file of Cloudera’s Twitter example code from https://github.com/cloudera/cdh-twitter-example.

• Unzip it and transfer it to your home directory on the Hortonworks cluster using WinSCP.

• Go to the flume-sources folder under cdh-twitter-example-master and run the command “mvn package” to build the Flume snapshot jar (flume-sources-1.0-SNAPSHOT.jar). The jar is written to the target folder in the same directory.
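The build step can be sketched as a small shell function. The name build_flume_sources is a hypothetical helper, and the default directory assumes the example code was unzipped directly into your home directory.

```shell
# Sketch: build the Flume source jar from the unzipped Cloudera example.
# build_flume_sources is a hypothetical helper name.
# Assumption: the code sits in $HOME/cdh-twitter-example-master by default.
build_flume_sources() (
    # The body runs in a subshell, so the caller's directory is unchanged.
    cd "${1:-$HOME/cdh-twitter-example-master/flume-sources}" || exit 1
    mvn package || exit 1
    # Show the jar produced by the build.
    ls -l target/flume-sources-1.0-SNAPSHOT.jar
)
```

Calling “build_flume_sources” with no argument uses the default directory; pass another path as the first argument if you unzipped the code elsewhere.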


Configuring Flume

• Transfer flume-sources-1.0-SNAPSHOT.jar to Flume's lib directory, which is /etc/flume/apache-flume-1.6.0-bin/lib on the Hortonworks 1.3 VM.

• Flume’s configuration directory can be found at /etc/flume/apache-flume-1.6.0-bin/conf.

• Open the flume-env.sh.template file in the vi editor. Set JAVA_HOME to the path defined in your .bash_profile, and set FLUME_CLASSPATH to the path of the Flume snapshot jar, in double quotes.

• Rename flume-env.sh.template to flume-env.sh using the mv command.
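A minimal sketch of the two settings in flume-env.sh follows. The JAVA_HOME value is a placeholder assumption; the classpath assumes the jar was copied into the lib directory named above.

```shell
# flume-env.sh (sketch)

# Assumption: placeholder JDK path; use the JAVA_HOME from your .bash_profile.
export JAVA_HOME="/path/to/your/jdk"

# Path of the snapshot jar copied into Flume's lib directory, in double quotes.
FLUME_CLASSPATH="/etc/flume/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
```

The rename itself is “mv flume-env.sh.template flume-env.sh”, run inside the conf directory.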

Configuring Flume contd..

• You also need to transfer the following jar files to the Flume lib folder.

| Jar | From directory |
|---|---|
| hadoop-core.jar | HADOOP_HOME, i.e. /usr/lib/hadoop |
| hadoop-client-1.2.0.1.3.0.0-107.jar | HADOOP_HOME, i.e. /usr/lib/hadoop |
| jets3t-0.6.1.jar | /usr/lib/hadoop/lib |
| commons-httpclient-3.0.1.jar | /usr/lib/hadoop/lib |
| commons-configuration-1.6.jar | /usr/lib/hadoop/lib |
| commons-codec-1.4.jar | /usr/lib/hadoop/lib |
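Copying the jars listed above can be sketched as one loop. The function name copy_flume_deps is a hypothetical helper, not part of Flume or Hadoop; its two optional arguments default to the Hadoop and Flume locations used in this deck.

```shell
# Sketch: copy the jars listed above into Flume's lib directory.
# copy_flume_deps is a hypothetical helper name.
copy_flume_deps() {
    hadoop_home="${1:-/usr/lib/hadoop}"
    flume_lib="${2:-/etc/flume/apache-flume-1.6.0-bin/lib}"
    for jar in \
        "$hadoop_home/hadoop-core.jar" \
        "$hadoop_home/hadoop-client-1.2.0.1.3.0.0-107.jar" \
        "$hadoop_home/lib/jets3t-0.6.1.jar" \
        "$hadoop_home/lib/commons-httpclient-3.0.1.jar" \
        "$hadoop_home/lib/commons-configuration-1.6.jar" \
        "$hadoop_home/lib/commons-codec-1.4.jar"
    do
        # Stop at the first jar that cannot be copied.
        cp "$jar" "$flume_lib/" || return 1
    done
}
```

Running “copy_flume_deps” with no arguments uses the default paths; you may need sufficient permissions to write to the Flume lib directory.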

Creating Twitter App

• Go to dev.twitter.com and click on “Create a new app”.

• Enter a name and description; the website can be something like http://yourdomain.com.

• After creating the app, go to “Keys and Access Tokens” and create your consumer key, consumer secret, access token and access token secret.

• Make a note of them, as you need them in subsequent steps.

Creating conf file

• Go to the folder /etc/flume/apache-flume-1.6.0-bin/conf and create a new file named twitter.conf.

• A sample image of it is shown on the next slide. You need to insert the consumer key, consumer secret, access token and access token secret that you got in the previous step.

• Then enter the keywords for which you want to analyze the data.

• Finally, give your HDFS path. You can get it from fs.default.name in the core-site.xml file under HADOOP_HOME/conf, i.e. /usr/lib/hadoop/conf.
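Since the sample image is not reproduced here, the following sketch writes a conf file modeled on the flume.conf that ships with the cdh-twitter-example repository. Everything in angle brackets is a placeholder to replace; the keywords, the /user/flume/tweets path and the channel sizes are illustrative assumptions, and the file is written to the current directory.

```shell
# Sketch: write a Flume agent config for the Twitter source.
# Assumption: output goes to ./twitter.conf unless CONF_FILE is set.
conf_file="${CONF_FILE:-twitter.conf}"

cat > "$conf_file" <<'EOF'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source from the Cloudera example jar; replace the placeholders.
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
# Example keywords only; list the terms you want to analyze.
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# HDFS sink; host and port come from fs.default.name in core-site.xml.
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://<namenode-host>:<port>/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# In-memory channel between source and sink.
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
EOF
```

The agent name TwitterAgent has to match the name passed to flume-ng with the -n option when the agent is started.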

Checks before running Flume - Setting Timezone

• Make sure that the time shown in your VM matches the time on your local machine. If it does not, reset it as shown below. You can check the time in your VM with the “date” command.

• If your time zone already matches, you can skip the next two steps.

• The time zone is controlled by the /etc/localtime file. The available time zones are listed under the /usr/share/zoneinfo/ directory. If /etc/localtime already exists, move it aside first, because ln -s does not overwrite an existing file.

• cd /etc

• ln -s /usr/share/zoneinfo/US/Eastern localtime

Checks before running Flume - Setting Oracle VirtualBox Properties

• You need to make sure that the time you set in the VM in the previous step stays correct. For that, set the following properties in VirtualBox.

• In Windows, open a command prompt in the VirtualBox installation folder: go to the C:\Program Files\Oracle folder, click the VirtualBox folder to select it, then hold the left Shift key, right-click, and select "Open command window here". The command prompt should now be running.

Checks before running Flume - Setting Oracle VirtualBox Properties Contd..

• Run the following commands in the command prompt, replacing ${VMNAME} with the name of your VM.

VBoxManage setextradata ${VMNAME} "VBoxInternal/Devices/VMMDev/0/Config/GetHostTimeDisabled" 1
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-interval" 10000
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-min-adjust" 100
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-on-restore" 1
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 1000

Running Flume

• Go to the Flume bin directory and run the Flume agent using the following command:

• flume-ng agent -n TwitterAgent -c conf -f /etc/flume/apache-flume-1.6.0-bin/conf/twitter.conf

• After some time, files should start appearing in the directory specified in the conf file.
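One way to check for the output files is to list the target directory with the hadoop command. In this sketch, list_tweet_files is a hypothetical helper name, and /user/flume/tweets is only an assumed stand-in for the HDFS path in your conf file.

```shell
# Sketch: list the files Flume has written to HDFS.
# list_tweet_files is a hypothetical helper name.
# Assumption: /user/flume/tweets stands in for the path in your conf file.
list_tweet_files() {
    hadoop fs -ls "${1:-/user/flume/tweets}"
}
```

Pass your own HDFS directory as the first argument, for example “list_tweet_files /your/hdfs/path”.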

Error Catalog

• You may face the following frequently occurring errors while running Flume.

• Apache Flume error: java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;

Fix: This happens because FilterQuery.class occurs in two different jars (one of which is the Flume snapshot jar). You can search for the clashing jars by running the command “find . -name "*.jar" | xargs grep FilterQuery.class” in Flume's lib directory. Rename the other jar by adding the suffix .org to its name.
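A variant of that search, sketched as a function that prints only the names of the matching jars, may be easier to read. The name find_jars_with_class is a hypothetical helper; it relies on class file names being stored as plain text inside a jar, which is why a plain grep can find them.

```shell
# Sketch: print the jars under a directory that contain a given class file.
# find_jars_with_class is a hypothetical helper name.
find_jars_with_class() {
    lib_dir="${1:-.}"
    class_file="${2:-FilterQuery.class}"
    # grep -l prints only file names; -F treats the class name as a fixed string.
    find "$lib_dir" -name '*.jar' -exec grep -lF -- "$class_file" {} +
}
```

Run it as “find_jars_with_class /etc/flume/apache-flume-1.6.0-bin/lib”. If two jars are listed, rename the one that is not the snapshot jar, for example “mv <other-jar>.jar <other-jar>.jar.org”, where <other-jar> is whatever name the search reports.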

Error Catalog Contd..

• Apache Flume error: java.io.IOException: Callable timed out after 10000 ms on file:

Fix: This happens because of too many connections to Twitter from your account. Wait for some time and try again.
