

A

Synopsis Report on

Big Data Analytics of Various Growth Parameters for Asian Subcontinents

Submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science & Engineering

From

IMSEC, Ghaziabad

Dr. A.P.J. Abdul Kalam Technical University, Lucknow

(Session 2016-17)

Under the Supervision of:

Dr. S N Rajan

By:
Syed Ali Imtiyaz Rizvi (1314310108)
Nabeel Khan (1314310050)
Tarun Kumar (1314310112)


INDEX

1. Introduction
2. Brief Literature Survey
3. Problem Formulation
4. Objectives
5. Novelty
6. Methodology
7. Implementation
8. Facilities Required for the Proposed Project
9. References


INTRODUCTION

Agriculture is the backbone of a country's economic system. Growth in the agriculture sector contributes to marketable surplus, and a nation's export trade depends largely on its agricultural zone. A stable agriculture system ensures global food security. Agriculture systems are dynamic, managed ecosystems, and climate is their primary driver. Climate variability is expected to influence crop and livestock production, and the human response is critical to understanding and estimating the effects of climate change on production. Climate changes such as higher temperatures have been found to reduce the yield and quality of many crops. New approaches are therefore required to supply farmers with updated and relevant information, so as to support them in the decision-making process and to make them more resilient to climate change.

More than 60% of the Indian population still lives in rural areas, and for most of these people farming is the sole occupation and source of income.

We want to develop a prediction model that analyzes large historical data sets using big data analytics. Big data analytics is the process of examining large amounts of data coming from heterogeneous sources in a variety of heterogeneous formats.


The analysis is performed with Hadoop tools (Hive and Pig) and R. Since Hadoop does not provide a GUI (graphical user interface), we use the R language to present the resulting trends as pie charts, bar graphs, and histograms.


BRIEF LITERATURE SURVEY

S. Vikram Phaneendra & E. Madhusudhan Reddy illustrated that in earlier days data was small and easily handled by RDBMS, but that it has recently become difficult to handle huge data through RDBMS tools; such data is referred to as "big data". They explained that big data differs from other data in five dimensions: volume, velocity, variety, value, and complexity. They illustrated the Hadoop architecture, consisting of a name node, data nodes, edge nodes, and HDFS, for handling big data systems. The Hadoop architecture handles large data sets and scalable algorithms for log management; applications of big data are found in the financial, retail, health-care, mobility, and insurance industries.

Kiran Kumara Reddi & Dnvsl Indira enhanced our knowledge that big data is a combination of structured, semi-structured, and unstructured homogeneous and heterogeneous data. The authors suggested using a "nice" model to handle the transfer of huge amounts of data over the network. Under this model, such transfers are relegated to low-demand periods when there is ample idle bandwidth available; this bandwidth can then be repurposed for big data transmission without impacting other users of the system.

Jimmy Lin examined Hadoop, which is currently the large-scale data analysis "hammer" of choice, and noted that there exist classes of algorithms that are not "nails", in the sense that they are not particularly amenable to the MapReduce programming model. He focuses on the simple solution of finding alternative non-iterative algorithms that solve the same problem. The standard MapReduce model is well known and described in many places; each iteration of PageRank, for example, corresponds to a MapReduce job. The author discussed iterative graph algorithms, gradient descent, and EM iteration, which are typically implemented as Hadoop jobs with a driver that sets up each iteration and checks for convergence. The author's suggestion: if all you have is a hammer, throw away everything that's not a nail.

Wei Fan & Albert Bifet introduced big data mining as the capability of extracting useful information from large datasets or streams of data that, due to their volume, variability, and velocity, could not be mined before. The authors also noted that there is some controversy around big data. There are certain tools for processing big data, such as Hadoop, Storm, and Apache S4, and specific tools for big graph mining such as PEGASUS and GraphLab. There are also challenges that need to be dealt with, such as compression and visualization.

Albert Bifet stated that streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. The huge amount of data created every day is termed "big data". The tools used for mining big data include Apache Hadoop, Apache Pig, Cascading, Scribe, Storm, Apache HBase, Apache Mahout, MOA, R, etc. He concluded that our ability to handle many exabytes of data depends mainly on the existence of a rich variety of datasets, techniques, and software frameworks.

Bernice Purcell stated that big data is comprised of large data sets that cannot be handled by traditional systems. Big data includes structured, semi-structured, and unstructured data. The data storage techniques used for big data include multiple clustered network-attached storage (NAS) and object-based storage. The Hadoop architecture is used to process unstructured and semi-structured data, using MapReduce to locate all relevant data and then select only the data directly answering the query. The advent of big data has posed opportunities as well as challenges to business.

PROBLEM FORMULATION

1. Every country faces economic problems these days.

2. Analyzing agriculture, climate, health, and economic data is a powerful aid for governments in making strategic decisions.

3. Agriculture is one of the key factors in growing a country's economy.

4. A lot of research has been done in this domain, but no complete solution has yet been found. Several Hadoop techniques are in use for data analysis.

5. Hadoop alone does not provide a proper or easy GUI for decision making.

6. The experimental dataset was collected from World Bank data; for the convenience of the experiment, we select only structured and unstructured data.

7. Problems also arise when running queries, because Hive's latency is high.

8. The GUI problem is solved using R, but connecting Hive data with R also takes time.

9. Hive and Pig are slower at executing tasks than other languages.

10. After the analysis, it is hard to extract useful information by stating hypotheses and applying functions.

OBJECTIVES

The main objective of this project is to analyze and predict environmental changes in Asian countries and to assist farmers in adapting to climate-smart agriculture systems using a big data approach, thereby increasing farmers' income and productivity.

Specifically, we aim to analyze the economies of Asian countries on the basis of variation in their climate, agriculture, health, and education, so that techniques to improve agriculture can be adopted.

Nowadays, increased climate variability is challenging for farmers and is expected to influence crop and livestock productivity. New approaches are therefore required to supply farmers with updated and relevant information, to support them in the decision-making process and to make them more resilient to climate change.


NOVELTY

A comparative study of gross productivity parameters for the Asian subcontinent using Hadoop- and R-based big data analytics.

In this project we identify the factors responsible for an effective national economy and show how those factors are interrelated.


METHODOLOGY / WORK PLAN

1. Our first task is to collect data on factors that affect agriculture, such as irrigated agricultural land, fertilizer consumption, and arable land, and data on factors that affect climate, such as access to electricity, CO2 emissions, and greenhouse gas emissions.

2. After collecting the data, we arrange it according to the requirements so that we can use it efficiently.

3. Next, the BigInsights software is opened and the Hadoop services are started using the appropriate commands.

4. Hive queries are then used to load the data into the HDFS cluster.

5. Analysis is then performed by running Hive queries, and the output is saved as a CSV file on the local drive (see the sketch after this list).

6. The R tool is then used to represent the analyzed data as pie charts, bar graphs, and histograms.

7. These visualizations then serve as a way of drawing inferences from the data.
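As a minimal sketch of steps 4 and 5, the Hive statements below create a table over the collected indicators, load a local file into it, and write an aggregate result to a local directory as comma-delimited text. The table name, column names, and file paths are hypothetical placeholders, not taken from the project itself.

    -- Hypothetical table of country-level growth indicators.
    CREATE TABLE IF NOT EXISTS indicators (
      country      STRING,
      year         INT,
      arable_land  DOUBLE,
      co2_emission DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Step 4: load the collected CSV file into the HDFS-backed table.
    LOAD DATA LOCAL INPATH '/home/user/data/indicators.csv' INTO TABLE indicators;

    -- Step 5: run the analysis and save the result to the local drive as delimited text.
    INSERT OVERWRITE LOCAL DIRECTORY '/home/user/output/avg_arable'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT country, AVG(arable_land) FROM indicators GROUP BY country;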


Flow Chart


IMPLEMENTATION

The implementation process consists of four modules (Figure 2 shows the different modules):

1. Data Acquisition module
2. Data Storage module
3. Query Analysis module
4. Presentation module

1. Data Acquisition Module

This module collects data from various sources such as websites, weather forecasts, social media, and market trends. Data can be entered manually or acquired through data acquisition equipment. The small incoming data sets are first stored in an Oracle database; when a certain amount of small data has been gathered, it is transferred into the storage module, and the transferred data is then automatically deleted from the staging database.

2. Data Storage Module

This module is responsible for storing metadata and data sets with replicated copies, which provides a backup facility. HDFS acts as the storage container and is not limited to any particular type of data. Small data accumulated in the data acquisition module is placed into the storage module on a regular basis once it reaches a certain amount.
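As an illustrative sketch (the paths and replication factor are hypothetical), the standard HDFS shell commands below place a data file into the storage layer and set its replication count, which is what provides the backup facility described above:

    # Create a directory in HDFS and upload a local data file into it.
    hdfs dfs -mkdir -p /data/indicators
    hdfs dfs -put /home/user/data/indicators.csv /data/indicators/

    # Keep three replicated copies of the file as a backup.
    hdfs dfs -setrep 3 /data/indicators/indicators.csv

    # Verify the upload.
    hdfs dfs -ls /data/indicators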


3. Query Analysis Module

This module is the processing phase, which includes data reading/analysis and the establishment of forecast results. The data reading is done mainly by Hive. Hive is a framework for data warehousing on top of Hadoop; it was created to make it possible for analysts with strong SQL skills to run queries on the huge volumes of data stored in HDFS. Hive runs on workstations and converts SQL queries into a series of MapReduce jobs for execution on a Hadoop cluster. MapReduce is an execution engine suitable for large-scale data processing and can significantly improve the response speed for returning query results.
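One way to see this conversion is Hive's EXPLAIN statement, which prints the plan of MapReduce stages for a query instead of running it. The sketch below assumes the hypothetical indicators table introduced in the methodology section:

    -- Print the MapReduce stage plan Hive generates for an aggregate query.
    EXPLAIN
    SELECT country, AVG(co2_emission)
    FROM indicators
    GROUP BY country;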


Proposed system framework

4. Presentation Module

This module is responsible for presenting the analyzed data in the form of charts. It is a very important module: the analyst cannot readily interpret results sitting in a CSV file, so we need a language that can present the data in visual form, i.e. R.
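As a minimal sketch in base R (the file path and column names are hypothetical, assuming the delimited output of the Hive step has been copied to a .csv file), the code below reads the analyzed result and draws the kinds of charts described:

    # Read the CSV produced by the Hive analysis step.
    result <- read.csv("/home/user/output/avg_arable.csv",
                       header = FALSE, col.names = c("country", "avg_arable"))

    # Pie chart of average arable land by country.
    pie(result$avg_arable, labels = result$country,
        main = "Average arable land by country")

    # Bar graph of the same values.
    barplot(result$avg_arable, names.arg = result$country,
            main = "Average arable land by country", las = 2)

    # Histogram showing the distribution across countries.
    hist(result$avg_arable, main = "Distribution of average arable land",
         xlab = "Arable land (% of land area)")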


ER Diagram


FACILITIES REQUIRED FOR PROPOSED WORK

Hadoop

"Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model."


Hadoop is an open-source framework for storing and processing huge data sets on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. Hadoop is a highly scalable analytics platform for processing large volumes of structured and unstructured data. By large scale, we mean multiple petabytes of data spread across hundreds or thousands of physical storage servers and nodes.

Hadoop makes it possible to run applications on systems with thousands of nodes holding thousands of terabytes of data. Its distributed file system facilitates rapid data transfer among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

Hadoop, developed in 2005 and now an open-source platform managed under the Apache Software Foundation, uses a concept known as MapReduce that is composed of two different functions, map and reduce.

Another component, HDFS or the Hadoop Distributed File System, implements the data storage layer. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster.

Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, Pig, and Jaql.

Hive

"Data warehouse software built on top of Hadoop that facilitates querying and managing large data sets residing in distributed storage."

Hive was developed at Facebook in 2007. Hive is SQL-like: it lets SQL developers write Hive Query Language (HQL) statements, so HQL is limited to the commands it understands. HQL statements are broken down by the Hive service into MapReduce jobs and then executed across a Hadoop cluster. Hive is read-based and therefore not appropriate for transaction processing that involves a high percentage of write operations. HQL queries may also take a long time to execute (high latency), so Hive is not appropriate for applications that need very fast response times. Like an RDBMS, Hive has DML and DDL statements.
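As a hedged illustration of Hive DDL and DML (the database, table, and file names are hypothetical), note that these statements are compiled into MapReduce jobs rather than executed as transactions:

    -- DDL: create a database and a partitioned table.
    CREATE DATABASE IF NOT EXISTS growth;
    CREATE TABLE IF NOT EXISTS growth.health (
      country         STRING,
      life_expectancy DOUBLE
    )
    PARTITIONED BY (year INT);

    -- DML: load data into one partition, then query it.
    LOAD DATA LOCAL INPATH '/home/user/data/health_2014.csv'
      INTO TABLE growth.health PARTITION (year = 2014);
    SELECT country, life_expectancy FROM growth.health WHERE year = 2014;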

Pig

Pig, whose language is called Pig Latin, was initially developed at Yahoo. Pig is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs. Pig is designed to handle any kind of data.

Working with Pig involves four steps:


1. LOAD

2. TRANSFORM

3. DUMP

4. STORE

1. LOAD means adding to HDFS all the objects on which you want to work.

2. TRANSFORM provides the logic where all the data manipulation happens; here you can FILTER, JOIN, GROUP, or ORDER results. Without using DUMP or STORE, Pig will not generate output.

3. DUMP sends the output to the screen for display; for production use you can STORE the same data to a file. The DUMP command can be used anywhere in a program, which makes it very useful for debugging results. A minimal Pig Latin sketch of these steps follows.
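The sketch below walks through the LOAD/TRANSFORM/DUMP/STORE steps, assuming a hypothetical comma-separated indicators file already in HDFS:

    -- LOAD: read the input file from HDFS (hypothetical path and schema).
    raw = LOAD '/data/indicators/indicators.csv' USING PigStorage(',')
          AS (country:chararray, year:int, arable_land:double, co2_emission:double);

    -- TRANSFORM: filter, group, and aggregate the records.
    recent   = FILTER raw BY year >= 2010;
    by_ctry  = GROUP recent BY country;
    averages = FOREACH by_ctry GENERATE group AS country,
                                        AVG(recent.co2_emission) AS avg_co2;

    -- DUMP: print the result to the screen for debugging.
    DUMP averages;

    -- STORE: persist the result back into HDFS.
    STORE averages INTO '/data/output/avg_co2' USING PigStorage(',');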

Pig's command-line shell is called Grunt. The Pig runtime environment translates a program into a set of MapReduce tasks and then runs them.

Pig has two execution modes:

Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Local mode is specified with the -x flag.

MapReduce Mode - To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default, so no flag is needed to select it.
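For illustration (the script name is a hypothetical placeholder), the two modes are invoked as follows:

    # Local mode: run the script against the local file system.
    pig -x local analysis.pig

    # MapReduce mode (the default): run the script on the Hadoop cluster.
    pig analysis.pig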


REFERENCES

[1] S. Vikram Phaneendra & E. Madhusudhan Reddy, "Big Data - Solutions for RDBMS Problems - A Survey," 12th IEEE/IFIP Network Operations & Management Symposium (NOMS 2010), Osaka, Japan, Apr 19-23, 2010.

[2] Kiran Kumara Reddi & Dnvsl Indira, "Different Technique to Transfer Big Data: A Survey," IEEE Transactions 52(8), Aug. 2013, pp. 2348-2355.

[3] Jimmy Lin, "MapReduce Is Good Enough?" The Control Project, IEEE Computer 32 (2013).

[4] Umasri M.L., Shyamalagowri D., Suresh Kumar S., "Mining Big Data: Current Status and Forecast to the Future," Volume 4, Issue 1, January 2014, ISSN: 2277 128X.

[5] Albert Bifet, "Mining Big Data in Real Time," Informatica 37 (2013), pp. 15-20.

[6] Bernice Purcell, "The Emergence of 'Big Data' Technology and Analytics," Journal of Technology Research, 2013.

[7] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica, "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data," ACM, 2013.

[8] Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst, "The HaLoop Approach to Large-Scale Iterative Data Analysis"; see also "HaLoop: Efficient Iterative Data Processing on Large Clusters," VLDB 2010.

[9] Shadi Ibrahim, Hai Jin, Lu Lu, "Handling Partitioning Skew in MapReduce using LEEN," ACM 51 (2008), pp. 107-113.

[10] Kenn Slagter, Ching-Hsien Hsu, "An Improved Partitioning Mechanism for Optimizing Massive Data Analysis using MapReduce," published online 11 April 2013.

[12] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004.