Power of Python with Big Data

Upload: edureka

Post on 16-Apr-2017

Page 1: Power of Python with Big Data
Page 2: Power of Python with Big Data

What will you learn today?

Introduction to Big Data

Why is Python popular with Big Data?

Running MapReduce in Python

Working with Python NLTK and Hadoop

Demo on Zombie Invasion Model

Data Analytics with Pandas

Page 3: Power of Python with Big Data

Big Data and Hadoop

Page 4: Power of Python with Big Data

Big Data

Lots of Data (Terabytes or Petabytes)

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization

[Figure: word cloud centered on "Big Data", with terms including cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]

Page 5: Power of Python with Big Data

Unstructured Data is Exploding

[Figure: chart with series labeled "Complex, Unstructured" and "Relational"]

2,500 exabytes of new information in 2012, with the internet as the primary driver

Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Page 6: Power of Python with Big Data

Hadoop for Big Data

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model

It is an open-source data management framework with scale-out storage and distributed processing

Page 7: Power of Python with Big Data

Why Python With Big Data?

Page 8: Power of Python with Big Data

Why is Python popular with Big Data?

Data Cleansing / Preparation

Writing Map Reduce Using Python

Leveraging the analytical power of Python on Big Data sets

Libraries like PyDoop and SciPy make Python well suited to Big Data analytics
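The "Writing Map Reduce Using Python" item above can be illustrated with a minimal word-count sketch. It only mimics the map, shuffle, and reduce phases in a single local process; the function names (`mapper`, `reducer`, `run_local`) and the sample lines are hypothetical, and on a real cluster the sort and grouping would be done by Hadoop itself.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts per word.

    Expects pairs sorted by key, which is what the shuffle phase provides.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle (sort by key) -> reduce in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


# Hypothetical sample input, for illustration only.
sample = ["big data with python", "python with hadoop"]
counts = run_local(sample)
print(counts)
```

Keeping the mapper and reducer as plain generators makes them easy to test locally before wiring them into an actual Hadoop job.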

Page 9: Power of Python with Big Data

Demo: Data Preparation / Cleaning

Extracting Data

- Extract Data from Complex JSON for processing

Text analytics

- Remove stop words from a paragraph of text for further processing
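The two preparation steps listed above might look roughly like the following sketch. The JSON document, the `flatten` and `remove_stop_words` helpers, and the tiny `STOP_WORDS` set are all assumptions made for illustration; the actual demo may use different data and a fuller stop-word list such as the one shipped with NLTK.

```python
import json
import string

# Hypothetical nested JSON record, for illustration only.
raw = """
{"user": {"id": 42, "profile": {"name": "Ada", "tags": ["python", "hadoop"]}},
 "posts": [{"text": "This is a post about the power of Python with Big Data"}]}
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a single dict with dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


# A deliberately small stop-word set (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "this", "of", "with", "about", "and", "to"}


def remove_stop_words(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


record = flatten(json.loads(raw))
tokens = remove_stop_words(record["posts.0.text"])
print(record["user.profile.name"], tokens)
```

Flattening to dotted keys is one simple way to make deeply nested fields addressable; a tabular tool could then consume the flat records directly.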

Page 10: Power of Python with Big Data

Demo

Page 11: Power of Python with Big Data

PyDoop – Hadoop with Python

One of the biggest advantages of PyDoop is its HDFS API. This allows you to connect to an HDFS installation, read and write files, and get information on files, directories and global file system properties

The MapReduce API of PyDoop allows you to solve many complex problems with minimal programming effort. Advanced MapReduce concepts such as 'Counters' and 'Record Readers' can be implemented in Python using PyDoop

With the PyDoop package, Python can be used to write Hadoop MapReduce programs and applications that access the HDFS API

Page 12: Power of Python with Big Data

Python NLTK on Hadoop

Page 13: Power of Python with Big Data

Python and Data Science

Python has a diverse range of open-source libraries for just about everything that a data scientist does in day-to-day work

Python and most of its libraries are both open source and free

The day-to-day tasks of a data scientist involve many interrelated but different activities, such as accessing and manipulating data, computing statistics, creating visual reports on that data, building predictive and explanatory models, evaluating these models on additional data, and integrating models into production systems

Page 14: Power of Python with Big Data

SciPy.org

SciPy (pronounced "Sigh Pie") is a Python-based ecosystem of open-source software for mathematics, science, and engineering

NumPy: Base N-dimensional array package

IPython: Enhanced interactive console

SciPy library: Fundamental library for scientific computing

SymPy: Symbolic mathematics

Matplotlib: Comprehensive 2D plotting

pandas: Data structures and analysis

Page 15: Power of Python with Big Data

Demo: Zombie Invasion Model

This is a lighthearted example: a system of ODEs (ordinary differential equations) can be used to model a "zombie invasion", using the equations specified by Philip Munz

The system is given as:

dS/dt = P - B*S*Z - d*S

dZ/dt = B*S*Z + G*R - A*S*Z

dR/dt = d*S + A*S*Z - G*R

There are three scenarios given in the program to show how the zombie apocalypse varies with different initial conditions

This involves solving a system of first-order ODEs of the form dy/dt = f(y, t), where y = [S, Z, R]

Where:

S: the number of susceptible victims
Z: the number of zombies
R: the number of people "killed"
P: the population birth rate
d: the chance of a natural death
B: the chance the "zombie disease" is transmitted (an alive person becomes a zombie)
G: the chance a dead person is resurrected into a zombie
A: the chance a zombie is totally destroyed
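The system above can be integrated numerically in a few lines. The sketch below uses a hand-written fourth-order Runge-Kutta step so that it runs with the standard library alone; in a SciPy setting, `scipy.integrate.odeint` would be the usual tool instead. The parameter values, initial conditions, time span, and the names `zombie_rhs`, `rk4_step`, and `simulate` are assumptions chosen for illustration and are not taken from the slides.

```python
def zombie_rhs(y, P, d, B, G, A):
    """Right-hand side f(y) of the zombie model, with y = (S, Z, R)."""
    S, Z, R = y
    dS = P - B * S * Z - d * S
    dZ = B * S * Z + G * R - A * S * Z
    dR = d * S + A * S * Z - G * R
    return (dS, dZ, dR)


def rk4_step(f, y, dt):
    """Advance y by one classical fourth-order Runge-Kutta step of size dt."""
    k1 = f(y)
    k2 = f(tuple(yi + 0.5 * dt * ki for yi, ki in zip(y, k1)))
    k3 = f(tuple(yi + 0.5 * dt * ki for yi, ki in zip(y, k2)))
    k4 = f(tuple(yi + dt * ki for yi, ki in zip(y, k3)))
    return tuple(
        yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + e)
        for yi, a, b, c, e in zip(y, k1, k2, k3, k4)
    )


def simulate(y0, params, t_end=5.0, steps=1000):
    """Integrate from t = 0 to t_end and return the list of states."""
    def f(y):
        return zombie_rhs(y, **params)

    dt = t_end / steps
    trajectory = [tuple(y0)]
    for _ in range(steps):
        trajectory.append(rk4_step(f, trajectory[-1], dt))
    return trajectory


# Illustrative values only (assumptions, not from the slides).
params = dict(P=0.0, d=0.0001, B=0.0095, G=0.0001, A=0.0001)
y0 = (500.0, 0.0, 0.0)  # initial S, Z, R

trajectory = simulate(y0, params)
S_end, Z_end, R_end = trajectory[-1]
print(f"S={S_end:.2f}  Z={Z_end:.2f}  R={R_end:.2f}")
```

Adding the three equations shows that d(S + Z + R)/dt = P, so with P = 0 the total population stays constant, which gives a quick sanity check on the integration. Different scenarios can be explored by changing `y0` or `params`.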

Page 16: Power of Python with Big Data

Demo

Page 17: Power of Python with Big Data

Python Pandas – Data Frames

Page 18: Power of Python with Big Data

Demo

Page 19: Power of Python with Big Data

Course Details

Become an expert in Python by Edureka

Go to www.edureka.co/python

Edureka's Mastering Python course:

• This course will cover both basic and advanced concepts of Python, such as writing Python scripts, sequence and file operations in Python, Machine Learning in Python, Web Scraping, Map Reduce in Python, Hadoop Streaming, and Python UDFs for Pig and Hive.

• You will also go through important and widely used packages like pydoop, pandas, scikit, numpy, scipy etc.

• Online Live Courses: 30 hours

• Assignments: 40 hours

• Project: 20 hours

• Lifetime Access + 24 X 7 Support

Page 20: Power of Python with Big Data

Thank You

Questions/Queries/Feedback

Recording and presentation will be made available to you within 24 hours