
Real World Data Analysis
PANDAS

PYTHON PACKAGE FOR FAST, FLEXIBLE, AND EXPRESSIVE DATA STRUCTURES

DR. SYED IMTIYAZ HASSAN
ASSISTANT PROFESSOR, DEPARTMENT OF CSE, JAMIA HAMDARD (DEEMED TO BE UNIVERSITY), NEW DELHI, INDIA.
https://syedimtiyazhassan.org
s.imtiyaz@jamiahamdard.ac.in
http://www.jamiahamdard.edu

INTRODUCTION

pandas provides fast, flexible, and expressive data structures.

Designed to make working with “relational” or “labeled” data both easy and intuitive.

Prepared from:

https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html


WELL SUITED FOR

Tabular data with heterogeneously-typed columns.

Ordered and unordered (not necessarily fixed-frequency) time series data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

Any other form of observational / statistical data sets.

The data actually need not be labeled at all to be placed into a pandas data structure.


DATA STRUCTURES

Series: 1D labeled homogeneously-typed array.

DataFrame: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.



SERIES

Create a Series by passing a list of values, letting pandas create a default integer index.

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
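For reference (not on the original slide), the Series above prints with the default RangeIndex, and the integer values are upcast to float64 because of the NaN:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64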

OBJECT CREATION

Create a DataFrame by passing a NumPy array, with a:

datetime index and

labeled columns.

NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.

dates = pd.date_range('20130101', periods=6)
dates

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
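For reference (an addition, not from the slide), the daily date_range above produces a DatetimeIndex like:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')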

OBJECT CREATION

Create a DataFrame by passing a dict of objects that can be converted to series-like.

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2


df2.dtypes
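As a reference for the “one dtype per column” point, the expected output (pandas of the slides' era; not shown on the slide):

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object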

VIEWING DATA

df.head()       # first 5 rows
df.tail(3)      # last 3 rows
df.index        # row labels
df.columns      # column labels
df.describe()   # quick summary statistics
df.T            # transpose
df.to_numpy()   # underlying data as a NumPy array (no index or column labels)
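One caveat worth adding from the source docs: to_numpy() gives one dtype for the whole array, so it is cheap for the all-float df but potentially expensive for the mixed-dtype df2 created earlier:

df.to_numpy()    # float64 array
df2.to_numpy()   # object array, every value boxed as a Python object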


SORTING

By Axis

df.sort_index(axis=1, ascending=False)

By Values

df.sort_values(by='B')
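A couple of common variations, added for completeness (not on the slide):

df.sort_values(by='B', ascending=False)   # descending sort
df.sort_values(by=['A', 'B'])             # sort by several columns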

SELECTION

Getting

Selection by Label
df.loc
df.at

Selection by Position
df.iloc
df.iat

Boolean Indexing


GETTING

Selecting a single column, which yields a Series, equivalent to df.A:

df['A']
df.A

Selecting via [], which slices the rows.


df[0:3]                      # rows 0-2 by integer position (endpoint excluded)

df['20130102':'20130104']    # rows by label, both endpoints included


SELECTION BY LABEL

Selecting on a multi-axis by label.


df1 = pd.DataFrame(np.random.randn(6, 4))
df1.loc[0]                                  # row with label 0 (default integer index)

df.loc[dates[0]]                            # row for the first date
df.loc[:, ['A', 'B']]                       # all rows, columns A and B
df.loc['20130102':'20130104', ['A', 'B']]   # label slice, both endpoints included
df.loc['20130102', ['A', 'B']]              # single label, reduced dimension
df.loc[dates[0], 'A']                       # scalar value
df.at[dates[0], 'A']                        # fast access to a scalar

SELECTION BY POSITION


df.iloc[3]                   # row at position 3
df.iloc[3:5, 0:2]            # rows 3-4, columns 0-1 (integer slices exclude the end)
df.iloc[[1, 2, 4], [0, 2]]   # explicit lists of positions
df.iloc[1:3, :]              # slice rows, all columns
df.iloc[:, 1:3]              # all rows, slice columns
df.iat[1, 1]                 # fast access to a scalar
df.iloc[1, 1]                # same scalar via iloc

BOOLEAN INDEXING


df[df.A > 0]    # rows where column A is positive

df[df > 0]      # keep positive values; everything else becomes NaN

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

df2[df2['E'].isin(['two', 'four'])]    # rows where column E is 'two' or 'four'

SETTING


s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1

df['F'] = s1                 # new column, automatically aligned by index

df.at[dates[0], 'A'] = 0     # set a value by label
df.iat[0, 1] = 0             # set a value by position
df

df2 = df.copy()
df2[df2 > 0] = -df2          # set with a boolean (where) operation
df2

MISSING DATA

Drop any rows that have missing data.

df1 = df.copy()
df1.dropna(how='any')

Filling missing data.

df1.fillna(value=5)

Get the Boolean mask where values are NaN.

pd.isna(df1)
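Note, as an aside: df1 = df.copy() contains no missing values, so the calls above have nothing to drop or fill. In the cited getting-started tutorial, df1 is instead built by reindexing with an extra column 'E' that holds NaN, roughly:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1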

OPERATIONS

Stats

Apply

Concat

Join

Append

Grouping


STATS

Operations in general exclude missing data.

df.mean()     # mean of each column

Same operation on the other axis:

df.mean(1)    # mean of each row

APPLY

Applying functions to the data.

df.apply(np.cumsum)                       # cumulative sum down each column
df.apply(lambda x: x.max() - x.min())     # range of each column

HISTOGRAM

s = pd.Series(np.random.randint(0, 7, size=10))
s

s.value_counts()    # frequency of each value

CONCAT

df = pd.DataFrame(np.random.randn(10, 4))

df


pieces = [df[:3], df[3:7], df[7:]]
pieces

pd.concat(pieces)    # stitch the pieces back into the original 10x4 frame

JOIN

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left


right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

right

pd.merge(left, right, on='key')
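Because both frames repeat the key 'foo', the merge matches every left row with every right row, giving 2 x 2 = 4 result rows (expected output, not shown on the slide):

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5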

JOIN

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left


right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

pd.merge(left, right, on='key')
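With unique keys, each 'key' value matches exactly one row on each side, so the merge has just two rows (expected output, added for contrast):

   key  lval  rval
0  foo     1     4
1  bar     2     5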

APPEND

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df


s = df.iloc[3]
s

df.append(s, ignore_index=True)
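A compatibility note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal equivalent with pd.concat, assuming s is the row extracted above:

pd.concat([df, s.to_frame().T], ignore_index=True)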

GROUPING

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df


df.groupby('A').sum()           # group by column A, then sum
df.groupby(['A', 'B']).sum()    # group by two columns, giving a hierarchical index
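A small caveat for recent pandas versions (not on the slide): when grouping by 'A' alone, the string column 'B' is concatenated into the sum unless you restrict the operation to the numeric columns, e.g.:

df.groupby('A')[['C', 'D']].sum()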

PLOTTING
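The DataFrame below reuses ts.index, but the slide that defines ts did not survive in this transcript. In the cited getting-started tutorial, ts is a cumulative-sum random walk, roughly:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts.plot()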

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df

df.plot()
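Plotting relies on matplotlib; outside a Jupyter notebook you would typically add:

import matplotlib.pyplot as plt
plt.show()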

DATA FILES


Format Type | Data Description     | Reader         | Writer
------------|----------------------|----------------|-------------
text        | CSV                  | read_csv       | to_csv
text        | JSON                 | read_json      | to_json
text        | HTML                 | read_html      | to_html
text        | Local clipboard      | read_clipboard | to_clipboard
binary      | MS Excel             | read_excel     | to_excel
binary      | HDF5 Format          | read_hdf       | to_hdf
binary      | Feather Format       | read_feather   | to_feather
binary      | Parquet Format       | read_parquet   | to_parquet
binary      | Msgpack              | read_msgpack   | to_msgpack
binary      | Stata                | read_stata     | to_stata
binary      | SAS                  | read_sas       | (no writer)
binary      | Python Pickle Format | read_pickle    | to_pickle
SQL         | SQL                  | read_sql       | to_sql
SQL         | Google Big Query     | read_gbq       | to_gbq

df.to_csv('foo.csv')       # write to CSV
pd.read_csv('foo.csv')     # read it back

df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
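The binary formats follow the same read_*/to_* pattern; for example Parquet (assuming an engine such as pyarrow is installed, which goes beyond the slide):

df.to_parquet('foo.parquet')
pd.read_parquet('foo.parquet')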

THANK YOU
