
Real World Data Analysis
PANDAS

PYTHON PACKAGE FOR FAST, FLEXIBLE, AND EXPRESSIVE DATA STRUCTURES

DR. SYED IMTIYAZ HASSAN
ASSISTANT PROFESSOR, DEPARTMENT OF CSE, JAMIA HAMDARD (DEEMED TO BE UNIVERSITY), NEW DELHI, INDIA.
https://syedimtiyazhassan.org
s.imtiyaz@jamiahamdard.ac.in
http://www.jamiahamdard.edu

INTRODUCTION

pandas provides fast, flexible, and expressive data structures.

Designed to make working with “relational” or “labeled” data both easy and intuitive.

Prepared from:

https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html


WELL SUITED FOR

Tabular data with heterogeneously-typed columns.

Ordered and unordered (not necessarily fixed-frequency) time series data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

Any other form of observational / statistical data sets.

The data actually need not be labeled at all to be placed into a pandas data structure.


DATA STRUCTURES

Series: 1D labeled homogeneously-typed array.

DataFrame: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.



SERIES

Create a Series by passing a list of values, letting pandas create a default integer index.

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
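For reference (not on the original slide), the Series above prints with the default RangeIndex, and the integer values are upcast to float64 because of the NaN:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64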

OBJECT CREATION

Create a DataFrame by passing a NumPy array, with a:

datetime index and

labeled columns.

NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.

dates = pd.date_range('20130101', periods=6)
dates

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
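For reference (an addition, not from the slide), the daily date_range above produces a DatetimeIndex like:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')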

OBJECT CREATION

Create a DataFrame by passing a dict of objects that can be converted to series-like.

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2


df2.dtypes
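As a reference for the “one dtype per column” point, the expected output (pandas of the slides' era; not shown on the slide):

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object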

VIEWING DATA

df.head()       # first 5 rows
df.tail(3)      # last 3 rows
df.index        # row labels
df.columns      # column labels
df.describe()   # quick summary statistics
df.T            # transpose
df.to_numpy()   # underlying data as a NumPy array (no index or column labels)
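One caveat worth adding from the source docs: to_numpy() gives one dtype for the whole array, so it is cheap for the all-float df but potentially expensive for the mixed-dtype df2 created earlier:

df.to_numpy()    # float64 array
df2.to_numpy()   # object array, every value boxed as a Python object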


SORTING

By Axis

df.sort_index(axis=1, ascending=False)

By Values

df.sort_values(by='B')
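A couple of common variations, added for completeness (not on the slide):

df.sort_values(by='B', ascending=False)   # descending sort
df.sort_values(by=['A', 'B'])             # sort by several columns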

SELECTION

Getting

Selection by Label
df.loc
df.at

Selection by Position
df.iloc
df.iat

Boolean Indexing


GETTING

Selecting a single column, which yields a Series, equivalent to df.A:

df['A']
df.A

Selecting via [], which slices the rows.


df[0:3]                      # rows 0-2 by integer position (endpoint excluded)

df['20130102':'20130104']    # rows by label, both endpoints included


SELECTION BY LABEL

Selecting on a multi-axis by label.


df1 = pd.DataFrame(np.random.randn(6, 4))
df1.loc[0]                                  # row with label 0 (default integer index)

df.loc[dates[0]]                            # row for the first date
df.loc[:, ['A', 'B']]                       # all rows, columns A and B
df.loc['20130102':'20130104', ['A', 'B']]   # label slice, both endpoints included
df.loc['20130102', ['A', 'B']]              # single label, reduced dimension
df.loc[dates[0], 'A']                       # scalar value
df.at[dates[0], 'A']                        # fast access to a scalar

SELECTION BY POSITION


df.iloc[3]                   # row at position 3
df.iloc[3:5, 0:2]            # rows 3-4, columns 0-1 (integer slices exclude the end)
df.iloc[[1, 2, 4], [0, 2]]   # explicit lists of positions
df.iloc[1:3, :]              # slice rows, all columns
df.iloc[:, 1:3]              # all rows, slice columns
df.iat[1, 1]                 # fast access to a scalar
df.iloc[1, 1]                # same scalar via iloc

BOOLEAN INDEXING


df[df.A > 0]    # rows where column A is positive

df[df > 0]      # keep positive values; everything else becomes NaN

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

df2[df2['E'].isin(['two', 'four'])]    # rows where column E is 'two' or 'four'

SETTING


s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1

df['F'] = s1                 # new column, automatically aligned by index

df.at[dates[0], 'A'] = 0     # set a value by label
df.iat[0, 1] = 0             # set a value by position
df

df2 = df.copy()
df2[df2 > 0] = -df2          # set with a boolean (where) operation
df2

MISSING DATA

Drop any rows that have missing data.

df1 = df.copy()
df1.dropna(how='any')

Filling missing data.

df1.fillna(value=5)

Get the Boolean mask where values are NaN.

pd.isna(df1)
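Note, as an aside: df1 = df.copy() contains no missing values, so the calls above have nothing to drop or fill. In the cited getting-started tutorial, df1 is instead built by reindexing with an extra column 'E' that holds NaN, roughly:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1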

OPERATIONS

Stats

Apply

Concat

Join

Append

Grouping


STATS

Operations in general exclude missing data.

df.mean()     # mean of each column

Same operation on the other axis:

df.mean(1)    # mean of each row

APPLY

Applying functions to the data.

df.apply(np.cumsum)                       # cumulative sum down each column
df.apply(lambda x: x.max() - x.min())     # range of each column

HISTOGRAM

s = pd.Series(np.random.randint(0, 7, size=10))
s

s.value_counts()    # frequency of each value

CONCAT

df = pd.DataFrame(np.random.randn(10, 4))

df


pieces = [df[:3], df[3:7], df[7:]]
pieces

pd.concat(pieces)    # stitch the pieces back into the original 10x4 frame

JOIN

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left


right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

right

pd.merge(left, right, on='key')
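Because both frames repeat the key 'foo', the merge matches every left row with every right row, giving 2 x 2 = 4 result rows (expected output, not shown on the slide):

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5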

JOIN

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left


right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

pd.merge(left, right, on='key')
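With unique keys, each 'key' value matches exactly one row on each side, so the merge has just two rows (expected output, added for contrast):

   key  lval  rval
0  foo     1     4
1  bar     2     5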

APPEND

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df


s = df.iloc[3]
s

df.append(s, ignore_index=True)
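A compatibility note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal equivalent with pd.concat, assuming s is the row extracted above:

pd.concat([df, s.to_frame().T], ignore_index=True)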

GROUPING

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df


df.groupby('A').sum()           # group by column A, then sum
df.groupby(['A', 'B']).sum()    # group by two columns, giving a hierarchical index
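A small caveat for recent pandas versions (not on the slide): when grouping by 'A' alone, the string column 'B' is concatenated into the sum unless you restrict the operation to the numeric columns, e.g.:

df.groupby('A')[['C', 'D']].sum()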

PLOTTING
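The DataFrame below reuses ts.index, but the slide that defines ts did not survive in this transcript. In the cited getting-started tutorial, ts is a cumulative-sum random walk, roughly:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts.plot()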

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df

df.plot()
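Plotting relies on matplotlib; outside a Jupyter notebook you would typically add:

import matplotlib.pyplot as plt
plt.show()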

DATA FILES


Format Type | Data Description     | Reader         | Writer
------------|----------------------|----------------|-------------
text        | CSV                  | read_csv       | to_csv
text        | JSON                 | read_json      | to_json
text        | HTML                 | read_html      | to_html
text        | Local clipboard      | read_clipboard | to_clipboard
binary      | MS Excel             | read_excel     | to_excel
binary      | HDF5 Format          | read_hdf       | to_hdf
binary      | Feather Format       | read_feather   | to_feather
binary      | Parquet Format       | read_parquet   | to_parquet
binary      | Msgpack              | read_msgpack   | to_msgpack
binary      | Stata                | read_stata     | to_stata
binary      | SAS                  | read_sas       | (no writer)
binary      | Python Pickle Format | read_pickle    | to_pickle
SQL         | SQL                  | read_sql       | to_sql
SQL         | Google Big Query     | read_gbq       | to_gbq

df.to_csv('foo.csv')       # write to CSV
pd.read_csv('foo.csv')     # read it back

df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
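The binary formats follow the same read_*/to_* pattern; for example Parquet (assuming an engine such as pyarrow is installed, which goes beyond the slide):

df.to_parquet('foo.parquet')
pd.read_parquet('foo.parquet')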

THANK YOU
