python data structures - best in class for data analysis

Post on 24-Jan-2017

Category:

Technology

Python Data Structures - Best In Class for Data Science

Rajesh Manickadas, July 2016

Objective

The objective of this presentation is to introduce the Python data structures available for data science.

Python Data Structures - Primer

A refresher on Python data structures

Tuples: immutable arrays

Lists: mutable arrays

Dict: hashtables

More…

➔ Built-in types
➔ Data type modules
➔ Numerical and mathematical modules
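The primer above can be illustrated with a minimal sketch. The variable names and values are illustrative assumptions, and `Counter` is just one example of a standard-library data type module:

```python
from collections import Counter

# Tuples: immutable sequences, so item assignment raises TypeError
point = (3, 4)
try:
    point[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# Lists: mutable sequences that can grow in place
values = [1, 2, 3]
values.append(4)

# Dicts: hash tables mapping keys to values
emissions = {"Gas": 12561, "Solid": 159191}
emissions["Liquid"] = 66649

# "More": a container from a data type module (collections)
counts = Counter("banana")
```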

Python Data Structures - Functional Optimization Patterns

The prime objective is to optimize the data structures for functional programming.

Scalars are Python objects designed with functional optimization patterns.

>>> a = 45
>>> b = 45
>>> id(a)
16790784
>>> id(b)
16790784

[Figure: names A and B both refer to the same object, 45, at id 16790784]
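A minimal sketch of the same observation, assuming CPython, which caches small integers as an implementation detail. Other interpreters or larger values may behave differently, so identity of scalars should not be relied upon in real code:

```python
a = 45
b = 45

# Identity: both names refer to one cached int object (CPython detail)
same_object = a is b
# Equality: always true for equal values, on any interpreter
equal_value = a == b

# Mutable containers are never shared this way: two literals, two objects
x = [45]
y = [45]
lists_share = x is y
```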

No native arrays: only lists, lists of lists, lists of lists of lists, and so on.

Good for functional work, but not designed for large data processing.
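As a rough illustration of why nested lists fit functional-style work better than bulk numerical processing, here is a small sketch. The `matrix` data is made up for the example:

```python
# A "2-D array" built from lists: each row is a separate list object
matrix = [[0, 1, 2], [3, 4, 5]]

# Column access needs a Python-level loop over the rows
column_1 = [row[1] for row in matrix]

# Transposing builds entirely new lists; nothing is shared or reinterpreted
transposed = [list(col) for col in zip(*matrix)]

# Elements are arbitrary Python objects, so no uniform dtype is enforced
matrix[0][0] = "zero"
```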

NumPy Data Structures - ndarrays - Basics

Ndarrays - Basic Modelling

Ndarray structure: Metadata + Data Buffer

Metadata
● Flexibility/Shape: designed with data transformation optimization patterns, e.g. transpose
● Reuse: reuse of the data buffer, e.g. views
● Datatype encapsulation: scalars

Data Buffer
● A chunk of memory starting at a particular location
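The metadata/buffer split can be inspected directly. This is a minimal sketch; the dtype is fixed to `np.int64` as an assumption so that the stride values are the same on every platform:

```python
import numpy as np

arr = np.arange(15, dtype=np.int64).reshape(3, 5)

# Metadata: describes how to interpret the buffer
meta = {
    "ndim": arr.ndim,
    "shape": arr.shape,
    "strides": arr.strides,  # bytes to step along each axis
    "dtype": arr.dtype,
}

# Transpose only swaps shape and strides; the data buffer is reused
transposed = arr.T
shares_buffer = np.shares_memory(arr, transposed)

# Datatype encapsulation: indexing a single element yields a NumPy scalar
scalar = arr[1, 2]
```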

NumPy Data Structures - ndarray - Data Transformations

Ndarrays - Data Transformation Optimizations

PyArrayObject

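typedef struct PyArrayObject {
    PyObject_HEAD
    char *data;              /* pointer to the first byte of the data buffer */
    int nd;                  /* number of dimensions */
    npy_intp *dimensions;    /* shape: size along each dimension */
    npy_intp *strides;       /* bytes to step along each dimension */
    PyObject *base;          /* object owning the buffer, e.g. for views */
    PyArray_Descr *descr;    /* data type descriptor */
    int flags;               /* flags describing the memory layout */
    PyObject *weakreflist;   /* support for weak references */
} PyArrayObject;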

>>> import numpy as np
>>> matx = np.arange(15)
>>> id(matx)
139892166884368
>>> mat3x5 = matx.reshape(3,5)
>>> id(mat3x5)
139892020117712
>>> matx
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> matx[4] = 100
>>> matx
array([  0,   1,   2,   3, 100,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14])
>>> mat3x5
array([[  0,   1,   2,   3, 100],
       [  5,   6,   7,   8,   9],
       [ 10,  11,  12,  13,  14]])

[Figure: reshape creates a second ndarray whose metadata points at the same DATA buffer]

Ndarray 1 (matx): dim 1, strides (8,), shape (15,)

Ndarray 2 (mat3x5): dim 2, strides (40, 8), shape (3, 5)
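The strides explain why both arrays see the same element. A minimal sketch follows; `byte_offset` is a hypothetical helper written for this illustration, and the dtype is fixed to `np.int64` as an assumption so that each item is 8 bytes:

```python
import numpy as np

matx = np.arange(15, dtype=np.int64)
mat3x5 = matx.reshape(3, 5)

# reshape returns a view: its base is the original array
is_view = mat3x5.base is matx


def byte_offset(index, strides):
    """Byte offset of an element from the start of the data buffer."""
    return sum(i * s for i, s in zip(index, strides))


# matx[4] and mat3x5[0, 4] resolve to the same location in the buffer
offset_flat = byte_offset((4,), matx.strides)
offset_2d = byte_offset((0, 4), mat3x5.strides)

# Writing through one array is therefore visible through the other
matx[4] = 100
seen_in_view = mat3x5[0, 4]
```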

NumPy Data Structures - More Concepts

The more you know, the more you operate

Broadcasting

N-D Iterators

Indexing

Scalars

Routines

Shapes and Views
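Three of the concepts listed above (broadcasting, indexing, and views) can be sketched briefly. The array contents are illustrative assumptions:

```python
import numpy as np

m = np.arange(15, dtype=np.float64).reshape(3, 5)

# Broadcasting: the (5,) vector of column means is stretched across 3 rows
col_means = m.mean(axis=0)
centered = m - col_means

# Boolean indexing selects elements by a mask and returns a copy
evens = m[m % 2 == 0]

# Basic slicing returns a view onto the same data buffer
first_col = m[:, 0]
slice_shares = np.shares_memory(first_col, m)
mask_shares = np.shares_memory(evens, m)
```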

Pandas - Where Python Meets the Tables

For what people see is what they manipulate

Tables:

Series (1-D)

DataFrame (2-D)

Panels (3-D)

[Figure: a DataFrame combines Data, Indexing, and Set Algebra]

Indexing: immutable, ordered set, hash/dict

Set Algebra: joins, unions, filters, intersections
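A minimal sketch of the one-dimensional and two-dimensional structures. The values are borrowed from the emissions table in this deck, and the variable names are assumptions made for the example:

```python
import pandas as pd

# Series: one-dimensional labelled data
gas = pd.Series([12561, 12990], index=[1997, 1998], name="Gas")

# DataFrame: two-dimensional labelled data, columns sharing one index
frame = pd.DataFrame(
    {"Gas": [12561, 12990], "Liquid": [66649, 71750]},
    index=[1997, 1998],
)

# Selecting a single column of a DataFrame yields a Series
column = frame["Gas"]
```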

Pandas - Indexing

Indexing is the key data structure element of Pandas

● Index is a PandasObject
● The motivation is to enable different implementations of indexing (custom indexing)
● Indexes are immutable
● Multi-indexing / hierarchical indexing
● Time series: DateTime indexes
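The properties listed above can be checked with a short sketch. The index contents and names are illustrative assumptions:

```python
import pandas as pd

years = pd.Index([1997, 1998, 1999, 2000], name="Year")

# Indexes are immutable: item assignment raises TypeError
try:
    years[0] = 1990
    index_mutated = True
except TypeError:
    index_mutated = False

# Indexes support set-like operations
common = years.intersection(pd.Index([1999, 2000, 2001]))
combined = years.union(pd.Index([2001]))

# Hierarchical indexing: one entry per (Year, Fuel) combination
multi = pd.MultiIndex.from_product(
    [[1997, 1998], ["Gas", "Solid"]], names=["Year", "Fuel"]
)

# Time series: a DatetimeIndex built from year strings
dates = pd.to_datetime(["1997", "1998"], format="%Y")
```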

Pandas - Indexing a DataFrame

Indexing organization

Year   Total    Gas     Liquid   Solid
1997   250255   12561   66649    159191
1998   255310   12990   71750    158106
1999   271548   11549   77852    169087
2000   281389   11974   82834    172812
...

Label Index: array, ordered, immutable, hashtable, int64

DateTime Index: array, ordered, immutable, hashtable, timestamp

Data: ndarray, data, dtype, index (axis), columns
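The two index organizations can be sketched with the four rows shown in the table. This is an illustration under the assumption that the year labels map to 1 January of each year when converted to timestamps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Total": [250255, 255310, 271548, 281389],
        "Gas": [12561, 12990, 11549, 11974],
        "Liquid": [66649, 71750, 77852, 82834],
        "Solid": [159191, 158106, 169087, 172812],
    },
    index=pd.Index([1997, 1998, 1999, 2000], name="Year"),
)

# Label index: integer year labels, looked up with .loc
gas_1998 = df.loc[1998, "Gas"]

# The data itself is exposed as a NumPy ndarray
buffer_type = type(df.to_numpy())

# DateTime index: the same rows keyed by timestamps instead of integers
by_date = df.copy()
by_date.index = pd.to_datetime(by_date.index.astype(str), format="%Y")
solid_1999 = by_date.loc[pd.Timestamp("1999-01-01"), "Solid"]
```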

Pandas - Time Series - CO2 Emissions in India (1858-2014)

Time series example

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> dateparse = lambda dates: pd.datetime.strptime(dates, '%Y')
>>> co2emission = pd.read_table('inco2.csv', delimiter=',', header='infer',
...                             parse_dates=True, index_col='Year',
...                             date_parser=dateparse)
>>> co2emission.plot()
<matplotlib.axes.AxesSubplot object at 0x7fd79d20bcd0>
>>> plt.show()
>>> co2solidemission = co2emission['Solid']
>>> co2solidemission.plot()
<matplotlib.axes.AxesSubplot object at 0x7fd79be3bf50>
>>> plt.show()
>>> co2solidemission.mean()
50129.979310344825
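Since `inco2.csv` is not included with the slides, the following sketch replays the same steps on an in-memory CSV holding only the four rows from the indexing table, so its mean differs from the full-series value in the session above. It converts the index with `pd.to_datetime` rather than a `date_parser` callback, which is a substitution made for this example, and it omits plotting:

```python
import io

import pandas as pd

# Stand-in for inco2.csv: only the four rows shown earlier in the deck
sample_csv = io.StringIO(
    "Year,Total,Gas,Liquid,Solid\n"
    "1997,250255,12561,66649,159191\n"
    "1998,255310,12990,71750,158106\n"
    "1999,271548,11549,77852,169087\n"
    "2000,281389,11974,82834,172812\n"
)

co2 = pd.read_csv(sample_csv, index_col="Year")

# Turn the integer years into a DatetimeIndex
co2.index = pd.to_datetime(co2.index.astype(str), format="%Y")

# Select one column as a time series and aggregate it
solid = co2["Solid"]
solid_mean = solid.mean()

# Filter the series by date using the DatetimeIndex
since_1999 = solid[solid.index >= pd.Timestamp("1999-01-01")]
```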

Pandas - More Concepts

● Set algebra: SQL joins, indexing and filtering
● Categorical data
● I/O optimizations
● R integration
● Panels
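The first two items can be sketched as follows. The `notes` frame and its `Source` column are hypothetical data invented for the join, and the emissions values come from the table earlier in the deck:

```python
import pandas as pd

emissions = pd.DataFrame(
    {"Year": [1997, 1998, 1999], "Solid": [159191, 158106, 169087]}
)
# Hypothetical second table to join against
notes = pd.DataFrame({"Year": [1998, 1999, 2000], "Source": ["a", "b", "c"]})

# SQL-style joins: inner keeps matching keys, outer keeps all keys
inner = pd.merge(emissions, notes, on="Year", how="inner")
outer = pd.merge(emissions, notes, on="Year", how="outer")

# Filtering with a boolean mask
high = emissions[emissions["Solid"] > 159000]

# Categorical data: values stored as integer codes plus a category table
fuel = pd.Series(["Gas", "Solid", "Gas", "Liquid"], dtype="category")
```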
