python data structures - best in class for data analysis

Data Structures - Best In Class for Data Science Rajesh Manickadas July 2016

Upload: rajesh-manickadas

Post on 24-Jan-2017


Page 1: Python data structures -   best in class for data analysis

Python Data Structures - Best In Class for Data Science

Rajesh Manickadas, July 2016

Page 2

Objective

The objective of this presentation is to introduce the Python data structures available for data science.

Page 3

Python Data Structures - Primer

A refresher on Python data structures

Tuples: immutable arrays

Lists: mutable arrays

Dict: hashtables

More…

➔ Built-In Types
➔ Data Type Modules
➔ Numerical and Mathematical Modules
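A minimal sketch of the three built-in structures named in this primer. The variable names are illustrative only, and the numbers are borrowed from the 1997 row of the emissions table shown later in the deck.

```python
# Tuples: immutable sequences. Item assignment raises TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# Lists: mutable sequences that can grow in place.
readings = [1, 2, 3]
readings.append(4)

# Dicts: hash tables mapping keys to values.
emissions = {"Gas": 12561, "Liquid": 66649}
emissions["Solid"] = 159191

print(tuple_mutated, readings, emissions)
```

The tuple rejects the assignment, while the list and the dict are both updated in place.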

Page 4

Python Data Structures - Functional Optimization Patterns

The prime objective is to optimize the data structures for functional programming.

Scalars are Python objects designed with functional optimization patterns.

>>> a = 45
>>> b = 45
>>> id(a)
16790784
>>> id(b)
16790784

Diagram: the names A and B both point to the same object, 45, at id 16790784.
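The shared id above can be checked with the `is` operator. This is a sketch that assumes CPython, where small integers (roughly -5 to 256) are cached as an implementation detail; other interpreters may behave differently, and the names `x` and `y` are illustrative.

```python
a = 45
b = 45
# Both names refer to one cached int object in CPython.
same_small = a is b

# Build larger values at run time so the compiler cannot merge them
# into a single constant. They compare equal, but CPython usually
# creates two separate objects for them.
x = int("100000")
y = int("100000")
equal_large = x == y
same_large = x is y  # implementation-dependent, so not relied upon here

print(same_small, equal_large)
```

Because object identity is an implementation detail, code should compare numbers with `==` rather than `is`.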

No arrays: only lists, lists of lists, lists of lists of lists, and so on.

Good for functional work, but not designed for large-scale data processing.
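A short sketch of what working with lists of lists looks like in practice. Nothing enforces a rectangular shape, and element-wise work or a transpose has to be written as explicit loops or comprehensions that build new lists. All names here are illustrative.

```python
matrix = [[0, 1, 2], [3, 4, 5]]

# Transposing builds a brand-new nested list.
transposed = [list(col) for col in zip(*matrix)]

# Element-wise arithmetic needs a nested comprehension.
doubled = [[2 * v for v in row] for row in matrix]

# A nested list may be ragged: rows can have different lengths.
ragged = [[1, 2, 3], [4]]
row_lengths = [len(row) for row in ragged]

print(transposed, doubled, row_lengths)
```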

Page 5

NumPy Data Structures - ndarrays - Basics

Ndarrays - Basic Modelling

Ndarray structure: Metadata + Data Buffer

Metadata
Flexibility/Shape - designed with data transformation optimization patterns, e.g. transpose
Reuse - reuse of the data buffer, e.g. views
Datatype encapsulation - scalars

Data Buffer
A chunk of memory starting at a particular location
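The split between metadata and data buffer can be observed from Python. The sketch below assumes an explicit `int64` dtype so that the stride values are predictable; the variable names are illustrative.

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

# Transpose only rewrites the metadata (shape and strides).
t = a.T
shapes = (a.shape, t.shape)
strides = (a.strides, t.strides)

# Both arrays reuse the same data buffer, so t is a view of a.
shared = np.shares_memory(a, t)
a[0, 1] = 99
seen_through_view = t[1, 0]

# Indexing a single element returns a NumPy scalar of the array's dtype.
element = a[0, 0]

print(shapes, strides, shared, seen_through_view, type(element))
```

The write through `a` is visible through `t` because no data was copied.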

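Page 6

NumPy Data Structures - ndarray - Data Transformations

Ndarrays - Data Transformation Optimizations

PyArrayObject

typedef struct PyArrayObject {
    PyObject_HEAD
    char *data;              /* pointer to the start of the data buffer */
    int nd;                  /* number of dimensions */
    npy_intp *dimensions;    /* shape: size along each dimension */
    npy_intp *strides;       /* bytes to step along each dimension */
    PyObject *base;          /* object the data came from, e.g. for a view */
    PyArray_Descr *descr;    /* element data type descriptor */
    int flags;               /* array flags, e.g. contiguity and writeability */
    PyObject *weakreflist;   /* weak reference support */
} PyArrayObject;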

>>> import numpy as np
>>> matx = np.arange(15)
>>> id(matx)
139892166884368
>>> mat3x5 = matx.reshape(3,5)
>>> id(mat3x5)
139892020117712
>>> matx
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0, 1, 2, 3, 4],
       [ 5, 6, 7, 8, 9],
       [10, 11, 12, 13, 14]])
>>> matx[4] = 100
>>> matx
array([ 0, 1, 2, 3, 100, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0, 1, 2, 3, 100],
       [ 5, 6, 7, 8, 9],
       [ 10, 11, 12, 13, 14]])

Ndarray 1: matx - dim: 1, strides: (8,), shape: (15,)

Ndarray 2: mat3x5 (reshape of matx) - dim: 2, strides: (40, 8), shape: (3, 5)

Both ndarrays point to the same DATA buffer.
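The shape and stride values on this slide can be reproduced directly. This sketch assumes 8-byte integers, so the dtype is set to `int64` explicitly (the default integer size of `np.arange` depends on the platform).

```python
import numpy as np

matx = np.arange(15, dtype=np.int64)
mat3x5 = matx.reshape(3, 5)

# Two distinct ndarray objects with different metadata...
distinct_objects = matx is not mat3x5
meta_1d = (matx.ndim, matx.strides, matx.shape)
meta_2d = (mat3x5.ndim, mat3x5.strides, mat3x5.shape)

# ...but one data buffer: the reshaped array is a view whose base is matx.
is_view = mat3x5.base is matx
matx[4] = 100
value_in_view = mat3x5[0, 4]

print(meta_1d, meta_2d, is_view, value_in_view)
```

The stride pair (40, 8) means that moving one row down skips 5 elements of 8 bytes each, while moving one column right skips a single element.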

Page 7

NumPy Data Structures - More Concepts

The more you know, the more you operate

Broadcasting

N-D Iterators

Indexing

Scalars

Routines

Shapes and Views
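A brief sketch touching three of the concepts listed above: broadcasting, indexing and routines. The array contents and names are illustrative.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
offsets = np.array([10, 20, 30])

# Broadcasting: the 1-D offsets are stretched across both rows of grid.
shifted = grid + offsets

# Indexing: a boolean mask selects the even elements.
evens = grid[grid % 2 == 0]

# Routines: reductions operate along a chosen axis.
column_sums = grid.sum(axis=0)

print(shifted.tolist(), evens.tolist(), column_sums.tolist())
```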

Page 8

Pandas - Where Python Meets the Tables

For what people see is what they manipulate

Tables: Series (1-D), DataFrame (2-D), Panels (3-D)

DataFrame: Data, Indexing

Indexing: immutable, ordered set, hash/dict

Set algebra: joins, unions, filters, intersections
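The set-algebra operations named on this slide might look as follows in pandas. This is a sketch: the two small frames are hypothetical and reuse a few values from the emissions table shown later in the deck.

```python
import pandas as pd

left = pd.DataFrame({"Year": [1997, 1998, 1999],
                     "Gas": [12561, 12990, 11549]})
right = pd.DataFrame({"Year": [1998, 1999, 2000],
                      "Solid": [158106, 169087, 172812]})

# Join: SQL-style inner join on the shared key column.
joined = left.merge(right, on="Year", how="inner")

# Union and intersection: Index objects behave like ordered sets.
left_years = pd.Index(left["Year"])
right_years = pd.Index(right["Year"])
common_years = left_years.intersection(right_years)
all_years = left_years.union(right_years)

# Filter: a boolean mask keeps only the matching rows.
high_gas = left[left["Gas"] > 12000]

print(joined)
print(list(common_years), list(all_years), high_gas["Year"].tolist())
```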

Page 9

Pandas - Indexing

Indexing is the key data structure element of pandas

● Index is a PandasObject
● The motivation is to enable different implementations of indexing - custom indexing
● Indexes are immutable
● Multi indexing / hierarchical indexing
● Time series - DateTime indexes
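The bullet points above can be illustrated with a few lines of pandas. This is a sketch with made-up labels (`Year`, `Fuel`) and values taken from the emissions table shown later in the deck.

```python
import pandas as pd

# Indexes are immutable: item assignment raises TypeError.
idx = pd.Index([1997, 1998, 1999])
try:
    idx[0] = 2000
    index_mutated = True
except TypeError:
    index_mutated = False

# Hierarchical indexing: one label per level.
multi = pd.MultiIndex.from_product(
    [[1997, 1998], ["Gas", "Solid"]], names=["Year", "Fuel"]
)
series = pd.Series([12561, 159191, 12990, 158106], index=multi)
gas_1998 = series.loc[(1998, "Gas")]

# Time series: parsing year strings yields a DatetimeIndex.
dates = pd.to_datetime(["1997", "1998", "1999"], format="%Y")

print(index_mutated, gas_1998, type(dates).__name__)
```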

Page 10

Pandas - Indexing a DataFrame

Indexing Organization

Year   Total    Gas     Liquid   Solid
1997   250255   12561   66649    159191
1998   255310   12990   71750    158106
1999   271548   11549   77852    169087
2000   281389   11974   82834    172812
...

Label Index: array, ordered, immutable, hashtable, int64

DateTime Index: array, ordered, immutable, hashtable, timestamp

Data: ndarray, data, dtype, index (axis), columns
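As a sketch of this organization, the four visible rows of the table can be loaded into a DataFrame with a DateTime index on the row axis and a label index on the columns. Only the rows shown on the slide are used; the elided rows are left out.

```python
import pandas as pd

data = {
    "Total": [250255, 255310, 271548, 281389],
    "Gas": [12561, 12990, 11549, 11974],
    "Liquid": [66649, 71750, 77852, 82834],
    "Solid": [159191, 158106, 169087, 172812],
}
# Row axis: a DateTime index built from the year labels.
years = pd.to_datetime(["1997", "1998", "1999", "2000"], format="%Y")
df = pd.DataFrame(data, index=years)
df.index.name = "Year"

# Column axis: a label index.
column_labels = list(df.columns)

# Data: the values are held as an ndarray with a single dtype here.
values = df.to_numpy()

# Lookups go through both indexes.
solid_1999 = df.loc[pd.Timestamp("1999-01-01"), "Solid"]

print(column_labels, values.shape, values.dtype, solid_1999)
```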

Page 11

Pandas - Time Series - CO2 Emissions in India (1858-2014)

Time Series Example

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> dateparse = lambda dates: pd.datetime.strptime(dates, '%Y')
>>> co2emission = pd.read_table('inco2.csv',delimiter=',',header='infer', parse_dates=True, index_col='Year',date_parser=dateparse)
>>> co2emission.plot()
<matplotlib.axes.AxesSubplot object at 0x7fd79d20bcd0>
>>> plt.show()
>>> co2solidemission = co2emission['Solid']
>>> co2solidemission.plot()
<matplotlib.axes.AxesSubplot object at 0x7fd79be3bf50>
>>> plt.show()
>>> co2solidemission.mean()
50129.979310344825
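The session above was recorded with a 2016-era pandas. In newer releases `pd.datetime` has been removed and the `date_parser` argument is deprecated, so the following is a sketch of one equivalent approach, not the author's code. Since `inco2.csv` is not available here, it reads an in-memory sample containing only the four rows shown on the previous slide, so the mean differs from the full-series value above.

```python
import io
import pandas as pd

# Stand-in for 'inco2.csv': only the four rows visible in the deck.
sample_csv = """Year,Total,Gas,Liquid,Solid
1997,250255,12561,66649,159191
1998,255310,12990,71750,158106
1999,271548,11549,77852,169087
2000,281389,11974,82834,172812
"""

# Read the year as text, then parse it explicitly with a format string.
co2emission = pd.read_csv(io.StringIO(sample_csv), dtype={"Year": str})
co2emission["Year"] = pd.to_datetime(co2emission["Year"], format="%Y")
co2emission = co2emission.set_index("Year")

co2solidemission = co2emission["Solid"]
solid_mean = co2solidemission.mean()

# With matplotlib installed, co2solidemission.plot() followed by
# plt.show() would draw the series as in the original session.
print(solid_mean)
```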

Page 12

Pandas - More Concepts

● Set Algebra - SQL joins, indexing and filtering
● Categorical data
● I/O optimizations
● R integration
● Panels
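Of the concepts listed, categorical data is the easiest to show in a few lines. The sketch below uses hypothetical fuel labels; a categorical Series stores each distinct label once and keeps small integer codes per row.

```python
import pandas as pd

fuel = pd.Series(["Gas", "Liquid", "Solid", "Gas", "Solid"], dtype="category")

# The distinct labels are stored once, in sorted order.
categories = list(fuel.cat.categories)

# Each row holds an integer code pointing into the categories.
codes = fuel.cat.codes.tolist()

# Counting per category works as for any Series.
counts = fuel.value_counts().to_dict()

print(categories, codes, counts)
```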