Python Data Structures - Best in Class for Data Analysis
TRANSCRIPT
![Page 1: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/1.jpg)
Python Data Structures -
Best In Class for
Data Science
Rajesh Manickadas
July 2016
![Page 2: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/2.jpg)
Objective
The objective of this presentation is to introduce the Python data structures available for data science.
![Page 3: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/3.jpg)
Python Data Structures - Primer
A Refresher to Python Data Structures
Tuples: Immutable Arrays
Lists: Mutable Arrays
Dict: Hashtables
More…
➔ Built-In Types
➔ Data Type Modules
➔ Numerical and Mathematical Modules
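As a quick refresher, the three built-in types listed above can be sketched as follows. The variable names are illustrative only, and the two numbers are taken from the 1997 row of the emissions table that appears later in this deck.

```python
# Tuples are immutable: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# Lists are mutable: they can grow and be updated in place.
values = [1, 2]
values.append(3)
values[0] = 10

# Dicts are hash tables: keys are hashed for direct lookup.
emissions_1997 = {"Gas": 12561, "Liquid": 66649}
gas = emissions_1997["Gas"]
```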
![Page 4: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/4.jpg)
Python Data Structures - Functional Optimization Patterns
The prime objective is to optimize the data structures for functional programming.
Scalars are Python objects designed with functional optimization patterns.
>>> a = 45
>>> b = 45
>>> id(a)
16790784
>>> id(b)
16790784
[Diagram: the names a and b both refer to the same object 45, with id 16790784]
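The shared id shown above can be reproduced with a short check. This is a sketch that assumes CPython, where small integers are cached and reused; the sharing is an implementation detail rather than a language guarantee, and the actual id values differ from run to run.

```python
a = 45
b = 45

# On CPython both names are bound to the same cached int object,
# so their identities match. Other implementations may differ.
same_id = id(a) == id(b)
same_object = a is b
```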
No arrays: only lists, lists of lists, lists of lists of lists, and so on.
Good for functional work, but not designed for large-scale data processing.
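A small sketch of what "lists of lists" means in practice. The names are illustrative. Element-wise work needs explicit Python-level iteration, and nothing enforces a common shape or element type across rows.

```python
# A 2 x 3 "matrix" built from nested lists.
matrix = [[1, 2, 3], [4, 5, 6]]

# Element-wise doubling requires iterating in Python.
doubled = [[2 * x for x in row] for row in matrix]

# Nested lists carry no shape or dtype: ragged, mixed-type rows are legal.
ragged = [[1, 2, 3], [4, "five"]]
row_lengths = [len(row) for row in ragged]
```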
![Page 5: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/5.jpg)
NumPy Data Structures - ndarrays - Basics
Ndarrays - Basic Modelling
Ndarray Structure: Metadata and Data Buffer
Metadata
● Flexibility/Shape - Designed with data transformation optimization patterns, e.g. transpose
● Reuse - Reuse of the data buffer, e.g. views
● Datatype Encapsulation - Scalars
Data Buffer
● A chunk of memory starting at a particular location
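The split between metadata and data buffer can be observed directly from Python. The sketch below assumes an explicit int64 dtype so that the stride values are the same on every platform; the variable names are illustrative.

```python
import numpy as np

# A 2 x 3 array of 8-byte integers in one contiguous buffer.
arr = np.arange(6, dtype=np.int64).reshape(2, 3)

# Transposing only rewrites the metadata (shape and strides);
# the result is a view on the same data buffer.
transposed = arr.T
shares_buffer = np.shares_memory(arr, transposed)

# Indexing a single element returns a NumPy scalar that wraps the dtype.
element = arr[0, 0]
```

Here `arr.strides` is `(24, 8)` and `transposed.strides` is `(8, 24)`: the same bytes, walked in a different order.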
![Page 6: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/6.jpg)
NumPy Data Structures - ndarray - Data Transformations
Ndarrays - Data Transformation Optimizations
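PyArrayObject
typedef struct PyArrayObject {
    PyObject_HEAD
    char *data;             /* pointer to the first byte of the data buffer */
    int nd;                 /* number of dimensions */
    npy_intp *dimensions;   /* shape: size along each dimension */
    npy_intp *strides;      /* bytes to step along each dimension */
    PyObject *base;         /* object that owns the buffer (set for views) */
    PyArray_Descr *descr;   /* data type descriptor */
    int flags;              /* flags such as contiguity and writeability */
    PyObject *weakreflist;  /* weak reference support */
} PyArrayObject;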
>>> import numpy as np
>>> matx = np.arange(15)
>>> id(matx)
139892166884368
>>> mat3x5 = matx.reshape(3,5)
>>> id(mat3x5)
139892020117712
>>> matx
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0, 1, 2, 3, 4],
       [ 5, 6, 7, 8, 9],
       [10, 11, 12, 13, 14]])
>>> matx[4] = 100
>>> matx
array([ 0, 1, 2, 3, 100, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0, 1, 2, 3, 100],
       [ 5, 6, 7, 8, 9],
       [ 10, 11, 12, 13, 14]])
Ndarray 1: matx - dim: 1, strides: (8,), shape: (15,)
Ndarray 2: mat3x5 (from reshape) - dim: 2, strides: (40, 8), shape: (3, 5)
Both ndarrays point to the same DATA buffer.
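The stride values above determine where each element lives in the shared buffer. The sketch below assumes an int64 dtype (8 bytes per element, matching the strides on the slide); `byte_offset` is a hypothetical helper written for this illustration, not a NumPy function.

```python
import numpy as np

matx = np.arange(15, dtype=np.int64)
mat3x5 = matx.reshape(3, 5)  # a view: new metadata, same buffer


def byte_offset(index, strides):
    """Byte offset of an element: sum of index[k] * strides[k]."""
    return sum(i * s for i, s in zip(index, strides))


# Element [1, 2] sits 1 * 40 + 2 * 8 = 56 bytes into the buffer,
# which is position 56 // 8 = 7 in the flat array.
offset = byte_offset((1, 2), mat3x5.strides)
flat_position = offset // matx.itemsize
```

Because `mat3x5.base` is `matx`, writing through either name is visible through the other, which is what the session above shows with `matx[4] = 100`.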
![Page 7: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/7.jpg)
NumPy Data Structures - More Concepts
The More You Know, the More You Operate
Broadcasting
N-D Iterators
Indexing
Scalars
Routines
Shapes and Views
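Two of the concepts listed above, broadcasting and indexing, can be sketched briefly. The arrays and names are illustrative only.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])

# Broadcasting: the 1-D row is stretched across both rows of grid.
shifted = grid + row

# Boolean indexing: select the even values (this makes a copy).
evens = grid[grid % 2 == 0]

# Basic slicing: selecting a column returns a view on grid's buffer.
column_view = grid[:, 1]
```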
![Page 8: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/8.jpg)
Pandas - Where Python Meets the Tables
For what people see is what they manipulate
Series (1-D)
DataFrame (2-D)
Panels (3-D)
Tables
DataFrame: Data, Indexing, Set Algebra
Indexing: Immutable, Ordered Set, Hash/Dict
Set Algebra: Joins, Unions, Filters, Intersections
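The set-algebra view of indexes can be sketched with two small frames. The values are taken from the emissions table later in this deck; the frame names are illustrative.

```python
import pandas as pd

left = pd.DataFrame(
    {"Gas": [12561, 12990]},
    index=pd.Index([1997, 1998], name="Year"),
)
right = pd.DataFrame(
    {"Solid": [158106, 169087]},
    index=pd.Index([1998, 1999], name="Year"),
)

# An Index behaves like an ordered set.
common_years = left.index.intersection(right.index)
all_years = left.index.union(right.index)

# An inner join keeps only the labels present in both indexes.
joined = left.join(right, how="inner")

# Selecting one column of a DataFrame yields a Series.
gas_series = left["Gas"]
```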
![Page 9: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/9.jpg)
Pandas - Indexing
Indexing Is the Key Data Structure Element to Pandas
● Index is a PandasObject
● The motivation is to enable different implementations of indexing - custom indexing
● Indexes are immutable
● Multi Indexing/Hierarchical Indexing
● Time Series - DateTime Indexes
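The points above can be checked with a short sketch. The labels are illustrative, and the level names "Fuel" and "Year" are assumptions made for this example.

```python
import pandas as pd

# Indexes are immutable: item assignment raises TypeError.
labels = pd.Index([1997, 1998, 1999])
try:
    labels[0] = 2000
    index_mutated = True
except TypeError:
    index_mutated = False

# Hierarchical (multi-level) indexing.
multi = pd.MultiIndex.from_tuples(
    [("Gas", 1997), ("Gas", 1998), ("Solid", 1997)],
    names=["Fuel", "Year"],
)

# A DatetimeIndex built from year strings.
dates = pd.to_datetime(["1997", "1998", "1999"], format="%Y")
```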
![Page 10: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/10.jpg)
Pandas - Indexing a DataFrame
Indexing Organization
Year Total Gas Liquid Solid
1997 250255 12561 66649 159191
1998 255310 12990 71750 158106
1999 271548 11549 77852 169087
2000 281389 11974 82834 172812
...
Label Index: array, ordered, immutable, hashtable, int64
DateTime Index: array, ordered, immutable, hashtable, timestamp
Data: ndarray, data, dtype, index (axis), columns
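A sketch of the two index organizations, using the four rows of the table above. The names `by_label` and `by_date` are illustrative.

```python
import pandas as pd

data = {
    "Total": [250255, 255310, 271548, 281389],
    "Gas": [12561, 12990, 11549, 11974],
    "Liquid": [66649, 71750, 77852, 82834],
    "Solid": [159191, 158106, 169087, 172812],
}

# Label index: the years are stored as int64 labels.
by_label = pd.DataFrame(data, index=pd.Index([1997, 1998, 1999, 2000], name="Year"))

# DateTime index: the same years stored as timestamps.
by_date = by_label.copy()
by_date.index = pd.to_datetime(by_label.index.astype(str), format="%Y")

gas_1998_by_label = by_label.loc[1998, "Gas"]
gas_1998_by_date = by_date.loc[pd.Timestamp("1998-01-01"), "Gas"]

# The data itself is exposed as a NumPy ndarray.
values = by_label.to_numpy()
```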
![Page 11: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/11.jpg)
Pandas - Time Series - CO2 Emissions in India (1858-2014)
Time Series Example
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> dateparse = lambda dates: pd.datetime.strptime(dates, '%Y')
>>> co2emission = pd.read_table('inco2.csv',delimiter=',',header='infer', parse_dates=True, index_col='Year',date_parser=dateparse)
>>> co2emission.plot()
<matplotlib.axes.AxesSubplot object at 0x7fd79d20bcd0>
>>> plt.show()
>>> co2solidemission = co2emission['Solid']
>>> co2solidemission.plot()
<matplotlib.axes.AxesSubplot object at 0x7fd79be3bf50>
>>> plt.show()
>>> co2solidemission.mean()
50129.979310344825
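The session above depends on the file inco2.csv, which is not included here. As a self-contained sketch of the same steps, the version below reads only the four rows shown in the earlier table from an in-memory string, so its mean differs from the full-dataset value in the session. It parses the years with `pd.to_datetime` after loading instead of a `date_parser` callback, which is a substitution made for this example, and it leaves out the plotting calls.

```python
import io

import pandas as pd

# Four sample rows copied from the table earlier in this deck.
csv_text = """Year,Total,Gas,Liquid,Solid
1997,250255,12561,66649,159191
1998,255310,12990,71750,158106
1999,271548,11549,77852,169087
2000,281389,11974,82834,172812
"""

co2emission = pd.read_csv(io.StringIO(csv_text), index_col="Year")

# Convert the integer years into a DatetimeIndex.
co2emission.index = pd.to_datetime(co2emission.index.astype(str), format="%Y")

co2solidemission = co2emission["Solid"]
solid_mean = co2solidemission.mean()

# A DatetimeIndex allows slicing by date strings.
late_nineties = co2solidemission.loc["1998":"1999"]
```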
![Page 12: Python data structures - best in class for data analysis](https://reader034.vdocuments.mx/reader034/viewer/2022052514/5886e9531a28abba528b59eb/html5/thumbnails/12.jpg)
Pandas - More Concepts
● Set Algebra - SQL Joins, Indexing and Filtering
● Categorical Data
● I/O Optimizations
● R Integration
● Panels
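The first two bullets can be sketched with a SQL-style join, a filter, and a categorical column. The "Solid" values come from the earlier table, while the `notes` frame and its "Level" labels are hypothetical and exist only for this illustration.

```python
import pandas as pd

emissions = pd.DataFrame(
    {"Year": [1997, 1998, 1999], "Solid": [159191, 158106, 169087]}
)
# Hypothetical labels, invented for this example.
notes = pd.DataFrame(
    {"Year": [1998, 1999, 2000], "Level": ["low", "high", "high"]}
)

# SQL-style inner join on the shared key column.
merged = pd.merge(emissions, notes, on="Year", how="inner")

# Filtering with a boolean mask.
high_only = merged[merged["Level"] == "high"]

# Categorical data: labels are stored as small integer codes.
merged["Level"] = pd.Categorical(
    merged["Level"], categories=["low", "high"], ordered=True
)
codes = merged["Level"].cat.codes.tolist()
```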