TRANSCRIPT
02/03/2020 Matplotlib-TUTORIALS
localhost:8889/lab 1/32
PYTHON FOR DATA SCIENCE
Visualisation
Matplotlib & Seaborn
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Third party packages
A large number of third-party packages extend and build on Matplotlib functionality, including several higher-level plotting interfaces (seaborn, holoviews, ggplot, ...), and two projection and mapping toolkits (basemap and cartopy).
matplotlib.pyplot is a collection of command-style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that "axes" here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis).
Tip: In Jupyter Notebook, you can also include %matplotlib inline to display your plots inside your notebook.
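The idea of a "current figure" and "current axes" can be made concrete with a small sketch. It contrasts the implicit pyplot style with the explicit style that keeps references to the figure and axes. The variable names (implicit_ax, implicit_fig, current_is_new) are illustrative only, and the Agg backend line is an assumption so that the sketch also runs as a plain script.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt

# Implicit (stateful) style: pyplot creates a figure and axes on demand
# and remembers them as "current".
plt.figure()
plt.plot([1, 2, 3, 4])
plt.ylabel("vertical")
implicit_ax = plt.gca()   # the axes the calls above were directed to
implicit_fig = plt.gcf()  # the figure that owns those axes

# Explicit (object-oriented) style: keep references and call methods on them.
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4])
ax.set_ylabel("vertical")

# Creating a new figure changed what pyplot treats as "current".
current_is_new = plt.gca() is ax
print(current_is_new)
```

Both styles draw the same line; the explicit style simply avoids relying on which axes happens to be current.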
Load the required libraries
In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# plt.plot?
Plot a point
In [82]:
plt.plot(4, 3, '.')
Out[82]:
[<matplotlib.lines.Line2D at 0x1a201c6d68>]
Plot a number of points
In [102]:
x = np.array([2, 4, 6, 8, 10, 12, 14, 16])
y = x / 2
plt.figure(figsize=(10, 5))
plt.scatter(x, y, c='green')
plt.show()
Label the axes
In [104]:
plt.plot([1, 2, 3, 4])
plt.ylabel('vertical')
plt.xlabel('horizontal')
plt.show()
Create multiple plots with subplots
In [129]:
names = ['class1', 'class2', 'class3', 'class4', 'class5', 'class6', 'class7']
scores = [10, 15, 20, 30, 40, 50, 100]
plt.figure(figsize=(15, 8))
plt.subplot(131)  # find the meaning of the parameter inside the subplot function
plt.bar(names, scores)
plt.subplot(132)
plt.scatter(names, scores)
plt.subplot(133)
plt.plot(names, scores)
plt.suptitle('Categorical Plotting')  # you can give titles, xlabels and ylabels to each of the plots as well
plt.show()
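As a hint for the exercise in the comment: the three digits passed to plt.subplot are read as rows, columns and position, so 131 means a grid of 1 row and 3 columns with the first cell selected. The sketch below builds the same grid with plt.subplots and gives each plot its own title and axis labels. The shortened names and scores lists, the titles and the label texts are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt

names = ['class1', 'class2', 'class3']
scores = [10, 15, 20]

# plt.subplot(131) is shorthand for plt.subplot(1, 3, 1):
# a grid of 1 row and 3 columns, selecting position 1 (counted from the left).
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))

axes[0].bar(names, scores)
axes[0].set_title('Bar')
axes[1].scatter(names, scores)
axes[1].set_title('Scatter')
axes[2].plot(names, scores)
axes[2].set_title('Line')

# Each axes carries its own x and y labels.
for ax in axes:
    ax.set_xlabel('class')
    ax.set_ylabel('score')

fig.suptitle('Categorical Plotting')
```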
What are the differences between add_axes and add_subplot?
The calling signature of add_axes is add_axes(rect), where rect is a list [x0, y0, width, height] giving the lower-left point of the new axes in figure coordinates (x0, y0) and its width and height. So the axes is positioned in absolute coordinates on the canvas.
The calling signature of add_subplot does not directly provide the option to place the axes at a predefined position. Instead, it lets you specify where the axes should sit on a subplot grid. The usual and easiest way to specify this position is the three-integer notation,
e.g. ax = fig.add_subplot(231)
In this example a new axes is created at the first position (1) on a grid of 2 rows and 3 columns. To produce only a single axes, add_subplot(111) would be used (first plot on a 1 by 1 subplot grid). (In newer matplotlib versions, add_subplot() without any arguments is possible as well.)
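The difference can be seen by creating one axes each way on the same figure and reading back where it ended up. This is a minimal sketch; the rect values and the names grid_ax and free_ax are arbitrary choices made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

# add_subplot: position given by a grid. 231 = 2 rows, 3 columns, 1st cell.
grid_ax = fig.add_subplot(231)

# add_axes: position given directly as [x0, y0, width, height],
# each a fraction of the figure size (0 = left/bottom edge, 1 = right/top edge).
free_ax = fig.add_axes([0.55, 0.1, 0.4, 0.3])

# Bounding boxes in figure coordinates: (x0, y0, width, height)
grid_bounds = grid_ax.get_position().bounds
free_bounds = free_ax.get_position().bounds
print("add_subplot(231):", [round(v, 3) for v in grid_bounds])
print("add_axes(rect):  ", [round(v, 3) for v in free_bounds])
```

The axes from add_axes reports exactly the rect it was given, while the position of the add_subplot axes is derived from the grid and the figure's margins.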
Seaborn
Seaborn comes with a large number of high-level interfaces and customized themes that matplotlib lacks; with matplotlib alone it can be difficult to work out the settings that make plots attractive.
Matplotlib functions generally do not work with DataFrames as conveniently as seaborn's do.
NB: Seaborn visualisations are based on matplotlib
In [107]:
import seaborn as sns
Let's load a dataset to be used
In [108]:
ourdata=pd.read_excel("Pokemon.xls")
In [111]:
ourdata.head()
Out[111]:
   Name        Type 1  Type 2  Total  HP  Attack  Defense  Atk  Def  Speed  Stage  Legenda
0  Bulbasaur   Grass   Poison    318  45      49       49   65   65     45      1  Fal
1  Ivysaur     Grass   Poison    405  60      62       63   80   80     60      2  Fal
2  Venusaur    Grass   Poison    525  80      82       83  100  100     80      3  Fal
3  Charmander  Fire    NaN       309  39      52       43   60   50     65      1  Fal
4  Charmeleon  Fire    NaN       405  58      64       58   80   65     80      2  Fal
In [112]:
# lmplot() quickly plots the linear relationship between two variables; "lm" stands for linear regression model
sns.lmplot(x='Attack', y='Defense', data=ourdata)
plt.show()
No regression line and adding hue
Set fit_reg=False to remove the regression line
In [43]:
sns.lmplot(x='Attack', y='Defense', data=ourdata, fit_reg=False, hue='Stage')
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x1a1e37a860>
We set hue='Stage' to color our points by the Pokémon's evolution stage. The hue argument is very useful because it lets you express a third dimension of information using color.
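For readers without the Pokémon file, the same call can be tried on synthetic data. The sketch below assumes a hypothetical DataFrame named demo that reuses the column names Attack, Defense and Stage; all values are randomly generated and carry no meaning.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Pokémon table: same column names, made-up values.
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    'Attack': rng.integers(20, 120, size=n),
    'Stage': rng.integers(1, 4, size=n),  # 1, 2 or 3
})
demo['Defense'] = demo['Attack'] * 0.5 + rng.normal(0, 10, size=n) + 20

# fit_reg=False drops the regression line; hue colours the points by Stage.
grid = sns.lmplot(x='Attack', y='Defense', data=demo, fit_reg=False, hue='Stage')
stage_levels = sorted(demo['Stage'].unique())
print(stage_levels)
```

Each distinct value of the hue column gets its own colour and legend entry, which is what lets colour act as a third dimension.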
In [44]:
fig = plt.figure()
# rect = [x0, y0, width, height] in figure coordinates: the lower-left point of the new axes, then its size
a1 = fig.add_axes([0, 0, 1, 1])
x = np.arange(1, 10)
a1.plot(x, np.exp(x), 'r')
a1.set_title('range of numbers')
plt.ylim(0, 10000)
plt.xlim(0, 10)
# explicitly set x and y labels
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
In [113]:
ourdata.head()
Out[113]:
   Name        Type 1  Type 2  Total  HP  Attack  Defense  Atk  Def  Speed  Stage  Legenda
0  Bulbasaur   Grass   Poison    318  45      49       49   65   65     45      1  Fal
1  Ivysaur     Grass   Poison    405  60      62       63   80   80     60      2  Fal
2  Venusaur    Grass   Poison    525  80      82       83  100  100     80      3  Fal
3  Charmander  Fire    NaN       309  39      52       43   60   50     65      1  Fal
4  Charmeleon  Fire    NaN       405  58      64       58   80   65     80      2  Fal
In [45]:
plt.figure(figsize=(15, 15))
sns.boxplot(data=ourdata)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1df3a860>
In [46]:
# drop the unnecessary columns
ourdata1 = ourdata.drop(['Total', 'Legendary', 'Stage'], axis=1)
ourdata1.head()
Out[46]:
   Name        Type 1  Type 2  HP  Attack  Defense  Atk  Def  Speed
0  Bulbasaur   Grass   Poison  45      49       49   65   65     45
1  Ivysaur     Grass   Poison  60      62       63   80   80     60
2  Venusaur    Grass   Poison  80      82       83  100  100     80
3  Charmander  Fire    NaN     39      52       43   60   50     65
4  Charmeleon  Fire    NaN     58      64       58   80   65     80
In [114]:
plt.figure(figsize=(15, 15))
sns.boxplot(data=ourdata1)
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2634e908>
In [115]:
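# calculate correlations between the numeric columns only;
# recent pandas versions raise an error if text columns are passed to corr()
corr = ourdata1.select_dtypes(include='number').corr()
corr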
Out[115]:
                HP     Attack    Defense        Atk        Def      Speed
HP        1.000000   0.306768   0.119782   0.236649   0.490978  -0.040939
Attack    0.306768   1.000000   0.491965   0.146312   0.369069   0.194701
Defense   0.119782   0.491965   1.000000   0.187569   0.139912  -0.053252
Atk       0.236649   0.146312   0.187569   1.000000   0.522907   0.411516
Def       0.490978   0.369069   0.139912   0.522907   1.000000   0.392656
Speed    -0.040939   0.194701  -0.053252   0.411516   0.392656   1.000000
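Each entry of a correlation matrix like this one is a Pearson correlation coefficient, the default method of DataFrame.corr(). The sketch below recomputes one entry by hand on a tiny table so the number is less of a black box. It uses only the five rows shown by head() earlier, so the result will not match the full-data values, and the helper name pearson_r is hypothetical.

```python
import numpy as np
import pandas as pd

# The five rows shown by head() above (Attack, Defense, Speed columns only).
demo = pd.DataFrame({
    'Attack':  [49, 62, 82, 52, 64],
    'Defense': [49, 63, 83, 43, 58],
    'Speed':   [45, 60, 80, 65, 80],
})

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

corr_demo = demo.corr()  # Pearson by default
by_hand = pearson_r(demo['Attack'], demo['Defense'])
print(round(by_hand, 6), round(corr_demo.loc['Attack', 'Defense'], 6))
```

The matrix is symmetric with ones on the diagonal, because every column correlates perfectly with itself and the correlation of a with b equals that of b with a.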
In [116]:
plt.figure(figsize=(10, 10))
sns.heatmap(corr)  # create the heatmap
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a26b2ff60>
In [117]:
ourdata1.head()
Out[117]:
   Name        Type 1  Type 2  HP  Attack  Defense  Atk  Def  Speed
0  Bulbasaur   Grass   Poison  45      49       49   65   65     45
1  Ivysaur     Grass   Poison  60      62       63   80   80     60
2  Venusaur    Grass   Poison  80      82       83  100  100     80
3  Charmander  Fire    NaN     39      52       43   60   50     65
4  Charmeleon  Fire    NaN     58      64       58   80   65     80
Univariate Visualisation
Distplot
The most convenient way to take a quick look at a univariate distribution in seaborn is the distplot() function. By default, this draws a histogram and fits a kernel density estimate (KDE). It is meant for a univariate set of observations, that is, a single variable, so we choose one particular column of the dataset and visualise it through a histogram.
In [118]:
# distplot() is deprecated from seaborn 0.11 onwards in favour of histplot() and displot()
sns.distplot(ourdata1['Defense'])
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a26bc97f0>
Use a boxplot to confirm your distribution
In [51]:
sns.boxplot(x=ourdata1['Defense'])
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d590390>
You can explicitly turn off the KDE
Read about KDE:
https://pythontic.com/pandas/series-plotting/kernel%20density%20estimation%20plot
https://pythontic.com/pandas/dataframe-plotting/kernel%20density%20estimation%20plot
https://en.wikipedia.org/wiki/Kernel_density_estimation
https://www.statsmodels.org/stable/examples/notebooks/generated/kernel_density.html
In [119]:
sns.distplot(ourdata1['Defense'], kde=False)
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a26d92240>
We can also use kdeplot() to plot only the KDE
In [120]:
sns.kdeplot(ourdata1['Defense'])
plt.show()
We can also shade the area under the KDE curve for better visualisation by using shade=True
In [54]:
# shade=True fills the area under the curve (newer seaborn releases call this option fill=True)
sns.kdeplot(ourdata1['Defense'], shade=True)
plt.show()
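What a KDE curve represents can be sketched in a few lines of NumPy: a small Gaussian bump is placed on every observation and the bumps are averaged. This is a simplified illustration, not seaborn's implementation. The function name gaussian_kde_1d, the synthetic defense values and the rule-of-thumb bandwidth are all assumptions made for the example.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical 'Defense'-like values, generated for illustration only.
rng = np.random.default_rng(42)
defense = rng.normal(loc=70, scale=25, size=150)

# Rule-of-thumb bandwidth (an assumption; seaborn's default may differ).
h = 1.06 * defense.std(ddof=1) * len(defense) ** (-1 / 5)

grid = np.linspace(defense.min() - 4 * h, defense.max() + 4 * h, 500)
density = gaussian_kde_1d(defense, grid, h)

# A density integrates to (approximately) 1 over its support.
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely, which is why the same data can produce differently shaped KDE plots.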
Bivariate Distribution Visualisation
Jointplot
In [121]:
ourdata.head()
Out[121]:
   Name        Type 1  Type 2  Total  HP  Attack  Defense  Atk  Def  Speed  Stage  Legenda
0  Bulbasaur   Grass   Poison    318  45      49       49   65   65     45      1  Fal
1  Ivysaur     Grass   Poison    405  60      62       63   80   80     60      2  Fal
2  Venusaur    Grass   Poison    525  80      82       83  100  100     80      3  Fal
3  Charmander  Fire    NaN       309  39      52       43   60   50     65      1  Fal
4  Charmeleon  Fire    NaN       405  58      64       58   80   65     80      2  Fal
In [55]:
# pass x and y by keyword: seaborn 0.12 and later no longer accept them positionally
sns.jointplot(x=ourdata['Defense'], y=ourdata['Attack'])
Out[55]:
<seaborn.axisgrid.JointGrid at 0x1a1dac25f8>
As you can see, a histogram is plotted for Defense and another for Attack, with a scatter plot of Defense against Attack between them.
Confirm their correlation
In [56]:
ourdata1[['Defense', 'Attack']].corr()
Out[56]:
          Defense    Attack
Defense  1.000000  0.491965
Attack   0.491965  1.000000
We can also explicitly set the 'kind' of visualisation to be displayed,
e.g. kind="scatter", "reg", "resid", "kde" or "hex"
In [96]:
# NB: use Shift+Tab to get more info about a particular function
sns.jointplot(x=ourdata['Defense'], y=ourdata['Attack'], kind='kde')
plt.show()
In [122]:
sns.jointplot(x=ourdata['Defense'], y=ourdata['Attack'], kind='reg')
plt.show()
# kind='reg' adds a regression fit; use kind='resid' when you want to examine the residuals
Visualising more than two variables: Pairwise Bivariate Distributions Using pairplot()
In [98]:
sns.pairplot(ourdata[['Defense', 'Attack', 'HP']], kind='scatter')
plt.show()
You can change the diagonal kind
Let's try diag_kind='kde'
In [99]:
sns.pairplot(ourdata[['Defense', 'Attack', 'HP']], kind='scatter', diag_kind='kde')
plt.show()
In [61]:
ourdata.head()
Out[61]:
   Name        Type 1  Type 2  Total  HP  Attack  Defense  Atk  Def  Speed  Stage  Legenda
0  Bulbasaur   Grass   Poison    318  45      49       49   65   65     45      1  Fal
1  Ivysaur     Grass   Poison    405  60      62       63   80   80     60      2  Fal
2  Venusaur    Grass   Poison    525  80      82       83  100  100     80      3  Fal
3  Charmander  Fire    NaN       309  39      52       43   60   50     65      1  Fal
4  Charmeleon  Fire    NaN       405  58      64       58   80   65     80      2  Fal
Categorical Data Visualisation
In [123]:
data = pd.read_csv('Automobile.csv')
In [125]:
data.head()
Out[125]:
   symboling  normalized_losses  make         fuel_type  aspiration  number_of_doors  body_style
0          3                168  alfa-romero  gas        std         two              convertible
1          3                168  alfa-romero  gas        std         two              convertible
2          1                168  alfa-romero  gas        std         two              hatchback
3          2                164  audi         gas        std         four             sedan
4          2                164  audi         gas        std         four             sedan
5 rows × 26 columns
In [64]:
sns.stripplot(x=data['number_of_doors'], y=data['horsepower'])
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1e4352b0>
Cars with two doors tend to have higher horsepower than cars with four doors
In [65]:
sns.boxplot(x=data['number_of_doors'], y=data['horsepower'])
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1e37ab00>
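A visual comparison between categories can be backed up with a per-group numeric summary. The sketch below uses pandas groupby on a small hypothetical table; it reuses the column names number_of_doors and horsepower, but the six rows are made up and are not taken from Automobile.csv.

```python
import pandas as pd

# Hypothetical rows using the column names from Automobile.csv; values are made up.
cars = pd.DataFrame({
    'number_of_doors': ['two', 'two', 'two', 'four', 'four', 'four'],
    'horsepower':      [111, 154, 160, 102, 115, 110],
})

# Per-category summary: count, mean, quartiles and extremes of horsepower.
summary = cars.groupby('number_of_doors')['horsepower'].describe()
medians = cars.groupby('number_of_doors')['horsepower'].median()
print(summary[['count', 'mean', '50%']])
```

The 50% column is the median, the line drawn inside each box of a box plot, so the table and the plot can be read against each other.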
In [66]:
sns.barplot(x=data['number_of_doors'], y=data['horsepower'])
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d823d30>
Perform similar operations with the other variables
Read more: https://seaborn.pydata.org/introduction.html
MrBriit