difference between r and python

34
7/23/2019 Difference Between R and Python http://slidepdf.com/reader/full/difference-between-r-and-python 1/34 pdfcrowd com open in browser PRO version Are you a developer? Try out the HTML to PDF API HOME   DATAQUEST R vs Python: head to head data analysis 07 OCT 2015 on comparison There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit. We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here . For each step in the analysis, we’ll show the

Upload: animesh-dubey

Post on 17-Feb-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 1/34

pdfcrowd comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

HOME   DATAQUEST

R vs Python: head to head dataanalysis07 OCT 2015 on comparison

There have been dozens of articles written comparing Python and R from a subjectivestandpoint. We’ll add our own views at some point, but this article aims to look at the

languages more objectively. We’ll analyze a dataset side by side in Python and R, and

show what code is needed in both languages to achieve the same result. This will let

us understand the strengths and weaknesses of each language without the

conjecture. At Dataquest, we teach both languages, and think both have a place in adata science toolkit.

We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014

season. You can download the file here. For each step in the analysis, we’ll show the

Page 2: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 2/34

Page 3: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 3/34

pdfcrowd comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

the end of this step, the csv file has been loaded by both languages into a dataframe.

Find the number of players

R

dim(nba)

Python

nba.shape

[1] 481 31

 

(481, 31)

 

Page 4: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 4/34

pdfcrowd comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

This prints out the number of players and the number of columns in each. We have

481  rows, or players, and 31  columns containing data on the players.

Look at the first row of the dataR

head(nba, 1)

Python

nba.head(1)

  player pos age bref_team_id

1 Quincy Acy SF 23 TOT

[output truncated]

 

Page 5: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 5/34

pdfcrowd comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

This is pretty much identical. Both print out the first row of the data, and the syntax

is very similar. Python is more object-oriented here, and head  is a method on the

dataframe object, and R has a separate head  function. This is a common theme you’ll

see as you start to do analysis with these languages, where Python is more object-

oriented, and R is more functional.

Find the average of each statisticLet’s find the average value for each statistic. The columns, as you can see, have

names like fg  (field goals made), and ast  (assists). These are the season statistics

for the player. If you want a fuller explanation of all the stats, look here.

  player pos age bref_team_id

0 Quincy Acy SF 23 TOT

[output truncated]

 

R

Page 6: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 6/34

pdfcrowd comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

sapply(nba, mean, na.rm=TRUE)

Python

nba.mean()

player NA

pos NA

age 26.5093555093555

bref_team_id NA

[output truncated]

 

age 26.509356

g 53.253638

gs 25.571726

[output truncated]

 

Page 7: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 7/34

pdfcrowd comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

There are some major differences in approach here. In both, we’re applying a

function across the dataframe columns. In python, the mean method on dataframes

will find the mean of each column by default.

In R, taking the mean of string values will just result in NA  – not available. However,

we do need to ignore NA  values when we take the mean (requiring us to pass

na.rm=TRUE  into the mean  function). If we don’t, we end up with NA  for the mean of 

columns like x3p. . This column is three point percentage. Some players didn’t take

three point shots, so their percentage is missing. If we try the mean  function in R, we

get NA  as a response, unless we specify na.rm=TRUE , which ignores NA  values when

taking the mean. The .mean()  method in Python already ignores these values by 

default.

Make pairwise scatterplotsOne common way to explore a dataset is to see how different columns correlate to

others. We’ll compare the ast , fg , and trb  columns.

Page 8: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 8/34

Page 9: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 9/34

Page 10: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 10/34

df di b PRO i Are you a developer? Try out the HTML to PDF API

Page 11: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 11/34

df di b PRO i A d l ? T t th HTML t PDF API

We get very similar plots in the end, but this shows how the R data science ecosystem

has many smaller packages (GGally  is a helper package for ggplot2, the most-used R

plotting package), and many more visualization packages in general. In Python,

matplotlib is the primary plotting package, and seaborn is a widely used layer over

matplotlib. With visualization in Python, there is usually one main way to do

something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for

instance).

Page 12: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 12/34df di b PRO i A d l ? T t th HTML t PDF API

Make clusters of the playersOne good way to explore this kind of data is to generate cluster plots. These will show

which players are most similar.

R

library(cluster)

set.seed(1)

isGoodCol <- function(col){

  sum(is.na(col)) == 0 && is.numeric(col)

}

goodCols <- sapply(nba, isGoodCol)

clusters <- kmeans(nba[,goodCols], centers=5)

labels <- clusters$cluster

Python

from sklearn.cluster import KMeans

Page 13: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 13/34

Page 14: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 14/34df di b PRO iAre you a developer? Try out the HTML to PDF API

Plot players by clusterWe can now plot out the players by cluster to discover patterns. One way to do this is

to first use PCA to make our data 2-dimensional, then plot it, and shade each point

according to cluster association.

R

nba2d <- prcomp(nba[,goodCols], center=TRUE)

twoColumns <- nba2d$x[,1:2]

clusplot(twoColumns, labels)

Python

from sklearn.decomposition import PCA

pca_2 = PCA(2)

plot_columns = pca_2.fit_transform(good_columns)

plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)

plt.show()

Page 15: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 15/34df di b PRO iAre you a developer? Try out the HTML to PDF API

Page 16: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 16/34

Are you a developer? Try out the HTML to PDF API

Page 17: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 17/34

pdfcrowd.comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

Made a scatter plot of our data, and shaded or changed the icon of the data according

to cluster. In R, the clusplot  function was used, which is part of the cluster library.

We performed PCA via the pccomp  function that is builtin to R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to

create the plot.

Split into training and testing setsIf we want to do supervised machine learning, it’s a good idea to split the data into

Page 18: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 18/34

source dataframe – this makes the code much more concise In R there are packages

Page 19: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 19/34

pdfcrowd.comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

source dataframe this makes the code much more concise. In R, there are packages

to make sampling simpler, but aren’t much more concise than using the built-in

sample  function. In both cases, we set a random seed to make the results

reproducible.

Univariate linear regressionLet’s say we want to predict number of assists per player from field goals made per

player.

R

fit <- lm(ast ~ fg, data=train)

predictions <- predict(fit, test)

Python

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(train[["fg"]], train["ast"])

Page 20: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 20/34

Page 21: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 21/34

Page 22: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 22/34

Page 23: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 23/34

Page 24: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 24/34

Page 25: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 25/34

import requests

Page 26: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 26/34

pdfcrowd.comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

In Python, the requests package makes downloading web pages easy, with aconsistent API for all request types. In R, RCurl provides a similarly simple way to

make requests. Both download the webpage to a character datatype. Note: this step is

unnecessary for the next step in R, but is shown for comparisons’s sake.

Extract player box scoresNow that we have the web page, we’ll need to parse it to extract scores for players.

p q

url = "http://www.basketball-reference.com/boxscores/201506140GSW.html"

data = requests.get(url).content

R

library(rvest)

page <- read_html(url)

table <- html_nodes(page, ".stats_table")[3]

rows <- html_nodes(table, "tr")

Page 27: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 27/34

  rows <- html_nodes(teamData, "tr")

Page 28: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 28/34

pdfcrowd.comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

  lapply(seq_along(rows), extractRow, rows=rows)

}

data <- lapply(teams, scrapeData)

Python

from bs4 import BeautifulSoup

import re

soup = BeautifulSoup(data, 'html.parser')

box_scores = []

for tag in soup.find_all(id=re.compile("[A-Z]{3,}_basic")):

  rows = []

  for i, row in enumerate(tag.find_all("tr")):

  if i == 0:

  continue

  elif i == 1:

  tag = "th"

Page 29: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 29/34

Page 30: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 30/34

As we saw from functions like lm , predict , and others, R lets functions do most of 

Page 31: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 31/34

pdfcrowd.comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

the work. Contrast this to the LinearRegression  class in Python, and the sample

method on dataframes.

R has more data analysis built-ins, Python relieson packagesWhen we looked at summary statistics, we could use the summary  built-in function in

R, but had to import the statsmodels  package in Python. The dataframe is a built-in

construct in R, but must be imported via the pandas  package in Python.

Python has “main” packages for data analysistasks, R has a larger ecosystem of smallpackagesWith Python, we can do linear regression, random forests, and more with the scikit-

learn package. It offers a consistent API, and is well-maintained. In R, we have a

greater diversity of packages, but also greater fragmentation and less consistency 

(linear regression is a builtin, lm , randomForest  is a separate package, etc).

Page 32: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 32/34

were inspired by R dataframes, the rvest package was inspired by BeautifulSoup),

Page 33: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 33/34

pdfcrowd.comopen in browser PRO version Are you a developer? Try out the HTML to PDF API

and both ecosystems continue to grow stronger. It’s remarkable how similar the

syntax and approaches are for many common tasks in both languages.

Last wordAt Dataquest, we primarily teach Python, but have recently been adding lessons on R.

We see both languages as complementary, and although we think Python is stronger

in more areas, R is an effective language. It can be used either as a complement for

Python in areas like data exploration and statistics, or as your sole data analysis tool.

As this walkthrough proves, both languages have a lot of similarities in syntax and

approach, and you can’t go wrong with one, the other, or both.

Liked the post? Visit Dataquest now to learn data science interactively in your

browser.

Page 34: Difference Between R and Python

7/23/2019 Difference Between R and Python

http://slidepdf.com/reader/full/difference-between-r-and-python 34/34