Shell and SQL Baseball and Airport Analysis
Austin Kinion
Part 1 (i): Compute the number of outbound flights for each of the five airports OAK, SMF, LAX, SFO and JFK, and sort these counts from largest to smallest.
Done in Shell
#Get the data from the website
$ curl -O http://eeyore.ucdavis.edu/stat141/Data/Airline2012_13.tar.gz
#Output showing how long the download took
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291M  100  291M    0     0  8803k      0  0:00:33  0:00:33 --:--:--  7512k
So it took 33 seconds to download the data from the website in the shell.
#Extract each month's csv file
tar -zxvf Airline2012_13.tar.gz
#Create a file named ORIGIN.txt with the count of each value in the origin column (field 15) over all months
time cut -d , -f 15 201[23]_*.csv | sort | uniq -c | sort -nr > ORIGIN.txt
The time it took was: user 1m55.666s
#Pull the lines for the five airports out of ORIGIN.txt (the counts are already sorted from largest to smallest)
#and put them into a new file, ORIGIN2.txt
time egrep 'OAK|SMF|LAX|SFO|JFK' ORIGIN.txt > ORIGIN2.txt
The time it took was: real 0m0.004s
So that's a total time of about 2 minutes for the shell.
Taking a look at the file created from shell:
File: ORIGIN2.txt:
222029 "LAX"
169734 "SFO"
105097 "JFK"
 44911 "OAK"
 43145 "SMF"
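The two-step approach above (count every origin, then grep out five of them) could also be done in a single pass. The following is only a sketch on made-up toy rows, not the code that produced the counts above: `make_row` and the toy file are hypothetical helpers, and like the `cut` command it assumes that no field before column 15 contains a comma.

```shell
# Hypothetical helper: 14 placeholder fields, then the quoted ORIGIN code in column 15.
make_row() {
  printf '1,2,3,4,5,6,7,8,9,10,11,12,13,14,"%s",16\n' "$1"
}

# Toy stand-in for the monthly csv files (invented rows, not the real data).
tmpdir=$(mktemp -d)
for code in LAX SFO LAX ORD JFK LAX SFO; do
  make_row "$code"
done > "$tmpdir/toy.csv"

# Count column 15 only when it is one of the five airports, then sort largest first.
origin_counts=$(
  awk -F, '
    BEGIN {
      n = split("OAK SMF LAX SFO JFK", want, " ")
      for (i = 1; i <= n; i++) keep["\"" want[i] "\""] = 1
    }
    ($15 in keep) { count[$15]++ }
    END { for (a in count) print count[a], a }
  ' "$tmpdir/toy.csv" | sort -nr
)
echo "$origin_counts"
rm -rf "$tmpdir"
```

Matching on the field itself, rather than grepping the finished counts file, avoids picking up any other value that happens to contain one of the five codes.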
Part 1 (i): Done in R
setwd("~/Desktop/Airline2012_13")
list.files()
#concatenate all five airport names into variable: airports
airports = c("LAX", "SFO", "JFK", "OAK", "SMF")
#make origincounts a vector of five 0's to fill in later
origincounts = numeric(5)
#Use filenames to later iterate over all files
filenames = list.files()
#loop over all csv files and pull out origin column
system.time(for (i in 1:length(filenames)) {
  #Read in the current monthly csv file
  cur.csv = read.csv(filenames[i])
  #Create a table with the origin counts for each airport
  origintable = table(cur.csv$ORIGIN)
  #Keep just the five airports that were asked for and add them to the running totals
  origincounts = origincounts + origintable[airports]
})
The output matches the shell output, which is a good sign.
#R output to show it matches the shell output:
origincounts
#   LAX    SFO    JFK    OAK    SMF
#222029 169734 105097  44911  43145
And the time it took was:
#   user  system elapsed
#438.322  15.523 458.014
So, using the user time, R took about 7.3 minutes, which is over 3 times as long as the shell took. I conclude that when you need to pull a few items out of a very large data set, it is faster to do it in the shell than in R.
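As a quick check of that comparison, the two user times reported above can be divided directly. This is just a small arithmetic sketch; the variable names are my own.

```shell
shell_user=115.666   # 1m55.666s user time for the cut | sort | uniq pipeline
r_user=438.322       # user time reported by system.time() in R

# awk does the floating point arithmetic; LC_ALL=C keeps the decimal point a period.
r_minutes=$(LC_ALL=C awk -v r="$r_user" 'BEGIN { printf "%.1f", r / 60 }')
ratio=$(LC_ALL=C awk -v r="$r_user" -v s="$shell_user" 'BEGIN { printf "%.2f", r / s }')

echo "R user time: $r_minutes minutes; R took $ratio times as long as the shell"
```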
Part 1 (ii): Compute the total number of flights in and out of the five airports, i.e., the sum of both the inbound and outbound flights. You can do this however you want using a mix of the shell and R code. One way is to first obtain the lines in the files which involve any of these five airports. Then obtain a count for each pair of airports, i.e., ORIGIN, DESTINATION pairs. At most, how many will there be? Then read these counts by ORIGIN, DESTINATION pairs into R and compute the total number of flights for each of the 5 airports.
First, the code that was done in shell:
#This grabs every line that mentions one of the five airport codes from all the csv files:
egrep 'OAK|SMF|LAX|SFO|JFK' 201[23]*.csv > 12_13.txt
#Then this takes the content from the file created above,
#pulls out both the origin and destination columns, and sorts the counts
cut -d , -f 15,25 12_13.txt | sort | uniq -c | sort -nr > ORIGIN_DEST.txt
So the counts for all five airports, whether as a destination or an origin, are all in the file above. I will now use R to finish cleaning up the data and get the counts for each of the five airports.
setwd("~/")
data = readLines("ORIGIN_DEST.txt")
#Grab the lines containing each of the 3 letter airport codes out of the file
LAX = data[grep("LAX", data)]
JFK = data[grep("JFK", data)]
SMF = data[grep("SMF", data)]
SFO = data[grep("SFO", data)]
OAK = data[grep("OAK", data)]
#Name regular expression to make it easier for coding later
regex = '([^0-9])'
#Substitute everything that is not a digit with nothing to just obtain the numbers
LAX_NUM = gsub(regex, '', LAX)
#Sum up those numbers obtained
LAX_TOTAL = sum(as.numeric(LAX_NUM))
#Do the same thing as above for the rest of the airports
JFK_NUM = gsub(regex, '', JFK)
JFK_TOTAL = sum(as.numeric(JFK_NUM))
OAK_NUM = gsub(regex, '', OAK)
OAK_TOTAL = sum(as.numeric(OAK_NUM))
SMF_NUM = gsub(regex, '', SMF)
SMF_TOTAL = sum(as.numeric(SMF_NUM))
SFO_NUM = gsub(regex, '', SFO)
SFO_TOTAL = sum(as.numeric(SFO_NUM))
#List each of the airports with their sums from above
A = list(LAX=LAX_TOTAL, SFO=SFO_TOTAL, JFK=JFK_TOTAL, OAK=OAK_TOTAL, SMF=SMF_TOTAL)
#Make result from above into data frame for readability
result_air = as.data.frame(A)
And the output for this is:
   LAX    SFO    JFK   OAK   SMF
444080 339463 210175 89820 86293
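The step of adding the inbound and outbound counts could also stay in the shell. Below is a sketch with awk, assuming the `count "ORIGIN","DEST"` line layout that `uniq -c` produces for ORIGIN_DEST.txt. The pair counts in the toy file are invented for illustration and are not the real data.

```shell
# Toy file in the same layout as ORIGIN_DEST.txt (made-up counts).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
     30 "LAX","SFO"
     25 "SFO","LAX"
     10 "JFK","LAX"
      7 "OAK","SEA"
      4 "ORD","SMF"
EOF

# For every pair, credit the count to the origin and to the destination
# whenever that airport is one of the five of interest.
totals=$(
  awk '
    BEGIN {
      n = split("LAX SFO JFK OAK SMF", want, " ")
      for (i = 1; i <= n; i++) keep[want[i]] = 1
    }
    {
      count = $1
      pair = $2
      gsub(/"/, "", pair)      # drop the quotes around the codes
      split(pair, ap, ",")     # ap[1] = origin, ap[2] = destination
      if (ap[1] in keep) total[ap[1]] += count
      if (ap[2] in keep) total[ap[2]] += count
    }
    END { for (i = 1; i <= n; i++) print want[i], total[want[i]] + 0 }
  ' "$tmpfile"
)
echo "$totals"
rm -f "$tmpfile"
```

Here the origin and destination are compared as whole fields, so a flight is credited to an airport only when that airport really is one end of the pair.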
Baseball with SQL
Number 1
What years does the data cover? Are there data for each of these years?
#Find the range of years in the data. This function is from Nick's OH's
year_range = function(tbl, db){
  query = 'SELECT yearID FROM '
  query = paste0(query, tbl, ';')
  cat(paste0('Query:', query, '\n'))
  # Use tryCatch() to catch errors.
  tryCatch(dbGetQuery(db, query), error = function(e) NULL)
}
tables = dbListTables(db)
years = lapply(tables, year_range, db)
u = unlist(years)
first_year = min(u)
last_year = max(u)
So the data begins in the year:
>first_year
1871
And ends in the year:
>last_year
2013
#So the data ranges from 1871-2013
Are there data for all the years?
#If this returns TRUE, then yes:
all(seq(min(u), max(u)) %in% u)
[1] TRUE
So there are data for each of the years in the set.
Number 2
How many (unique) people are included in the database? How many are players, managers, etc?
#help from Piazza was given to make sure count was correct
number_unique = function(tbl, db){
  query = 'SELECT playerID FROM '
  query = paste0(query, tbl, ';')
  cat(paste0('Query:', query, '\n'))
  # Use tryCatch() to catch errors.
  tryCatch(dbGetQuery(db, query), error = function(e) NULL)
}
uni_people = lapply(tables, number_unique, db)
playerIDs = unlist(uni_people)
unique_people = length(unique(playerIDs))
unique_people
managers = dbGetQuery(db, 'SELECT COUNT(DISTINCT playerID) FROM Managers')
managers
Number_players = unique_people - managers
I found 18359 unique people in the set: 682 managers and 17677 players. Some players were managers at some point, so there might be some overlap in the tables, but I believe that this is a very good estimate of the number of players and managers from all of the years.
Number 3
What team won the World Series in 2000?
Win_2000= dbGetQuery(db, "SELECT name FROM Teams WHERE WSWin = 'Y' and yearID = '2000';")
>Win_2000
              name
1 New York Yankees
The winner of the World Series in 2000 was the New York Yankees.
Number 4
What team lost the World Series each year?
World_Series_Losers = dbGetQuery(db, "SELECT yearID, name FROM Teams WHERE LGWin = 'Y' and WSWin = 'N' GROUP BY yearID;")
For the sake of saving paper, I will just show the first 5 years of World Series losers and the last 5 years:
    yearID name
1     1884 New York Metropolitans
2     1885 Chicago White Stockings
3     1886 Chicago White Stockings
4     1887 St. Louis Browns
5     1888 St. Louis Browns
...    ... ...
112   2009 Philadelphia Phillies
113   2010 Texas Rangers
114   2011 Texas Rangers
115   2012 Detroit Tigers
116   2013 St. Louis Cardinals
Number 5
Do you see a relationship between the number of games won in a season and winning the World Series?
#I received a lot of help from Charles on this problem.
World_Series_Winners = dbGetQuery(db, "SELECT WSWin,teamID,W, yearID FROM Teams WHERE LGWin = 'Y' AND WSWin = 'Y' GROUP BY TeamID;")
World_Series_win = World_Series_Winners[,3]
World_Series_year = World_Series_Winners[,4]
plot(World_Series_year, World_Series_win, xlab="Year", ylab="Number of Wins")
World_Series_Losers = dbGetQuery(db, "SELECT WSWin,teamID,W, yearID FROM Teams WHERE WSWin = 'N' GROUP BY TeamID;")
World_Series_win2 = World_Series_Losers[,3]
World_Series_year2 = World_Series_Losers[,4]
plot(World_Series_year2, World_Series_win2, xlab="Year", ylab="Number of Wins")
So there does seem to be a relationship between the number of wins and winning the World Series, since the plots differ substantially. The teams that won the World Series clearly have a higher number of wins. Just to make sure that my plot reading skills are correct, I computed the mean and median for the number of wins in each category:
median(World_Series_win)
95
mean(World_Series_win)
95.16
median(World_Series_win2)
70
mean(World_Series_win2)
67
So this confirms my point from above that there is a higher number of wins for the teams that won the World Series.
Number 6
In 2003, what were the three highest salaries?
high_salary = dbGetQuery(db, "SELECT salary FROM Salaries WHERE yearID = '2003' ORDER BY salary DESC limit 3; ")
>high_salary
    salary
1 22000000
2 20000000
3 18700000
So the three highest salaries are: $22,000,000, $20,000,000, and $18,700,000.
Number 7
For 1999, compute the total payroll of each of the different teams. Next compute the team payrolls for all years in the database for which we have salary information.
Payroll_99= dbGetQuery(db, "SELECT teamID,sum(salary) FROM Salaries WHERE yearID='1999' GROUP BY teamID;")
So the payroll for the year 1999 for each of the teams is:
>Payroll_99
   teamID sum(salary)
1     ANA    55388166
2     ARI    68703999
3     ATL    73140000
4     BAL    80605863
5     BOS    63497500
6     CHA    25620000
7     CHN    62343000
8     CIN    33962761
9     CLE    72978462
10    COL    61935837
11    DET    36489666
12    FLO    21085000
13    HOU    54914000
14    KCA    26225000
15    LAN    80862453
16    MIL    43377395
17    MIN    21257500
18    MON    17903000
19    NYA    86734359
20    NYN    65092092
21    OAK    24431833
22    PHI    31692500
23    PIT    24697666
24    SDN    49768179
25    SEA    54125003
26    SFN    46595057
27    SLN    49778195
28    TBA    38870000
29    TEX    76709931
30    TOR    45444333
Payroll= dbGetQuery(db, "SELECT teamID,sum(salary), yearID FROM Salaries GROUP BY teamID, yearID;")
To save paper, I will just display the first 5 rows and the last 5 rows of team payrolls:
    teamID sum(salary) yearID
1      ATL    14807000   1985
2      BAL    11560712   1985
3      BOS    10897560   1985
4      CAL    14427894   1985
5      CHA     9846178   1985
...    ...         ...    ...
824    SLN    92260110   2013
825    TBA    52955272   2013
826    TEX   112522600   2013
827    TOR   126288100   2013
828    WAS   113703270   2013
Number 8
Study the change in salary over time. Have salaries kept up with inflation, fallen behind, or grown faster?
#bring all dollars to 1985 dollars.
CPI = read.table("CPI.txt")
#Scale the salaries by the CPI values and then plot.
#Much help from Nick was received to answer this question
year_salary = dbGetQuery(db, "SELECT yearID, sum(salary) AS salary FROM Salaries GROUP BY yearID")
CPI2 = year_salary/CPI
CPI2 = t(CPI2) #Take the transpose so it will work
year = as.vector(year_salary$yearID) #For the lines statement
plot(year_salary, type='l', col="blue", lwd=2.5, xlab="Year", ylab="Salary")
lines(year, CPI2, type='l', col="red", lwd=2.5)
legend(1985, 3.0e+09, c("Salary","Salary with no inflation"), # puts text in the legend
  lty=c(1,1), # gives the legend appropriate symbols (lines)
  lwd=c(1,1), col=c("blue","red"), cex=.55)
Above is a plot which shows the increase in salary since 1985 (in blue), and the increase in salary if the salary were computed in 1985 dollars (in red). Another way of explaining the red line is that it is the salary with inflation removed. It is clear that the salary for MLB players is increasing much faster than inflation, so players are getting paid a lot more money than they did in 1985.
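To make the idea of "1985 dollars" concrete: a nominal amount is scaled by the ratio of the base-year price index to that year's index. The sketch below shows the arithmetic with awk. Every number in the toy table is invented for illustration (these are NOT real CPI or salary figures), and the three-column layout is an assumption, not the layout of CPI.txt.

```shell
# Columns: year, nominal total salary, price index (all values made up).
cpi_file=$(mktemp)
cat > "$cpi_file" <<'EOF'
1985 1000000 100
2000 3000000 150
2013 6000000 200
EOF

base_index=100   # index value of the base year (1985 in this toy table)

# real salary = nominal salary * base_index / index of that year
adjusted=$(LC_ALL=C awk -v base="$base_index" \
  '{ printf "%d %.0f\n", $1, $2 * base / $3 }' "$cpi_file")
echo "$adjusted"
rm -f "$cpi_file"
```

In this toy table the nominal salary grows sixfold while the adjusted salary grows threefold, which is the kind of gap between the blue and red lines that indicates salaries outpacing inflation.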
Number 9
Compare payrolls for the teams that are in the same leagues, and then in the same divisions. Are there any interesting characteristics? Have certain teams always had top payrolls over the years? Is there a connection between payroll and performance?
#American League Salary
library(reshape)
library(reshape2)
American_L_Sal = dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM Salaries WHERE lgID= 'AL' GROUP BY teamID, yearID")
Teams = unique(American_L_Sal[,1])
names(American_L_Sal) = c("Team", "Salary", "Year")
m = melt(American_L_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="American League")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)
Above, we can see the salaries for each of the American League teams. This graph and the ones following took me several hours and I am very proud of them. The graph above is not the easiest to read, but I think it is still plenty readable. We can see that the highest paid team in the American League in recent years is undoubtedly the Texas Rangers.
#National League Salary
National_L_Sal = dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM Salaries WHERE lgID= 'NL' GROUP BY teamID, yearID")
Teams = unique(National_L_Sal[,1])
names(National_L_Sal) = c("Team", "Salary", "Year")
m = melt(National_L_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=55:68, xlab="Year", ylab="Salary", main="Nat. League")
#Use the same colors in the legend as in the plot so the lines and labels match
legend("topleft", legend=Teams, cex=.4, col=55:68, pch=.5, lty=.5, lwd=1)
Above we can see that, for most of the last five years, the LA Dodgers have been the highest paid team in the National League, until the last 2 years, where we can see that the New York Yankees' salaries have risen greatly.
#Now the divisions:
#American League West: Used Nick's code from discussion, received help from Michael in OH's
American_LW_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams as b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID= 'AL' AND b.divID= 'W'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(American_LW_Sal[,1])
names(American_LW_Sal) = c("Team", "Salary", "Year")
m = melt(American_LW_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL WEST")
legend("topleft", legend=Teams, cex=.6, col=50:60, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Seattle Mariners for the AL West.
#American League East
American_LE_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams as b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID= 'AL' AND b.divID= 'E'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(American_LE_Sal[,1])
names(American_LE_Sal) = c("Team", "Salary", "Year")
m = melt(American_LE_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL EAST")
legend("topleft", legend=Teams, cex=.6, col=50:60, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the NY Yankees for the AL East.
#American League Central
American_LC_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams as b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID= 'AL' AND b.divID= 'C'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(American_LC_Sal[,1])
names(American_LC_Sal) = c("Team", "Salary", "Year")
m = melt(American_LC_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="AL Central")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Kansas City Royals for the AL Central.
#National League West
National_LW_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams as b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID= 'NL' AND b.divID= 'W'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(National_LW_Sal[,1])
names(National_LW_Sal) = c("Team", "Salary", "Year")
m = melt(National_LW_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL WEST")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Arizona Diamondbacks for the NL West, but the SF Giants seem to have become the highest from 2012-2013.
#National League East
National_LE_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams as b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID= 'NL' AND b.divID= 'E'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(National_LE_Sal[,1])
names(National_LE_Sal) = c("Team", "Salary", "Year")
m = melt(National_LE_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL EAST")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Miami Marlins for the NL East.
#National League Central
National_LC_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams as b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID= 'NL' AND b.divID= 'C'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(National_LC_Sal[,1])
names(National_LC_Sal) = c("Team", "Salary", "Year")
m = melt(National_LC_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL Central")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Chicago Cubs for the NL Central, but the Cincinnati Reds seem to have become the highest from 2012-2013.
Number 10
Has the distribution of home runs for players increased over the years?
#Number 10
home_run = dbGetQuery(db, "SELECT yearID,HR AS homerun FROM Batting ")
a = split(home_run$homerun, home_run$yearID)
boxplot(a, outwex=.2, outline=FALSE)
From the plot, we can see that the distribution of home runs has changed over the years, with a peak in the 90's and early 2000's, most likely due to steroid use being unregulated.
BONUS QUESTIONS!
Number 1
Have the RBIs in the last 13 years gone down due to the rules against steroid use being enforced?
RBI = dbGetQuery(db, "SELECT yearID,sum(RBI) FROM Batting WHERE yearID BETWEEN 2000 AND 2013 GROUP BY yearID ")
plot(RBI, type='l', ylab="RBI's", xlab="Year", main="RBI's since 2000")
#Verify that batting has worsened with the decline in steroid use by looking at home runs
home_runs = dbGetQuery(db, "SELECT yearID,sum(HR) AS homerun FROM Batting WHERE yearID BETWEEN 2000 AND 2013 GROUP BY yearID")
plot(home_runs, type='l', ylab="Homeruns", xlab="Year", main="Homeruns since 2000")
So it is apparent that both RBIs and home runs have gone down recently, and this is likely due to the decrease in steroid use over the past 10-15 years.
Number 2
Look at the number of strikeouts over the years. Have pitchers gotten better?
strike_outs = dbGetQuery(db, "SELECT yearID,sum(SO) AS strikeouts FROM Pitching GROUP BY yearID")
plot(strike_outs, type='l', ylab="Strikeouts", xlab="Years", main="Strikeouts over the years")
It does appear that pitchers have gotten better over the years, but that could also mean that batters have gotten worse while pitchers remained the same.
Number 3
Who are the 5 top managers (managerID's) with the highest number of wins in this dataset?
top_managers= dbGetQuery(db, "SELECT playerID,yearID,W AS wins FROM Managers ORDER BY W DESC limit 5")
top_managers
#From baseballreference.com
#Frank Chance, Lou Piniella, Joe Torre, Al Lopez, Fred Clarke
Number 4
In which years were there tie games, and how many were there?
dbGetQuery(db, "SELECT sum(ties), yearID AS year FROM SeriesPost WHERE ties=1 GROUP BY yearID;")
  sum(ties) year
1         1 1885
2         1 1890
3         1 1892
Number 5
Who are all of the pitchers in the MLB for the year 2013 and which team did they play for?
pitchers=dbGetQuery(db, "SELECT teamID, playerID,yearID, Pos FROM FieldingPost WHERE yearID=2013 AND Pos='P';")
The query returns 166 rows, and a pitcher can appear more than once (for example, albural01 in rows 1 and 2), so I will just list the first 5 and the last 5 rows:
    teamID  playerID yearID POS
1      DET albural01   2013   P
2      DET albural01   2013   P
3      CLE allenco01   2013   P
4      DET alvarjo02   2013   P
5      OAK anderbr04   2013   P
...    ...       ...    ... ...
162    BOS workmbr01   2013   P
163    BOS workmbr01   2013   P
164    BOS workmbr01   2013   P
165    TBA wrighja01   2013   P
166    TBA wrighwe01   2013   P