SQL and Shell Baseball Analysis


Upload: austin-kinion

Post on 24-Dec-2015


DESCRIPTION

A brief analysis of airport data with Shell. A brief analysis of baseball with SQL.

TRANSCRIPT

Page 1: SQL and Shell Baseball Analysis

Shell and SQL Baseball and Airport Analysis
Austin Kinion

Part 1 (i): Compute the number of outbound flights for each of the five airports OAK, SMF, LAX, SFO and JFK, and sort these counts from largest to smallest.

Done in Shell

#Get the data from the website
$ curl -O http://eeyore.ucdavis.edu/stat141/Data/Airline2012_13.tar.gz

#Output showing how long the download took
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291M  100  291M    0     0  8803k      0  0:00:33  0:00:33 --:--:-- 7512k

So it took 33 seconds to get the data from the website in Shell.

#Extract each of the monthly csv files
tar -zxvf Airline2012_13.tar.gz

#Create a file named ORIGIN.txt with the origin column pulled out of every month and counted
time cut -d , -f 15 201[23]_*.csv | sort | uniq -c | sort -nr > ORIGIN.txt

The time it took was:
user    1m55.666s

#Grab the five airports and their counts out of the ORIGIN file (already ordered by count)
#and put them into a new file, ORIGIN2.txt
time egrep 'OAK|SMF|LAX|SFO|JFK' ORIGIN.txt > ORIGIN2.txt

The time it took was:
real    0m0.004s

So that's a total time of about 2 minutes for the shell.

Taking a look at the file created from shell:

File: ORIGIN2.txt:
222029 "LAX"
169734 "SFO"
105097 "JFK"
 44911 "OAK"
 43145 "SMF"
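The counting pipeline can be tried on a small scale before running it on the full data. This is only a minimal sketch, assuming (as in the command above) that the origin code sits in field 15. The sample rows and the variable names `sample` and `counts` are made up for illustration.

```shell
# Build a tiny made-up CSV where field 15 is the quoted origin airport.
sample=$(mktemp)
for origin in LAX SFO LAX JFK LAX SFO; do
  # 14 placeholder fields, then the origin code in field 15
  printf 'x,x,x,x,x,x,x,x,x,x,x,x,x,x,"%s"\n' "$origin"
done > "$sample"

# Same idea as the pipeline above: pull out field 15, count each value,
# and sort the counts from largest to smallest.
# The final awk only strips the padding that uniq -c adds.
counts=$(cut -d , -f 15 "$sample" | sort | uniq -c | sort -nr | awk '{print $1, $2}')
echo "$counts"
rm -f "$sample"
```

Checking the pipeline on a handful of rows like this makes it easy to confirm the field number before spending two minutes on the real files.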


Part 1 (i): Done in R

setwd("~/Desktop/Airline2012_13")
list.files()
#concatenate all five airport names into variable: airports
airports = c("LAX", "SFO", "JFK", "OAK", "SMF")

#initialize origincounts as a vector of five 0's to fill in later
origincounts = numeric(5)

#Use filenames to later iterate over all files
filenames = list.files()

#loop over all csv files and pull out origin column
system.time(for (i in 1:length(filenames)) {
  #Read in the current csv file
  cur.csv = read.csv(filenames[i])
  #Create a table with the origin counts for each airport
  origintable = table(cur.csv$ORIGIN)
  #Add the counts for just the five airports that were asked for
  origincounts = origincounts + origintable[airports]
})

The output matches the output from shell, which is a very good thing.

#R output to show it matches the shell output:
origincounts
#   LAX    SFO    JFK    OAK    SMF
#222029 169734 105097  44911  43145

And the time it took was:

#   user  system elapsed
#438.322  15.523 458.014

So, if we use the user time, that is about 7.3 minutes, which is over 3 times as long as the time it took in shell. So I have determined that when you need to grep a few items out of a very large dataset, it is faster to do it in Shell than in R.

Part 1 (ii) Compute the total number of flights in and out of the five airports, i.e., the sum of both the inbound and outbound flights. You can do this however you want using a mix of the shell and R code. One way is to first obtain the lines in the files which involve any of these five airports. Then obtain a count for each pair of airports, i.e., ORIGIN, DESTINATION pairs. At most, how many will there be? Then read these counts by ORIGIN, DESTINATION pairs into R and compute the total number of flights for each of the 5 airports.

First, the code that was done in shell:

#This grabs every line that mentions one of the airport codes out of all the csv files from the data:
egrep 'OAK|SMF|LAX|SFO|JFK' 201[23]*.csv > 12_13.txt

#Then this takes the content from the file created above,
#pulls out both the origin and destination columns, and sorts the counts
cut -d , -f 15,25 12_13.txt | sort | uniq -c | sort -nr > ORIGIN_DEST.txt

So the counts for all five airports, whether as a destination or as an origin, are all in the file above. I will now use R to finish cleaning up the data and get the counts for each of the five airports.

setwd("~/")
data = readLines("ORIGIN_DEST.txt")

#Grab the lines containing each of the 3 letter airport codes out of the file
LAX = data[grep("LAX", data)]
JFK = data[grep("JFK", data)]
SMF = data[grep("SMF", data)]
SFO = data[grep("SFO", data)]
OAK = data[grep("OAK", data)]

#Name the regular expression to make it easier for coding later
regex = '([^0-9])'

#Replace everything that is not a digit with nothing to obtain just the counts
LAX_NUM = gsub(regex, '', LAX)
#Sum up those numbers obtained
LAX_TOTAL = sum(as.numeric(LAX_NUM))

#Do the same thing as above for the rest of the airports
JFK_NUM = gsub(regex, '', JFK)
JFK_TOTAL = sum(as.numeric(JFK_NUM))

OAK_NUM = gsub(regex, '', OAK)
OAK_TOTAL = sum(as.numeric(OAK_NUM))

SMF_NUM = gsub(regex, '', SMF)
SMF_TOTAL = sum(as.numeric(SMF_NUM))

SFO_NUM = gsub(regex, '', SFO)
SFO_TOTAL = sum(as.numeric(SFO_NUM))

#List each of the airports with their sums from above
A = list(LAX=LAX_TOTAL, SFO=SFO_TOTAL, JFK=JFK_TOTAL, OAK=OAK_TOTAL, SMF=SMF_TOTAL)
#Make result from above into data frame for readability
result_air = as.data.frame(A)

And the output for this is:

   LAX    SFO    JFK   OAK   SMF
444080 339463 210175 89820 86293
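The same in-plus-out totals can also be computed without leaving the shell. The sketch below assumes the pair file has the layout that `uniq -c` produces from two quoted columns (`count "ORIGIN","DEST"`). The sample counts and the variable names `pairs` and `totals` are invented for illustration and are not taken from ORIGIN_DEST.txt.

```shell
# Hypothetical pair counts in the layout uniq -c would produce.
pairs='  120 "LAX","SFO"
  100 "SFO","LAX"
   40 "JFK","LAX"
   30 "OAK","SEA"
    7 "SEA","BOS"'

# Add each pair count to its origin and to its destination,
# then print the totals for the five airports of interest.
totals=$(printf '%s\n' "$pairs" | awk '
  {
    gsub(/"/, "", $2)            # strip the quotes around the codes
    split($2, ap, ",")           # ap[1] = origin, ap[2] = destination
    total[ap[1]] += $1
    if (ap[2] != ap[1]) total[ap[2]] += $1
  }
  END {
    n = split("LAX SFO JFK OAK SMF", want, " ")
    for (i = 1; i <= n; i++) print want[i], total[want[i]] + 0
  }')
echo "$totals"
```

Keying the sums on the parsed origin and destination fields, rather than on a text match anywhere in the line, avoids counting a line twice when both ends are the same airport.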

 

Baseball with SQL

Number 1

What years does the data cover? Are there data for each of these years?

#Find the range of years in the data. This function is from Nick's office hours.
year_range = function(tbl, db){
  query = 'SELECT yearID FROM '
  query = paste0(query, tbl, ';')
  cat(paste0('Query:', query, '\n'))
  # Use tryCatch() to catch errors.
  tryCatch(dbGetQuery(db, query),
           error = function(e) NULL)
}

tables = dbListTables(db)

years = lapply(tables, year_range, db)
u = unlist(years)

first_year = min(u)
last_year = max(u)

So the data begins in the year:

> first_year
1871

And ends in the year:

> last_year
2013    #So the data ranges from 1871-2013

Are there data for all the years?

#If this returns TRUE, then yes:
all(seq(min(u), max(u)) %in% u)
[1] TRUE

So there are data for each of the years in the set.
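The "is every year present" check can be illustrated outside R as well. This is a hedged sketch on a made-up list of years with one deliberate gap. The variable names `years` and `missing` are hypothetical, and the real check above runs on the yearID values pulled from the database.

```shell
# Made-up list of years with a deliberate gap (1873 is absent).
years='1871
1872
1874
1875'

# Walk the sorted, de-duplicated years and print any year that is skipped.
missing=$(printf '%s\n' "$years" | sort -n -u | awk '
  NR > 1 { for (y = prev + 1; y < $1; y++) print y }
  { prev = $1 }')

if [ -z "$missing" ]; then
  echo "every year is present"
else
  echo "missing years: $missing"
fi
```

An empty `missing` corresponds to the R expression returning TRUE.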

Number 2

How many (unique) people are included in the database? How many are players, managers, etc?

#help from Piazza was given to make sure count was correct
number_unique = function(tbl, db){
  query = 'SELECT playerID FROM '
  query = paste0(query, tbl, ';')
  cat(paste0('Query:', query, '\n'))
  # Use tryCatch() to catch errors.
  tryCatch(dbGetQuery(db, query),
           error = function(e) NULL)
}
uni_people = lapply(tables, number_unique, db)
playerIDs = unlist(uni_people)
unique_people = length(unique(playerIDs))
unique_people
managers = dbGetQuery(db, 'SELECT COUNT(DISTINCT playerID) FROM Managers')
managers
Number_players = unique_people - managers

Found 18359 unique people in the set, 682 managers, and 17677 players. Some players were managers at some point, so there might be some overlap in the tables, but I believe that this is a very good estimate of the number of players and managers from all of the years.

Number 3

What team won the World Series in 2000?

Win_2000 = dbGetQuery(db, "SELECT name FROM Teams WHERE WSWin = 'Y' and yearID = '2000';")

> Win_2000
              name
1 New York Yankees

The winner of the World Series in 2000 was the New York Yankees.

Number 4

What team lost the World Series each year?

World_Series_Losers = dbGetQuery(db, "SELECT yearID, name FROM Teams WHERE LGWin = 'Y' and WSWin = 'N' GROUP BY yearID;")

For the sake of saving paper, I will just show the first five years of World Series losers and the last five years:

    yearID                    name
1     1884  New York Metropolitans
2     1885 Chicago White Stockings
3     1886 Chicago White Stockings
4     1887        St. Louis Browns
5     1888        St. Louis Browns
...    ...                     ...
112   2009   Philadelphia Phillies
113   2010           Texas Rangers
114   2011           Texas Rangers
115   2012          Detroit Tigers
116   2013     St. Louis Cardinals

Number 5

Do you see a relationship between the number of games won in a season and winning the World Series?

#I received a lot of help from Charles on this problem.
#Note: GROUP BY TeamID returns one row per team rather than one row per season.
World_Series_Winners = dbGetQuery(db, "SELECT WSWin, teamID, W, yearID FROM Teams WHERE LGWin = 'Y' AND WSWin = 'Y' GROUP BY TeamID;")

World_Series_win = World_Series_Winners[,3]
World_Series_year = World_Series_Winners[,4]

plot(World_Series_year, World_Series_win, xlab="Year", ylab="Number of Wins")

World_Series_Losers = dbGetQuery(db, "SELECT WSWin, teamID, W, yearID FROM Teams WHERE WSWin = 'N' GROUP BY TeamID;")
World_Series_win2 = World_Series_Losers[,3]
World_Series_year2 = World_Series_Losers[,4]

plot(World_Series_year2, World_Series_win2, xlab="Year", ylab="Number of Wins")

So there does seem to be a relationship between the number of wins and winning the World Series, since the plots differ substantially. The teams who won the World Series clearly have a higher number of wins. Just to make sure that my plot reading skills are correct, I computed the mean and median for the number of wins in each category:

median(World_Series_win)
95
mean(World_Series_win)
95.16
median(World_Series_win2)
70
mean(World_Series_win2)
67

So this confirms my point from above that there is a higher number of wins for the teams who won the World Series.

Number 6

In 2003, what were the three highest salaries?

high_salary = dbGetQuery(db, "SELECT salary FROM Salaries WHERE yearID = '2003' ORDER BY salary DESC limit 3;")

> high_salary
    salary
1 22000000
2 20000000
3 18700000

So the three highest salaries are: $22,000,000, $20,000,000, and $18,700,000.
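The ORDER BY ... DESC LIMIT 3 pattern has a direct shell counterpart, which can be handy for spot-checking a column exported to a text file. A minimal sketch, assuming one numeric value per line, with made-up numbers and a hypothetical `values` variable rather than real salary data:

```shell
# Hypothetical values, one per line.
values='300
900
150
700
800
200'

# Sort numerically in descending order and keep the first three,
# mirroring ORDER BY ... DESC LIMIT 3.
top3=$(printf '%s\n' "$values" | sort -nr | head -n 3)
echo "$top3"
```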

Number 7


For 1999, compute the total payroll of each of the different teams. Next compute the team payrolls for all years in the database for which we have salary information.

Payroll_99 = dbGetQuery(db, "SELECT teamID, sum(salary) FROM Salaries WHERE yearID='1999' GROUP BY teamID;")

So the payroll for the year 1999 for each of the teams is:

> Payroll_99

   teamID sum(salary)
1     ANA    55388166
2     ARI    68703999
3     ATL    73140000
4     BAL    80605863
5     BOS    63497500
6     CHA    25620000
7     CHN    62343000
8     CIN    33962761
9     CLE    72978462
10    COL    61935837
11    DET    36489666
12    FLO    21085000
13    HOU    54914000
14    KCA    26225000
15    LAN    80862453
16    MIL    43377395
17    MIN    21257500
18    MON    17903000
19    NYA    86734359
20    NYN    65092092
21    OAK    24431833
22    PHI    31692500
23    PIT    24697666
24    SDN    49768179
25    SEA    54125003
26    SFN    46595057
27    SLN    49778195
28    TBA    38870000
29    TEX    76709931
30    TOR    45444333


Payroll = dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM Salaries GROUP BY teamID, yearID;")

For the sake of saving paper, I will just display the first 5 rows and the last 5 rows:

    teamID sum(salary) yearID
1      ATL    14807000   1985
2      BAL    11560712   1985
3      BOS    10897560   1985
4      CAL    14427894   1985
5      CHA     9846178   1985
...    ...         ...    ...
824    SLN    92260110   2013
825    TBA    52955272   2013
826    TEX   112522600   2013
827    TOR   126288100   2013
828    WAS   113703270   2013
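The grouped sum behind these payroll tables can be mimicked on a flat file with awk, which makes the GROUP BY teamID, yearID step concrete. This is a sketch under stated assumptions: the rows below are invented (team codes AAA and BBB are placeholders, not real teams), and the yearID,teamID,salary column order is chosen only for the example.

```shell
# Made-up rows in the form yearID,teamID,salary (not real Salaries data).
rows='1999,AAA,100
1999,AAA,250
1999,BBB,400
2000,AAA,300
2000,BBB,50
2000,BBB,75'

# Accumulate one running total per (year, team) key, then print the totals.
payroll=$(printf '%s\n' "$rows" | awk -F, '
  { sum[$1 "," $2] += $3 }
  END { for (k in sum) print k "," sum[k] }' | sort)
echo "$payroll"
```

The trailing `sort` is there because awk does not guarantee any order when looping over an array.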

 

Number 8

Study the change in salary over time. Have salaries kept up with inflation, fallen behind, or grown faster?

 

#Bring all dollars to 1985 dollars.
CPI = read.table("CPI.txt")

#Adjust the salary by the CPI values and then plot.
#Much help from Nick was received to answer this question
year_salary = dbGetQuery(db, "SELECT yearID, sum(salary) AS salary FROM Salaries GROUP BY yearID")

CPI2 = year_salary/CPI
CPI2 = t(CPI2) #Take the transpose so it will work
year = as.vector(year_salary$yearID) #For the lines statement

plot(year_salary, type='l', col="blue", lwd=2.5, xlab="Year", ylab="Salary")
lines(year, CPI2, type='l', col="red", lwd=2.5)
legend(1985, 3.0e+09,
       c("Salary", "Salary with no inflation"), # puts text in the legend
       lty=c(1,1), # gives the legend appropriate symbols (lines)
       lwd=c(1,1), col=c("blue","red"), cex=.55)


 

Above is a plot which shows the increase in salary since 1985 (in blue), and the increase in salary if the salary were computed in 1985 dollars (in red). Another way of explaining the red line is that it is the salary with inflation removed. It is clear that the salary for MLB players is increasing much faster than inflation, so players are getting paid a lot more money than they did in 1985.
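The adjustment to base-year dollars is a single ratio: the nominal amount times the base-year index divided by the index for the year the money was paid. The sketch below works one example with hypothetical index values (100 and 200) and a made-up salary. None of these numbers come from CPI.txt, whose layout is not shown here.

```shell
# Hypothetical index values: 100 in the base year, 200 in the later year.
base_cpi=100
later_cpi=200
nominal=1000000   # a made-up salary paid in the later year

# Convert to base-year dollars: nominal * (base index / later index).
real=$(awk -v n="$nominal" -v b="$base_cpi" -v c="$later_cpi" \
  'BEGIN { printf "%.0f\n", n * b / c }')
echo "$real"
```

With the index doubling, the later-year salary is worth half as much in base-year dollars. A salary series that still rises after this adjustment has grown faster than inflation.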

Number 9

Compare payrolls for the teams that are in the same leagues, and then in the same divisions. Are there any interesting characteristics? Have certain teams always had top payrolls over the years? Is there a connection between payroll and performance?

#American League Salary
library(reshape)
library(reshape2)
American_L_Sal = dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM Salaries WHERE lgID = 'AL' GROUP BY teamID, yearID")

Teams = unique(American_L_Sal[,1])
names(American_L_Sal) = c("Team", "Salary", "Year")
m = melt(American_L_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="American League")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)

Above, we can see the salaries for each of the American League teams. This graph and the ones following took me several hours and I am very proud of them. The graph above is not the easiest to read, but I think it is still plenty readable. We can see that the highest paid team in the American League (in recent years) is undoubtedly the Texas Rangers.

 

 

 

#National League Salary
National_L_Sal = dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM Salaries WHERE lgID = 'NL' GROUP BY teamID, yearID")

Teams = unique(National_L_Sal[,1])
names(National_L_Sal) = c("Team", "Salary", "Year")
m = melt(National_L_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)
#Use the same color range in the plot and the legend so the lines and labels match
matplot(c, type="l", col=55:68, xlab="Year", ylab="Salary", main="Nat. League")
legend("topleft", legend=Teams, cex=.4, col=55:68, pch=.5, lty=.5, lwd=1)

Above we can see that, for most of the last five years, the LA Dodgers have been the highest paid team in the National League, until the last 2 years, where we can see that the New York Yankees' salaries have increased greatly.

 

 

#Now the divisions:

#American League West: Used Nick's code from discussion, received help from Michael in office hours
American_LW_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
    AND a.lgID = 'AL' AND b.divID = 'W'
  GROUP BY a.yearID, a.teamID;")

Teams = unique(American_LW_Sal[,1])

names(American_LW_Sal) = c("Team", "Salary", "Year")
m = melt(American_LW_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)

matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL WEST")
legend("topleft", legend=Teams, cex=.6, col=50:60, pch=.5, lty=.5, lwd=1)

Highest paid team in the last 5 years looks to be the Seattle Mariners for AL West.

 

 

 

 

 

 

#American League East
American_LE_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
    AND a.lgID = 'AL' AND b.divID = 'E'
  GROUP BY a.yearID, a.teamID;")

Teams = unique(American_LE_Sal[,1])
names(American_LE_Sal) = c("Team", "Salary", "Year")

m = melt(American_LE_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)

matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL EAST")
legend("topleft", legend=Teams, cex=.6, col=50:60, pch=.5, lty=.5, lwd=1)

Highest paid team in the last 5 years looks to be the NY Yankees for AL East.

 

 

 

 

 

 

 

 

#American League Central
American_LC_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
    AND a.lgID = 'AL' AND b.divID = 'C'
  GROUP BY a.yearID, a.teamID;")

Teams = unique(American_LC_Sal[,1])
names(American_LC_Sal) = c("Team", "Salary", "Year")

m = melt(American_LC_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)

matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="AL Central")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)

Highest paid team in the last 5 years looks to be the Kansas City Royals for AL Central.

 

 

 

 

 

 

 

#National League West
National_LW_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
    AND a.lgID = 'NL' AND b.divID = 'W'
  GROUP BY a.yearID, a.teamID;")

Teams = unique(National_LW_Sal[,1])
names(National_LW_Sal) = c("Team", "Salary", "Year")

m = melt(National_LW_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)

matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL WEST")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)

Highest paid team in the last 5 years looks to be the Arizona Diamondbacks for NL West, but the SF Giants seem to have become the highest from 2012-2013.

 

 

 

 

 

 

#National League East
National_LE_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
    AND a.lgID = 'NL' AND b.divID = 'E'
  GROUP BY a.yearID, a.teamID;")

Teams = unique(National_LE_Sal[,1])
names(National_LE_Sal) = c("Team", "Salary", "Year")

m = melt(National_LE_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)

matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL EAST")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)

Highest paid team in the last 5 years looks to be the Miami Marlins for NL East.

 

 

 

 

 

 

 

#National League Central
National_LC_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
    AND a.lgID = 'NL' AND b.divID = 'C'
  GROUP BY a.yearID, a.teamID;")

Teams = unique(National_LC_Sal[,1])
names(National_LC_Sal) = c("Team", "Salary", "Year")

m = melt(National_LC_Sal, id=c("Team", "Year"))
c = cast(m, Year~Team)

matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL Central")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)

Highest paid team in the last 5 years looks to be the Chicago Cubs for NL Central, but the Cincinnati Reds seem to have become the highest from 2012-2013.

 

 

 

 

 

 

 

Number 10

Has the distribution of home runs for players increased over the years?

#Number 10
home_run = dbGetQuery(db, "SELECT yearID, HR AS homerun FROM Batting")

a = split(home_run$homerun, home_run$yearID)
boxplot(a, outwex=.2, outline=FALSE)

From the plot, we can see that the distribution of home runs HAS changed over the years, with a peak in the 90's and early 2000's, most likely due to steroid use being unregulated.

BONUS QUESTIONS!

Number 1

Have the RBIs in the last 13 years gone down due to steroid rules being enforced?

RBI = dbGetQuery(db, "SELECT yearID, sum(RBI) FROM Batting WHERE yearID BETWEEN 2000 AND 2013 GROUP BY yearID")
plot(RBI, type='l', ylab="RBI's", xlab="Year", main="RBI's since 2000")

#Verify that batting has worsened with the decline in steroid use by looking at home runs
home_runs = dbGetQuery(db, "SELECT yearID, sum(HR) AS homerun FROM Batting WHERE yearID BETWEEN 2000 AND 2013 GROUP BY yearID")
plot(home_runs, type='l', ylab="Homeruns", xlab="Year", main="Homeruns since 2000")

So it is apparent that the RBIs have gone down recently, as has the number of home runs, and this is likely due to the decrease in steroid use over the past 10-15 years.

Number 2

Look at the number of strikeouts over the years. Have pitchers gotten better?

strike_outs = dbGetQuery(db, "SELECT yearID, sum(SO) AS strikeouts FROM Pitching GROUP BY yearID")
plot(strike_outs, type='l', ylab="Strikeouts", xlab="Years", main="Strikeouts over the years")

It does appear that pitchers have gotten better over the years, but that could also mean that batters have gotten worse while pitchers remained the same.

Number 3

Who are the 5 top managers (managerID's) with the highest number of wins in this dataset?

#Note: W is a season total, so this ranks the five best single seasons
top_managers = dbGetQuery(db, "SELECT playerID, yearID, W AS wins FROM Managers ORDER BY W DESC limit 5")

top_managers

#From baseballreference.com
#Frank Chance, Lou Piniella, Joe Torre, Al Lopez, Fred Clarke

Number 4

In which years were there tie games, and how many were there?

dbGetQuery(db, "SELECT sum(ties), yearID AS year FROM SeriesPost WHERE ties=1 GROUP BY yearID;")

  sum(ties) year
1         1 1885
2         1 1890
3         1 1892

Number 5

Who are all of the pitchers in the MLB for the year 2013 and which team did they play for?

#Note: FieldingPost holds postseason fielding records, so this lists pitchers who appeared in the 2013 postseason
pitchers = dbGetQuery(db, "SELECT teamID, playerID, yearID, Pos FROM FieldingPost WHERE yearID=2013 AND Pos='P';")

The query returns 166 rows (some pitchers appear more than once), so I will just list the first 5 and the last 5:

    teamID  playerID yearID POS
1      DET albural01   2013   P
2      DET albural01   2013   P
3      CLE allenco01   2013   P
4      DET alvarjo02   2013   P
5      OAK anderbr04   2013   P
...    ...       ...    ...  ..
162    BOS workmbr01   2013   P
163    BOS workmbr01   2013   P
164    BOS workmbr01   2013   P
165    TBA wrighja01   2013   P
166    TBA wrighwe01   2013   P