# Data Analysis Methods

Post on 19-Nov-2015

DESCRIPTION

Analyzing the Flight Data

TRANSCRIPT

<ul><li><p>DATA ANALYSIS METHODS Mid Term </p><p>Onkar Deshmukh M06156153 </p><p>deshmuop@mail.uc.edu </p><p>Abstract </p></li><li><p>DATA ANALYSIS METHODS </p><p>Table of Contents </p><p>1. Purpose </p><p>2. Understanding the data </p><p>3. Data Cleansing </p><p>4. Plots and Data Visualization </p><p>5. Correlation and Variable Selection </p><p>6. Effect of Variables on landing distance </p><p>6.1 Effect of Aircraft make on landing distance </p><p>6.2 Effect of speed_ground on landing distance </p><p>7. Model Fitting and Regression Analysis </p><p>7.1 Initial Model with all variables </p><p>7.2 Model with only aircraftmodel, speed_ground and height </p><p>7.3 Squaring the values for speed_ground and Improving Model </p><p>8. 
Summary </p></li><li><p>1. Purpose </p><p>The purpose of this project is to analyze the given landing.csv data, study which factors affect the landing distance and how, and arrive at a best-fit model of the factors affecting landing distance. </p><p>2. Understanding the data </p><p>Objective: We need to understand the data given to us. This section gives a basic overview of the data: what kinds of values we have, how many variables there are, and their summary statistics such as mean, minimum and maximum. Histograms help us understand the frequency distribution of each variable. </p><p>R Code: </p><p>> FlightData dim(FlightData) </p><p>Output: </p><p>[1] 800 8 </p><p>R Code: </p><p>> summary(FlightData) </p><p>> attach(FlightData) </p><p>Output: </p><p>> par(mfrow= c(2,4)) </p><p>> hist(duration) </p><p>> hist(no_pasg) </p><p>> hist(speed_ground) </p><p>> hist(speed_air) </p><p>> hist(height) </p><p>> hist(pitch) </p><p>> hist(distance) </p></li><li><p>Observations: </p><p>The given dataset contains 8 variables and 800 observations. </p><p>Two types of aircraft exist in the dataset. </p><p>All the variables except Speed_air are completely populated. Speed_air has 600 N/A values, which we might need to handle separately. </p><p>Looking at the minimum and maximum values of these variables, we see that some observations fall outside the recommended ranges for Duration, Speed_ground, Speed_air, Height and distance. </p><p>Decision: </p><p>We need to cleanse the data, keeping the observations that abide by the rules given in the requirement document and filtering out those that contain values beyond the threshold for a variable. </p></li><li><p>3. 
Data Cleansing </p><p>Objective: As noted in the previous section, the data file has a few observations that are not in line with the recommended threshold values. For our analysis we need to clean up this data. The data cleansing rules specified in the requirement document are: </p><p>Duration of a normal flight should be greater than 40 minutes. </p><p>Speed_ground and Speed_air should be no less than 30 MPH and no greater than 140 MPH. </p><p>Height should be at least 6 meters at the threshold of the runway. </p><p>Landing distance should be less than 6000 feet (the length of the airport runway). </p><p>R Code: </p><p>rule1 rule6 140) > rule7 rule8 6000) > rulei1 rulei2 rulei3 rulei4 rulei5 rulei6 FlightDataClean dim(FlightDataClean) </p><p>> detach(FlightData) </p><p>> attach(FlightDataClean) </p><p>R Output: </p><p>> dim(FlightDataClean) </p><p>[1] 781 8 </p><p>Observations: </p><p>19 observations have been filtered out: 5 have a duration of less than 40 minutes, 2 have speed_ground less than 30 MPH, 1 has speed_ground greater than 140 MPH, 10 have height less than 6 meters, and 1 has a landing distance greater than 6000 feet. </p><p>Conclusion: </p><p>We have cleaner data for our analysis. </p></li><li><p>4. Plots and Data Visualization </p><p>Objective: Graphically understand the impact of the different variables on landing distance. </p><p>R Code: pairs(FlightDataClean) </p><p>Output: </p></li><li><p>Observation: </p><p>Speed_air and speed_ground each show a prominent pattern when plotted against landing distance. </p><p>The other variables don't show a meaningful pattern. </p><p>Conclusion: </p><p>Preliminary graphical analysis suggests that speed_air and speed_ground have an effect on landing distance. </p><p>5. Correlation and Variable Selection </p><p>Objective: Understand correlation between all the variables. 
Also, we need to select the variables that can be used in our linear model. If any of these variables are highly correlated, we can use just one of them. As described in section 2, speed_air has too many missing values, so we need to find out whether an alternate variable that is highly correlated with speed_air can be used in its place. </p><p>R Code: </p><p>> cor(FlightDataClean[,2:8]) </p><p>> cor(FlightDataClean[,2:8],use = "pairwise.complete.obs") </p><p>> plot(speed_ground,speed_air) </p><p>> speed_diff = speed_air - speed_ground </p><p>> summary(speed_diff) </p><p>> hist(speed_diff) </p><p>R Output: </p></li><li><p>Observation: </p><p>Speed_ground and Speed_air have a very strong positive correlation. </p><p>Because of the 600 missing values in Speed_air, its correlation with the other variables can't be determined directly, so we need to drop the N/A values before computing the correlation. </p><p>The plot of Speed_air against Speed_ground is linear. </p><p>Speed_air is N/A for values less than 90 MPH; it is populated only when it is greater than 90 MPH. </p><p>Conclusion: </p><p>The missing values in Speed_air are definitely an issue for the analysis, and we can't just drop this variable. However, Speed_ground has a correlation coefficient of 0.989 with Speed_air, which means that we can use Speed_ground as a substitute for Speed_air in our analysis. This eliminates the missing-value issue without losing significant information. </p><p>6. 
Effect of Variables on landing distance </p><p>6.1 Effect of Aircraft make on landing distance </p><p>Objective: Understand the effect of aircraft model on landing distance. </p><p>R Code: > aircraftmodel aircraftmodel[which(aircraft == "boeing")] plot(distance~aircraftmodel) </p><p>Output: </p></li><li><p>Observation: Based on the given data, landing distances for boeing appear to be shifted upward compared to airbus aircraft. </p><p>Conclusion: It seems that the range of landing distances for boeing aircraft is higher than that for airbus aircraft. </p><p>6.2 Effect of speed_ground on landing distance </p><p>Objective: To understand the effect of speed_ground on landing distance. </p><p>R Code: </p><p>plot(distance~speed_ground) </p><p>Output: </p><p>Observation: </p><p>From the graphs and the correlation, we can conclude that speed_ground has an effect on landing distance. In the speed_ground range of 80-120, distance seems to increase linearly with speed_ground. In the range 40-80, the relationship seems to be nonlinear. </p><p>Conclusion: To explain the nonlinear component in the graph, a quadratic term needs to be included when we fit a model of the relationship between speed_ground and distance. </p></li><li><p>7. Model Fitting and Regression Analysis </p><p>7.1 Initial Model with all variables </p><p>Objective: The goal of this section is to fit a model using all the variables present in the cleaned dataset. </p><p>R Code: </p><p>> Model1 summary(Model1) </p><p>Observation: </p><p>Null hypothesis: the variables (regressors) have no impact on the response (landing distance). </p><p>The p-values for aircraftmodel, speed_ground and height are less than 0.05. 
This means that we can reject the null hypothesis for these variables at the 5% significance level; that is, aircraftmodel, speed_ground and height appear to have an effect on landing distance. </p><p>The R-squared value here is 0.856, which is close to 1. In the ideal scenario, an R-squared of 1 would mean that the model fits the given data perfectly. </p></li><li><p>Conclusion: Based on the above observations we can conclude that: </p><p>The R-squared value of 0.856 indicates that 85.6% of the variability in landing distance is explained by the model. </p><p>We have enough evidence, at the 5% significance level, to believe that aircraftmodel, speed_ground and height do have an impact on landing distance, so we need to consider the effect of these variables separately. We also need to monitor adjusted R-squared to decide whether a model improves on explaining the variability. </p><p>7.2 Model with only aircraftmodel, speed_ground and height </p><p>Objective: The objective here is to reduce the number of variables from the previously built model and analyze the impact on the goodness of fit of the new model. For that we consider only 3 variables: aircraftmodel, speed_ground and height. We also plot the residuals against these 3 variables. </p><p>R Code: </p><p>> Model2 summary(Model2) </p><p>> Residuals1 par(mfrow=c(1,3)) </p><p>> plot(Residuals1~aircraftmodel) </p><p>> plot(Residuals1~speed_ground) </p><p>> plot(Residuals1~height) </p><p>Output: </p></li><li><p>Observations: </p><p>The residual plot for speed_ground shows a nonlinear, U-shaped pattern, which suggests a quadratic relationship. </p><p>Because aircraftmodel takes only two discrete values, 0 and 1, its residuals fall into two discrete groups. 
No meaningful conclusion can be drawn from this plot at this point. </p><p>The residual plot for height has a random, non-symmetric pattern, so it is difficult to draw a meaningful conclusion. </p><p>Conclusion: </p><p>We need to improve our model by addressing the nonlinear residual pattern for speed_ground. To capture the nonlinearity shown in the curved plot, we need to include a nonlinear term in the model so that it can better explain the variability. </p><p>We can keep the height and aircraftmodel variables as they are in the model. </p></li><li><p>7.3 Squaring the values for speed_ground and Improving Model </p><p>Objective: As described for the previous model, we square the values of speed_ground to capture the nonlinear nature of the curve. We also need to monitor the R-squared and adjusted R-squared values for this new model. </p><p>R Code: </p><p>> speed_ground_sqr model3 summary(model3) </p><p>> residuals2 par(mfrow=c(1,4)) </p><p>> plot(residuals2~aircraftmodel) </p><p>> plot(residuals2~height) </p><p>> plot(residuals2~speed_ground) </p><p>> plot(residuals2~speed_ground_sqr) </p></li><li><p>Observations: </p><p>The R-squared and adjusted R-squared values have gone up; both are now 0.9776. </p><p>The p-values for all the variables in the model are less than 0.05. </p><p>The residual plots for speed_ground and speed_ground_sqr are randomly distributed. </p><p>Conclusion: </p><p>The R-squared value of 0.9776 indicates that 97.76% of the variability in the landing distance data is explained by this model. </p><p>This model is the best choice among all the models discussed so far. </p></li><li><p>8. 
Summary </p><p>Based on the analysis, we can conclude that: </p><p>speed_ground and speed_air are highly correlated, and both seem to have an impact on landing distance. </p><p>From the data and the regression analysis, we cannot rule out that height has an impact on landing distance. </p><p>Referring to the plots, speed_ground has a strong relationship with landing distance. Part of the graph points to a linear relationship and part indicates a nonlinear one. The U-shaped residual plot for speed_ground reinforces that there is a nonlinear, quadratic relationship between speed_ground and landing distance. Hence, we need to incorporate a nonlinear term in our model to find the most accurately fitting model. </p><p>In the end, the model that includes a squared term of speed_ground has a very high R-squared value (0.9776), which means that the model from section 7.3 fits better than the other models discussed and explains most of the variability in landing distance. </p></li></ul>
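
The comparison between the linear-only model of section 7.2 and the squared-term model of section 7.3 can be sketched in R as follows. This is a minimal illustration on simulated data, not the report's own code: the data frame `sim`, the coefficients used to generate it, the 0/1 coding of `aircraftmodel`, and the model names `fit_linear` and `fit_quad` are all assumptions made for the example, and none of the numbers it produces correspond to landing.csv.

```r
# Simulated data only: every coefficient below is made up for illustration
# and is NOT an estimate from landing.csv.
set.seed(42)
n <- 500
speed_ground  <- runif(n, min = 30, max = 140)    # MPH, inside the cleaning limits
height        <- runif(n, min = 6, max = 60)      # meters
aircraftmodel <- rbinom(n, size = 1, prob = 0.5)  # assumed coding: 1 = boeing, 0 = airbus

# Hypothetical "true" relationship with a quadratic speed_ground component
distance <- 2500 - 70 * speed_ground + 0.7 * speed_ground^2 +
  14 * height + 400 * aircraftmodel + rnorm(n, sd = 120)
sim <- data.frame(distance, speed_ground, height, aircraftmodel)

# Linear-only model, analogous to section 7.2
fit_linear <- lm(distance ~ aircraftmodel + speed_ground + height, data = sim)

# Same model plus a squared term, analogous to section 7.3.
# I() is needed so that ^ is treated as arithmetic inside the formula.
fit_quad <- lm(distance ~ aircraftmodel + speed_ground + I(speed_ground^2) + height,
               data = sim)

# Compare adjusted R-squared of the two fits
adj_r2 <- c(linear    = summary(fit_linear)$adj.r.squared,
            quadratic = summary(fit_quad)$adj.r.squared)
print(round(adj_r2, 4))

# F test for the nested models: does the squared term add explanatory power?
model_comparison <- anova(fit_linear, fit_quad)
print(model_comparison)
```

Writing the squared term as `I(speed_ground^2)` inside the formula avoids creating a separate `speed_ground_sqr` variable; either approach fits the same model. The nested-model F test from `anova()` is an addition to the adjusted R-squared comparison used in the report, included here as one possible way to check whether the extra term is worth keeping.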