Predicting Flight Delays


Posted on 16-Aug-2015



Data & Analytics



<ol><li> 1. How long would I have to wait for my flight? </li>
<li> 2. Objective. Predict flight delays: flight delays are a major problem in the airline industry because they lead to financial losses and damage business reputation. Examine causes of flight delays: this project explores which factors influence the occurrence of flight delays. Diagram: a model that predicts flight delays helps airline companies optimize operations and reduce further loss. </li>
<li> 3. Project flow map: Understanding the Data; Variable Selection; Sub-Setting Data; Replacing Missing Values; Removing Outliers; Re-creating Variables; KNN; C5.0; Misclassification Summary. </li>
<li> 4. 7 million records; 303 origins; 303 destinations; 20 carriers; 29 variables. </li>
<li> 5. Understanding the Data. Summary statistics: Arrival Delay (ArrDelay), NAS Delay, Distance. Correlation. </li>
<li> 6. Variable Selection: 29 variables. Flight details: arrival time, departure time, origin, destination, month, week, day, unique carrier, flight num, tail num, year. Other information: delay minutes, weather delay, NAS delay, unique carrier delay, late aircraft delay, security delay, actual arrival time, taxi in, taxi out, departure delay, CRS elapsed time. We took expert advice from Prof Dehnad. </li>
<li> 7. TAKING CHARGE OF DATA </li>
<li> 8. Step 1: Subsetting. Busiest origins; top carriers by origin. Carrier hubs: American Airlines, Dallas; Delta, Atlanta; United, Chicago. </li>
<li> 9. Step 2: Replacing Missing Values. Missing values are present when flights are diverted or cancelled. Chart: % of flights diverted. Dropped all records where Diverted = 1. <table><tr><th>Cancelled</th><th>No. of flights</th></tr><tr><td>0</td><td>6,872,294</td></tr><tr><td>1</td><td>137,434</td></tr></table> Dropped all records where Cancelled = 1. Replaced NAs with 0 in the columns CarrierDelay, SecurityDelay, LateArrivalDelay, NASDelay and WeatherDelay. Final record count = 479,281. </li>
<li> 10. 
Step 3: Outliers. Box plot from R: distance by origin. Destinations with large distance: HNL = Honolulu, Hawaii; OGG = Kahului, Hawaii; ANC = Anchorage, Alaska. </li>
<li> 11. Arrival Delay. Box plot from R: delays by month. </li>
<li> 12. Misclassifications: If ArrDelay Ran for more than 6 hours. Used a for loop and ran it for k = 1:20 overnight. </li>
<li> 18. KNN Result. For loop: lowest error rate at k = 17. The error rate decreased from 11.75% (k = 1) to 10.69% (k = 17), then increased from 10.69% to 10.76% (k = 20). Chart: misclassification rate against K value, k = 1 to 20 (labelled value 0.1069235). </li>
<li> 19. Classification Tree. Which factors impact delays the most? Is KNN the best method to predict delays? </li>
<li> 20. Classification Tree. C5.0: splits are not binary; the tree created is bushier. CART: splits categorical variables with binary splits; the tree is less bushy. </li>
<li> 21. First output of decision tree: 159 nodes; error rate 9.7%. Attribute usage: Weather Delay 100.00%; NAS Delay 100.00%; Late Aircraft Delay 99.88%; Carrier Delay 93.29%; Unique Carrier 18.25%; Origin 15.79%; Distance 10.90%; CRS Arr Time 10.29%; CRS Dep Time 7.89%; Weather 3.88%; Dest 3.80%; holiday 3.11%. </li>
<li> 22. Tree after pruning. Pruning methodology used: C5.0Control (minCases = 3000). The error rate increases from 9.7% to 10.8%; the number of nodes falls from 159 to 12. Which variables are important (variable usage): NAS Delay 100.00%; Late Aircraft Delay 98.28%; Carrier Delay 81.47%; Weather Delay 7.49%; Unique Carrier 7.39%; Origin 6.56%; CRS Dep Time 5.29%; CRS Arr Time 3.89%. </li>
<li> 23. Tree after pruning </li>
<li> 24. Our learnings from the tree: (1) Flights tend to face longer delays in the evening (after 4:30 PM). (2) Carrier delays at UA are likely to cause long delays. (3) Flights from Chicago (ORD) are more likely to face long delays due to late-aircraft and NAS delays. </li>
<li> 25. 
Misclassification <table><tr><th>Method</th><th>Other details</th><th>Classification error</th><th>Prediction error</th></tr><tr><td>KNN</td><td>K = 1</td><td>NA</td><td>11.75%</td></tr><tr><td>KNN</td><td>K = 17</td><td>NA</td><td>10.69%</td></tr><tr><td>KNN</td><td>K = 20</td><td>NA</td><td>10.76%</td></tr><tr><td>C5.0</td><td>All variables, without pruning</td><td>9.7%</td><td>10.0%</td></tr><tr><td>C5.0</td><td>All variables, with pruning</td><td>10.8%</td><td>10.7%</td></tr></table> </li>
<li> 26. Re-run KNN and C5.0 without delay columns. C5.0: 1 node; misclassification rate 30.6%. KNN: for K up to 20, misclassification rate 29.18%. Chart: KNN prediction error. </li>
<li> 27. Management Summary. For passengers: fly during the first half of the day to avoid delays; shorter flights ~ longer delays; higher delays on Friday; highest delays in summer. For carriers: UA should shift its hub from Chicago to other origins to reduce the number of delays. Charts: percentage of flights by carrier (AA, DL, UA) and origin (ATL, DFW, ORD); high and low delay for AA and UA at ORD and DFW. </li>
<li> 28. Issues / Next Steps. Try Random Forest to improve predictions. Make predictions without the reason-for-delay variables. The data is biased: almost 70% of flights arrive on time. Running on a large dataset is limited by the processing capability of our laptops, hence the need to divide the data. For the large dataset, we increased the heap size in R. </li>
<li> 29. Appendix: Data, packages and software. Dataset. Software: RStudio, Tableau. Packages: corrplot, class, C50, rpart. </li>
<li> 30. Appendix: R code </li>
<li> 31. </li></ol>
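The KNN search described on the slides (a for loop over k = 1:20 that keeps the k with the lowest misclassification rate) can be sketched roughly as below. The slides used R's class package; this is only a plain-Python illustration on made-up data, and every name in it (knn_predict, misclassification_rate, sweep_k, majority_baseline_error, make_synthetic_flights) as well as the synthetic delay rule is a hypothetical stand-in, not the authors' code or dataset.

```python
from collections import Counter
import random


def knn_predict(train_x, train_y, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbours.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, point)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def misclassification_rate(train_x, train_y, test_x, test_y, k):
    """Share of test points whose predicted label differs from the true one."""
    wrong = sum(
        knn_predict(train_x, train_y, p, k) != y for p, y in zip(test_x, test_y)
    )
    return wrong / len(test_y)


def sweep_k(train_x, train_y, test_x, test_y, k_values):
    """Try every k and return (k with the lowest error, all error rates)."""
    rates = {
        k: misclassification_rate(train_x, train_y, test_x, test_y, k)
        for k in k_values
    }
    best_k = min(rates, key=rates.get)  # ties go to the smallest k tried first
    return best_k, rates


def majority_baseline_error(train_y, test_y):
    """Error of always predicting the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return sum(y != majority for y in test_y) / len(test_y)


def make_synthetic_flights(n, seed):
    """Hypothetical stand-in data: (departure hour, distance) -> label."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        hour = rng.uniform(5, 23)
        distance = rng.uniform(0.1, 3.0)  # thousands of miles, made up
        # Made-up rule: later departures are delayed more often, plus noise.
        delayed = hour + rng.gauss(0, 3) > 17
        xs.append((hour, distance))
        ys.append("delayed" if delayed else "on_time")
    return xs, ys


train_x, train_y = make_synthetic_flights(300, seed=1)
test_x, test_y = make_synthetic_flights(100, seed=2)
best_k, rates = sweep_k(train_x, train_y, test_x, test_y, range(1, 21))
baseline = majority_baseline_error(train_y, test_y)
print(f"best k = {best_k}, error = {rates[best_k]:.3f}, baseline = {baseline:.3f}")
```

The baseline function is included because it gives a reference point for the re-run without delay columns: a one-node tree predicts a single class for every flight, so with almost 70% of flights on time its error would be expected to sit near 30%, which is close to the 30.6% reported on the slides.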