Airline Flight Delay Prediction: 2014 Spring Data Mining Project
Post on 15-Jul-2015
Flight Delay Prediction Model
Vishwanath K, Viral Tarpara, Haozhe Wang, Ling Zhou
Business Problem Overview
Flight delays are a challenging problem for all airline companies and lead to:
● Financial losses.
● Negative impact on their business reputation.
Cost of Delays in the US: $32.9B total
● Cost to Airlines: $8.3B
● Cost to Passengers: $16.7B
● Cost from Lost Demand: $3.9B
● GDP Impact: $4B
Source: Total Cost Impact Study
Business Problem Overview
Model: Predict Flight Delay
Help Airline Companies:
● Optimize operation
● Reduce further loss
Literature Review on Delay Costs
The airline industry incurs an average cost of about $11,300 per delayed flight.
● Based on an average of 61,000 delayed flights per month
● Excludes costs to passengers and lost demand
A more accurate delay prediction system can help to identify operational variables that contribute to delays.
While some conditions, such as weather, are not controllable factors, the way airlines and airports operate and optimize resources in the face of "acts of god" is controllable.
Data Understanding
Dataset: On-Time Performance
From Research and Innovative Technology Administration, BTS
Data Understanding
Potentially Useful Variables:
● Quarter; Month; Day of Month
● Flight Number
● Origin Airport; Destination Airport
● Departure Block; Arrival Block
● Carrier
● Departure Delay; Arrival Delay
Variable categories shown on the slide: Time, Operation, Geography, Airline
Data Preparation
Training:
● Selected Attributes from 2012 Data
● Derived Attributes from 2011 Data
● Attributes from Additional Dataset
Testing:
● Selected Attributes from 2013 Data
● Derived Attributes from 2012 Data
● Attributes from Additional Dataset
Data Preparation
Selected Attributes:
1. Quarter
2. Month
3. Day of Month
4. FL_NUM: Flight Number
5. Origin: Origin Airport
6. Dest: Destination Airport
7. UniqueCarrier: Unique Carrier Code
8. DepTimeBLK: Departure Time Block, Hourly Intervals
9. ArrTimeBLK: Arrival Time Block, Hourly Intervals
Target: ArrDel: Arrival Delay, 1=Y, 0=N
Removed for the project. To build the full model, these attributes are necessary.
Data Preparation
Derived Attributes:
1. Airline_Delay: the percentage of delay by each airline in one year
2. Flight_Delay: the percentage of delay by each specific flight in one year
3. Day_Delay: the percentage of delay by day of month for all flights in one year
4. Origin_Delay: the percentage of delay by each origin airport for all flights in one year
5. Dest_Delay: the percentage of delay by each destination airport for all flights in one year
6. Dep_BLK_Delay: the percentage of delay by each departure block for all flights in one year
7. Arr_BLK_Delay: the percentage of delay by each arrival block for all flights in one year
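The derived attributes above are all historical delay percentages grouped by one field. The slides do not show how they were computed, so the following is only a minimal sketch. It assumes each record is a dictionary using the attribute names from the slides (UniqueCarrier, Origin, ArrDel); the helper name delay_rate_by and the sample records are hypothetical.

```python
from collections import defaultdict

def delay_rate_by(rows, key):
    """Return {key value: percentage of delayed flights} for one year of rows."""
    delayed = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        delayed[row[key]] += row["ArrDel"]  # ArrDel is 1 if delayed, 0 if on time
    return {value: 100.0 * delayed[value] / total[value] for value in total}

# Hypothetical prior-year records, for illustration only.
prior_year = [
    {"UniqueCarrier": "AA", "Origin": "ORD", "ArrDel": 1},
    {"UniqueCarrier": "AA", "Origin": "ORD", "ArrDel": 0},
    {"UniqueCarrier": "AA", "Origin": "ATL", "ArrDel": 0},
    {"UniqueCarrier": "DL", "Origin": "ATL", "ArrDel": 1},
]

airline_delay = delay_rate_by(prior_year, "UniqueCarrier")  # Airline_Delay
origin_delay = delay_rate_by(prior_year, "Origin")          # Origin_Delay
```

The same helper would cover the other derived attributes by changing the grouping key, for example FL_NUM for Flight_Delay or DepTimeBLK for Dep_BLK_Delay.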
Data Preparation
Additional Dataset: Schedule Employees
From Research and Innovative Technology Administration, BTS
Data Preparation
Additional Attributes:
1. Full Time Employees in current month
2. Part Time Employees in current month
3. FTE Employees: Full Time Equivalent Employees in current month
(2 part time= 1 full time)
4. Total Employees in current month
We wanted to see if historical on-time performance and current
staffing levels were enough to build a decent model.
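The staffing attributes follow from two head counts and the stated rule that two part-time employees equal one full-time equivalent. A minimal sketch, assuming monthly full-time and part-time counts are available per carrier; the function name and the sample numbers are hypothetical.

```python
def staffing_attributes(full_time, part_time):
    """Build the four staffing attributes for one carrier in one month."""
    return {
        "full_time": full_time,
        "part_time": part_time,
        # Two part-time employees count as one full-time equivalent.
        "fte": full_time + part_time / 2,
        "total": full_time + part_time,
    }

# Hypothetical head counts, for illustration only.
january = staffing_attributes(full_time=1000, part_time=200)
```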
Data Preparation
Large dataset size (2.9GB)
Merge these attributes by month (via Excel VLOOKUP)
Use data of one month, January, to build the model.
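The merge was done with Excel VLOOKUP. For comparison, the same month-based lookup can be sketched in Python with a dictionary. This is an illustration only: keying the staffing table on carrier as well as month is an assumption, and all values below are hypothetical.

```python
# Hypothetical staffing table keyed by (carrier, month); values are FTE counts.
employees = {
    ("AA", 1): 1000.0,
    ("AA", 2): 990.0,
    ("DL", 1): 1200.0,
}

# Hypothetical flight records using the attribute names from the slides.
flights = [
    {"UniqueCarrier": "AA", "Month": 1, "FL_NUM": 10},
    {"UniqueCarrier": "DL", "Month": 1, "FL_NUM": 20},
    {"UniqueCarrier": "AA", "Month": 2, "FL_NUM": 10},
]

def merge_january(flights, employees):
    """Keep January flights and attach the matching staffing value to each."""
    merged = []
    for flight in flights:
        if flight["Month"] != 1:
            continue  # only one month of data is used to build the model
        key = (flight["UniqueCarrier"], flight["Month"])
        row = dict(flight)
        row["FTE"] = employees.get(key)  # None when the lookup finds no match
        merged.append(row)
    return merged

january_rows = merge_january(flights, employees)
```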
Modeling
• Naive Bayes
• Decision tree: J48 (with various leaf sizes)
• Logistic Regression “refused” to process in Weka
Modeling
Preprocess
• Convert the type of attributes
• Convert CSV file to ARFF (70MB)
Training:
• Instances: 422539
• Attributes: 19
Testing:
• Instances: 478145
• Attributes: 19
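The preprocessing steps (converting attribute types and converting CSV to ARFF) can be sketched as follows. The slides do not say which tool performed the conversion, so this is a hedged illustration of the ARFF layout only: the function name, relation name, and sample values are hypothetical, and the choice of which columns are nominal is an assumption.

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal_columns):
    """Convert CSV text to ARFF text, declaring the given columns as nominal."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]

    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        if name in nominal_columns:
            # A nominal attribute lists every distinct value seen in the data.
            # Values containing spaces or commas would need quoting (not handled).
            values = sorted({row[index] for row in rows})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
        else:
            lines.append("@attribute %s numeric" % name)
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in rows)
    return "\n".join(lines) + "\n"

# Hypothetical two-row sample, for illustration only.
sample_csv = "Month,UniqueCarrier,Airline_Delay,ArrDel\n1,AA,21.5,0\n1,DL,18.2,1\n"
arff_text = csv_to_arff(sample_csv, "flights", {"UniqueCarrier", "ArrDel"})
```

Declaring the target ArrDel as nominal matters because classifiers such as J48 expect a nominal class attribute. The training and testing files would also need matching attribute declarations.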
NaiveBayes Modeling
On Training Data
Confusion Matrix of Naïve Bayes:
a b <-- classified as
333876 28289 | a = 0 (on-time)
45761 14613 | b = 1 (delay)
Model         Accuracy   ROC Area
Naïve Bayes   82.475%    0.694
High cost, lower is better
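The accuracy figure can be checked directly from the confusion matrix. In Weka's layout, rows are the actual class and columns are the predicted class, so the lower-left cell counts delayed flights that were classified as on time (false negatives). A small sketch using the Naïve Bayes numbers above; the function name is hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Summarize a 2x2 confusion matrix where class 1 means a delayed flight."""
    total = tn + fp + fn + tp
    return {
        "instances": total,
        "accuracy": (tn + tp) / total,
        # Delayed flights that were predicted to be on time.
        "false_negatives": fn,
    }

# Values from the Naive Bayes confusion matrix on the training data.
naive_bayes = summarize(tn=333876, fp=28289, fn=45761, tp=14613)
accuracy_percent = 100 * naive_bayes["accuracy"]
```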
Modeling- snapshot
J48 with different parameters:
MinObjNum   Accuracy    ROC Area
15          88.4917%    0.85
25          87.7308%    0.791
50          87.3311%    0.774
100         87.0414%    0.767
150         86.8746%    0.761
Modeling - snapshot
Confusion Matrix of J48, 15:
a b <-- classified as
356407 5758 | a = 0
42869 17505 | b = 1
Confusion Matrix of J48, 25:
a b <-- classified as
356570 5595 | a = 0
46247 14127 | b = 1
Confusion Matrix of J48, 50:
a b <-- classified as
356885 5280 | a = 0
48251 12123 | b = 1
Confusion Matrix of J48, 100:
a b <-- classified as
357881 4284 | a = 0
50471 9903 | b = 1
Training Model Performance: J48
[Figure: ROC Area (y-axis, 0.70 to 0.86) versus MinObjNum (x-axis: 15, 25, 50, 100, 150). The trend levels off around 0.76.]
Model Evaluation
● The evaluation is mainly based on the falsely classified on-time instances: this is the case where passengers are given confidence that they will arrive on time but end up being late.
● We choose the training model with the largest AUC and the smallest False Negative value.
MinObjNum     Accuracy    ROC Area   FN Value   Results
15            88.4917%    0.85       42869      Reject
25            87.7308%    0.791      46247      Reject
50            87.3311%    0.774      48251      Reject
100           87.0414%    0.767      50471      Reject
150           86.8746%    0.761      51276      Reject
Naive Bayes   82.475%     0.694      45761      Accept
Model Evaluation
Model Performance on Testing Data (Jan 2013)
Model               ROC Area   FN Value
J48_minObjNum=100   0.512      82442
Naive Bayes         0.583      74058
Deployment
Example: Avoiding the Most Delay-Prone Parts of the System
Schedule your flight without a layover
Avoid the major hubs by using smaller airports
Chicago ORD, New York City (All), and Atlanta were the worst in terms of congestion
Early morning departures have better on-time performance
Late afternoon and early evening have the worst on-time performance
"When I can, I try to arrive the night before," says Russell Hayward, a USA TODAY Road Warrior. "But that eats up a whole work day, wasted travel time due to airline uncertainty." (Woodyard, 2001)