Airline Flights Delay Prediction - 2014 Spring Data Mining Project

Post on 15-Jul-2015


Flight Delay Prediction Model

Vishwanath K, Viral Tarpara, Haozhe Wang, Ling Zhou

Business Problem Overview

Flight delays are a challenging problem for all airline companies. They lead to:

● Financial losses.

● Negative impact on their business reputation.

Cost of Delays in the US (total: $32.9B)

● Cost to Airlines: $8.3B

● Cost to Passengers: $16.7B

● Cost from Lost Demand: $3.9B

● GDP Impact: $4B

Source: Total Cost Impact Study

Business Problem Overview

A model that predicts flight delays can help airline companies:

● Optimize operations

● Reduce further losses

Literature Review on Delay Costs

The airline industry incurs an average cost of about $11,300 per delayed flight.

● Based on an average of 61,000 delayed flights per month

● Excludes costs to passengers and lost demand
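As a rough consistency check, the two figures above can be multiplied out to an annual total. The sketch below assumes that the per-flight cost and the monthly count are meant to be combined this way and that twelve average months make a year; the variable names are illustrative.

```python
# Figures quoted on the slide
COST_PER_DELAYED_FLIGHT = 11_300      # USD, airline cost only
DELAYED_FLIGHTS_PER_MONTH = 61_000    # monthly average

# Assumption: a simple product over 12 months approximates the annual airline cost
monthly_cost = COST_PER_DELAYED_FLIGHT * DELAYED_FLIGHTS_PER_MONTH
annual_cost = monthly_cost * 12

print(f"Monthly airline cost: ${monthly_cost / 1e6:,.1f}M")
print(f"Annual airline cost:  ${annual_cost / 1e9:,.2f}B")
```

Under these assumptions the annual total comes to roughly $8.3B, in line with the "Cost to Airlines" figure quoted earlier.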

A more accurate delay prediction system can help to identify operational variables that contribute to delays.

While some conditions, such as weather, are not controllable factors, the way airlines and airports operate and optimize resources in the face of "acts of god" is controllable.

Data Understanding

Dataset: On-Time Performance

From Research and Innovative Technology Administration, BTS

Data Understanding

Potentially Useful Variables:

● Quarter, Month, Day of Month

● Flight Number

● Origin Airport, Destination Airport

● Departure Block, Arrival Block

● Carrier

● Departure Delay, Arrival Delay

Variable groups: Time, Operation, Geography, Airline

Data Preparation

Training:

● Selected Attributes from 2012 Data

● Derived Attributes from 2011 Data

● Attributes from Additional Dataset

Testing:

● Selected Attributes from 2013 Data

● Derived Attributes from 2012 Data

● Attributes from Additional Dataset

Data Preparation

Selected Attributes:

1. Quarter

2. Month

3. Day of Month

4. FL_NUM: Flight Number

5. Origin: Origin Airport

6. Dest: Destination Airport

7. UniqueCarrier: Unique Carrier Code

8. DepTimeBLK: Departure Time Block, Hourly Intervals

9. ArrTimeBLK: Arrival Time Block, Hourly Intervals

Target: ArrDel: Arrival Delay, 1=Y, 0=N

Removed for the project. To build the full model, these attributes are necessary.
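The hourly block attributes and the binary target listed above could be derived from raw values roughly as follows. This is a minimal sketch: the block label format and the 15-minute cutoff are assumptions for illustration and are not stated on the slides, and both function names are hypothetical.

```python
def time_block(hhmm: int) -> str:
    """Map a scheduled time in HHMM form (e.g. 1435) to an hourly block label."""
    hour = hhmm // 100
    # Assumption: plain hourly labels such as "1400-1459"; the source file may
    # group some hours differently (for example, overnight hours).
    return f"{hour:02d}00-{hour:02d}59"


def delay_flag(arrival_delay_minutes: float, threshold: float = 15.0) -> int:
    """Return 1 if the flight counts as delayed, else 0."""
    # Assumption: the 15-minute cutoff is illustrative, not taken from the slides.
    return 1 if arrival_delay_minutes >= threshold else 0


print(time_block(1435), delay_flag(22))
print(time_block(605), delay_flag(-3))
```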

Data Preparation

Derived Attributes:

1. Airline_Delay: the percentage of delay by each airline in one year

2. Flight_Delay: the percentage of delay by each specific flight in one year

3. Day_Delay: the percentage of delay by day of month for all flights in one year

4. Origin_Delay: the percentage of delay by each origin airport for all flights in one year

5. Dest_Delay: the percentage of delay by each destination airport for all flights in one year

6. Dep_BLK_Delay: the percentage of delay by each departure block for all flights in one year

7. Arr_BLK_Delay: the percentage of delay by each arrival block for all flights in one year
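Each derived attribute above is a delay rate grouped by one key (airline, flight, day, airport, or time block) over the prior year. A minimal sketch of that grouping, using made-up records and a hypothetical helper `delay_rate_by`, might look like this; the slides do not show how the authors actually computed these values.

```python
from collections import defaultdict


def delay_rate_by(records, key):
    """Share of delayed flights (ArrDel == 1) per distinct value of `key`."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        delayed[row[key]] += row["ArrDel"]
    return {k: delayed[k] / totals[k] for k in totals}


# Made-up prior-year records, for illustration only
prior_year = [
    {"UniqueCarrier": "AA", "Origin": "ORD", "ArrDel": 1},
    {"UniqueCarrier": "AA", "Origin": "JFK", "ArrDel": 0},
    {"UniqueCarrier": "DL", "Origin": "ATL", "ArrDel": 0},
    {"UniqueCarrier": "DL", "Origin": "ORD", "ArrDel": 1},
    {"UniqueCarrier": "DL", "Origin": "ATL", "ArrDel": 0},
]

airline_delay = delay_rate_by(prior_year, "UniqueCarrier")
origin_delay = delay_rate_by(prior_year, "Origin")

# Attach the prior-year rates to a current-year flight as features
flight = {"UniqueCarrier": "AA", "Origin": "ORD"}
flight["Airline_Delay"] = airline_delay[flight["UniqueCarrier"]]
flight["Origin_Delay"] = origin_delay[flight["Origin"]]
print(flight)
```

Computing the rates from the previous year, rather than from the year being predicted, keeps the target out of the features.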

Data Preparation

Additional Dataset: Schedule Employees

From Research and Innovative Technology Administration, BTS

Data Preparation

Additional Attributes:

1. Full Time Employees in current month

2. Part Time Employees in current month

3. FTE Employees: Full Time Equivalent Employees in current month (2 part time = 1 full time)

4. Total Employees in current month
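The FTE rule above (two part-time employees count as one full-time employee) reduces to a one-line formula. The sketch below applies it to hypothetical head counts; treating "Total Employees" as the simple sum of full-time and part-time staff is an assumption.

```python
def fte_employees(full_time: int, part_time: int) -> float:
    # Rule stated on the slide: 2 part-time employees = 1 full-time employee
    return full_time + part_time / 2


def total_employees(full_time: int, part_time: int) -> int:
    # Assumption: "Total" is the plain head count
    return full_time + part_time


# Hypothetical carrier-month head counts
print(fte_employees(40_000, 5_000))
print(total_employees(40_000, 5_000))
```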

We wanted to see whether historical on-time performance and current staffing levels were enough to build a decent model.

Data Preparation

Large dataset (2.9 GB)

Merge these attributes by month (via Excel VLOOKUP)

Use one month of data, January, to build the model.
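The slides did this merge in Excel. An equivalent lookup-style join can be sketched in Python with a dictionary, as below. The data is made up, and keying the staffing table on carrier as well as month is an assumption, since the slides only say the merge was done by month.

```python
# Hypothetical staffing table keyed by (carrier, month)
staffing = {
    ("AA", 1): {"FullTime": 100, "PartTime": 20},
    ("DL", 1): {"FullTime": 120, "PartTime": 10},
}

# Hypothetical flight records
flights = [
    {"UniqueCarrier": "AA", "Month": 1, "FL_NUM": 1},
    {"UniqueCarrier": "DL", "Month": 1, "FL_NUM": 2},
    {"UniqueCarrier": "AA", "Month": 2, "FL_NUM": 3},
]


def merge_staffing(flights, staffing, month=1):
    """Keep flights from one month and attach the matching staffing columns."""
    merged = []
    for row in flights:
        if row["Month"] != month:
            continue
        extra = staffing.get((row["UniqueCarrier"], row["Month"]))
        if extra is None:
            continue  # no matching staffing row, similar to a failed lookup
        merged.append({**row, **extra})
    return merged


january = merge_staffing(flights, staffing)
print(january)
```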

Modeling

• Naive Bayes

• Decision tree: J48 (with various leaf sizes)

• Logistic Regression “refused” to process in Weka

Modeling

Preprocess

• Convert the type of attributes

• Convert CSV file to ARFF (70 MB)

Training:

• Instances: 422539

• Attributes: 19

Testing:

• Instances: 478145

• Attributes: 19
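The preprocessing steps above (setting attribute types and converting CSV to ARFF) can be illustrated with a small converter. This is a sketch only: the helper `csv_to_arff` and the sample data are hypothetical, and the slides do not say which tool the authors used for the conversion. Columns named in `nominal_columns` are written as nominal attributes, which is the form Weka classifiers such as J48 and Naive Bayes expect for the class attribute.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal_columns):
    """Convert CSV text to ARFF text, declaring each column numeric or nominal."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    names = reader.fieldnames
    lines = ["@relation " + relation, ""]
    for name in names:
        if name in nominal_columns:
            # Nominal attributes list their distinct values in braces
            values = ",".join(sorted({row[name] for row in rows}))
            lines.append("@attribute " + name + " {" + values + "}")
        else:
            lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[name] for name in names))
    return "\n".join(lines)


# Made-up sample with two numeric and two nominal columns
sample = "Month,UniqueCarrier,Airline_Delay,ArrDel\n1,AA,0.21,1\n1,DL,0.17,0\n"
arff_text = csv_to_arff(sample, "flights", {"UniqueCarrier", "ArrDel"})
print(arff_text)
```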

Naive Bayes Modeling

On Training Data

Confusion Matrix of Naïve Bayes:

     a      b   <-- classified as
333876  28289 | a = 0 (on-time)
 45761  14613 | b = 1 (delay)

             Accuracy  ROC Area
Naïve Bayes  82.475%   0.694

False negatives (45761 delayed flights classified as on-time) carry a high cost; lower is better.
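The accuracy figure can be reproduced from the four cells of the confusion matrix. The sketch below treats "delay" as the positive class, so the 45761 delayed flights classified as on-time are the false negatives; the function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Summary metrics for a 2x2 confusion matrix with 'delay' as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_negatives": fn,
        # Share of actual delays the model classified as on-time
        "false_negative_rate": fn / (fn + tp),
        "recall": tp / (fn + tp),
    }


# Naive Bayes on the training data, cells taken from the matrix above
nb = confusion_metrics(tn=333876, fp=28289, fn=45761, tp=14613)
print(f"accuracy            = {nb['accuracy']:.5f}")
print(f"false negative rate = {nb['false_negative_rate']:.3f}")
```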

Modeling- snapshot

J48 with different parameters:

MinObjNum  Accuracy   ROC Area
15         88.4917%   0.85
25         87.7308%   0.791
50         87.3311%   0.774
100        87.0414%   0.767
150        86.8746%   0.761

Modeling - snapshot

Confusion Matrix of J48, 15:

     a      b   <-- classified as
356407   5758 | a = 0
 42869  17505 | b = 1

Confusion Matrix of J48, 25:

     a      b   <-- classified as
356570   5595 | a = 0
 46247  14127 | b = 1

Confusion Matrix of J48, 50:

     a      b   <-- classified as
356885   5280 | a = 0
 48251  12123 | b = 1

Confusion Matrix of J48, 100:

     a      b   <-- classified as
357881   4284 | a = 0
 50471   9903 | b = 1

Training Model Performance - J48

[Figure: ROC Area (y-axis, 0.70 to 0.86) versus MinObjNum (x-axis: 15, 25, 50, 100, 150)]

The trend levels off around 0.76.

Model Evaluation

● The evaluation is mainly based on false negatives, that is, delayed flights classified as on-time: in this case passengers are told to expect an on-time arrival but end up being late.

● We choose the training model with the largest AUC and the smallest false negative (FN) value.

MinObjNum    Accuracy   ROC Area  FN Value  Result
15           88.4917%   0.85      42869     Reject
25           87.7308%   0.791     46247     Reject
50           87.3311%   0.774     48251     Reject
100          87.0414%   0.767     50471     Reject
150          86.8746%   0.761     51276     Reject
Naive Bayes  82.475%    0.694     45761     Accept
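The FN values in the table above are easier to compare as a share of all delayed flights in the training set. The sketch below divides each FN value by 60374, the number of actual delays implied by the Naive Bayes confusion matrix (45761 + 14613); the dictionary keys are illustrative labels.

```python
ACTUAL_DELAYS = 45761 + 14613  # delayed flights in the training set

# FN values copied from the table above
false_negatives = {
    "J48 (15)": 42869,
    "J48 (25)": 46247,
    "J48 (50)": 48251,
    "J48 (100)": 50471,
    "J48 (150)": 51276,
    "Naive Bayes": 45761,
}

# Share of delayed flights that each model classified as on-time
missed_share = {model: fn / ACTUAL_DELAYS for model, fn in false_negatives.items()}

for model, share in sorted(missed_share.items(), key=lambda kv: kv[1]):
    print(f"{model:12s} {share:.1%}")
```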

Model Evaluation

Model Performance on Testing Data (Jan 2013)

Model              ROC Area  FN Value
J48_minObjNum=100  0.512     82442
Naive Bayes        0.583     74058
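For context on the test results, ROC area can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight, so a value of 0.5 corresponds to an uninformative scorer. The sketch below computes it by direct pairwise comparison on toy data; it is an illustration of the metric, not the procedure Weka uses internally, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """ROC area via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # scores separate the classes
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # scores carry no information
```

The pairwise loop is quadratic, which is fine for a toy example but too slow for hundreds of thousands of instances; a rank-based formulation would be the usual choice at that scale.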

Deployment

Example: Avoiding the Most Delay-Prone Parts of the System

● Schedule your flight without a layover

● Avoid the major hubs by using smaller airports. Chicago ORD, New York City (All), and Atlanta were the worst in terms of congestion

● Early morning departures have better on-time performance

● Late afternoon and early evening departures have the worst on-time performance

"When I can, I try to arrive the night before," says Russell Hayward, a USA TODAY Road Warrior. "But that eats up a whole work day, wasted travel time due to airline uncertainty." (Woodyard, 2001)
