by Amir Ebrahimi
The analysis that follows make use of the the DOT’s quarterly airfare data from the Domestic Airline Consumer Airfare Report. The reports and airfare data can be found in original form at:
http://www.transportation.gov/policy/aviation-policy/domestic-airline-consumer-airfare-report
My goals for this analysis are primarily to have an informed perspective of air travel from actual data. Secondarily, I’m hoping that some practical facts emerge that any traveler could benefit from. I will perform univariate, bivariate, and multivariate analysis in order to achieve these goals. Finally, I will attempt to build a linear model that can account for relationships among the data.
Some records were missing the information for the low and large fare carriers. Others had erroneous fare data. Incomplete records are not that useful to me in the analysis and since the NA data is less than 1% of the overall data I am deciding to exclude these records from the analysis. The summary below shows the individual NA counts per variable.
## Year quarter nsmiles airport_1
## Min. :2001 Min. :1.000 Min. : 101.0 DAL :174
## 1st Qu.:2005 1st Qu.:1.000 1st Qu.: 840.2 MDW :120
## Median :2009 Median :2.000 Median :1194.0 JFK : 64
## Mean :2008 Mean :2.431 Mean :1181.6 EWR : 56
## 3rd Qu.:2012 3rd Qu.:3.000 3rd Qu.:1617.0 HOU : 51
## Max. :2014 Max. :4.000 Max. :2620.0 LGB : 49
## NA's :5 (Other):971
## airport_2 passengers fare carrier_lg
## MDW :180 Min. : 0.1000 Min. : 63.95 99 : 4
## LGB : 75 1st Qu.: 0.1099 1st Qu.: 216.05 CO : 3
## JFK : 66 Median : 0.2174 Median : 298.00 G4 : 2
## HOU : 64 Mean : 0.7028 Mean : 344.06 AA : 1
## SWF : 64 3rd Qu.: 0.3297 3rd Qu.: 409.22 US : 1
## PIE : 61 Max. :137.0000 Max. :2133.00 (Other): 0
## (Other):975 NA's :1474
## large_ms fare_lg carrier_low lf_ms
## Min. :0.0454 Min. : 95.76 99 : 4 Min. :1
## 1st Qu.:0.0670 1st Qu.:121.76 G4 : 2 1st Qu.:1
## Median :0.0800 Median :196.00 AA : 1 Median :1
## Mean :0.4656 Mean :256.97 3M : 0 Mean :1
## 3rd Qu.:1.0000 3rd Qu.:322.75 9K : 0 3rd Qu.:1
## Max. :1.0000 Max. :618.00 (Other): 0 Max. :1
## NA's :1478 NA's :1478 NA's :1478 NA's :1482
## fare_low
## Min. : 95.76
## 1st Qu.:102.64
## Median :109.52
## Mean :113.09
## 3rd Qu.:121.76
## Max. :134.00
## NA's :1482
## [1] "Omitting 1485 NA records (0.92% of all records)"
The variables in the dataset are:
## [1] "Year" "quarter" "nsmiles" "airport_1" "airport_2"
## [6] "passengers" "fare" "carrier_lg" "large_ms" "fare_lg"
## [11] "carrier_low" "lf_ms" "fare_low" "haul"
The structure of the dataset and some sample values are:
## Classes 'tbl_df' and 'data.frame': 160258 obs. of 14 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ quarter : num 1 1 1 1 1 1 1 1 1 1 ...
## $ nsmiles : int 1300 151 1308 1347 1351 160 1339 646 654 963 ...
## $ airport_1 : Factor w/ 295 levels "ABE","ABI","ABQ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ airport_2 : Factor w/ 299 levels "ATL","AUS","AVL",..: 48 51 54 64 106 113 115 160 195 278 ...
## $ passengers : num 0.889 9.444 41.667 0.222 2.222 ...
## $ fare : num 130 298 355 175 322 ...
## $ carrier_lg : Factor w/ 39 levels "3M","99","9K",..: 9 33 33 9 11 32 33 24 32 33 ...
## $ large_ms : num 0.75 0.965 0.421 1 0.7 ...
## $ fare_lg : num 149 304 337 175 358 ...
## $ carrier_low: Factor w/ 49 levels "3M","99","9K",..: 13 42 42 11 29 40 42 29 40 42 ...
## $ lf_ms : num 0.25 0.965 0.421 1 0.2 ...
## $ fare_low : num 74.5 304.1 337.4 174.5 129.3 ...
## $ haul : Factor w/ 2 levels "short","medium": 2 1 2 2 2 1 2 2 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1485] 33 95 194 266 270 427 430 432 441 579 ...
## .. ..- attr(*, "names")= chr [1:1485] "33" "95" "194" "266" ...
Airports and carriers are factored variables in the dataset. The full names for the airport and carriers can be looked up by code at IATA.
## $airport_1
## [1] "ABE" "ABI" "ABQ" "ABY" "ACK" "ACT" "ACV" "ACY" "AEX" "AGS" "ALB"
## [12] "ALO" "ALW" "AMA" "APF" "ART" "ASE" "ATL" "ATW" "AUS" "AUW" "AVL"
## [23] "AVP" "AZA" "AZO" "BDL" "BFL" "BGM" "BGR" "BHB" "BHM" "BIL" "BIS"
## [34] "BJI" "BKG" "BLI" "BLV" "BMI" "BNA" "BOI" "BOS" "BPT" "BQK" "BRO"
## [45] "BTM" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CEC" "CHA"
## [56] "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL" "CLT" "CMH" "CMI"
## [67] "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG" "CVG" "CWA" "CYS"
## [78] "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN" "DIK" "DLH"
## [89] "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "EKO" "ELM" "ELP"
## [100] "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA" "FLG"
## [111] "FLL" "FLO" "FNL" "FNT" "FOE" "FSD" "FSM" "FWA" "FYV" "GCC" "GEG"
## [122] "GFK" "GJT" "GNV" "GPT" "GRB" "GRK" "GRR" "GSO" "GSP" "GTF" "GTR"
## [133] "GUC" "GYY" "HDN" "HFD" "HGR" "HHH" "HLN" "HOB" "HOU" "HPN" "HRL"
## [144] "HSV" "HTS" "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "ILE" "ILG"
## [155] "ILM" "IND" "INL" "IPL" "IPT" "ISN" "ISP" "ITH" "JAC" "JAN" "JAX"
## [166] "JFK" "JLN" "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LCH" "LCK" "LEB"
## [177] "LEX" "LFT" "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWS" "LYH"
## [188] "MAF" "MBS" "MCI" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [199] "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH" "MOB"
## [210] "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY" "MYR"
## [221] "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL" "OTH"
## [232] "PAH" "PBG" "PBI" "PDX" "PFN" "PGV" "PHF" "PHL" "PHX" "PIA" "PIE"
## [243] "PIH" "PIT" "PLN" "PNS" "PSC" "PSM" "PSP" "PUW" "PVD" "PWM" "RAP"
## [254] "RDD" "RDM" "RDU" "RFD" "RIC" "RNO" "ROA" "ROC" "RST" "RSW" "SAF"
## [265] "SAN" "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SDF" "SEA" "SFO"
## [276] "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SNA" "SPI" "SRQ" "STL"
## [287] "SUN" "SWF" "SYR" "TLH" "TOL" "TPA" "TRI" "TYS" "VPS"
##
## $airport_2
## [1] "ATL" "AUS" "AVL" "AVP" "AZA" "BDL" "BFL" "BHM" "BIL" "BKG" "BLI"
## [12] "BMI" "BNA" "BOI" "BOS" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE"
## [23] "CAK" "CEC" "CHA" "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL"
## [34] "CLT" "CMH" "CMI" "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG"
## [45] "CVG" "CWA" "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN"
## [56] "DIK" "DLH" "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "ELM"
## [67] "ELP" "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA"
## [78] "FLG" "FLL" "FLO" "FNT" "FSD" "FSM" "FWA" "FYV" "GCC" "GCK" "GEG"
## [89] "GFK" "GJT" "GNV" "GPT" "GRB" "GRI" "GRK" "GRR" "GSO" "GSP" "GTF"
## [100] "GTR" "GUC" "HDN" "HFD" "HHH" "HLN" "HOU" "HPN" "HRL" "HSV" "HTS"
## [111] "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "IFP" "ILE" "ILG" "ILM"
## [122] "IND" "IPT" "ISN" "ISP" "ITH" "IYK" "JAC" "JAN" "JAX" "JFK" "JLN"
## [133] "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LBE" "LCH" "LEB" "LEX" "LFT"
## [144] "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWB" "LWS" "LYH" "MAF"
## [155] "MBS" "MCI" "MCN" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [166] "MGW" "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH"
## [177] "MOB" "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY"
## [188] "MYR" "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL"
## [199] "OTH" "OXR" "PAH" "PBG" "PBI" "PDX" "PFN" "PGD" "PGV" "PHF" "PHL"
## [210] "PHX" "PIA" "PIE" "PIH" "PIT" "PLN" "PNS" "PQI" "PSC" "PSM" "PSP"
## [221] "PUW" "PVC" "PVD" "PVU" "PWM" "RAP" "RDD" "RDM" "RDU" "RFD" "RHI"
## [232] "RIC" "RKD" "RKS" "RNO" "ROA" "ROC" "ROW" "RST" "RSW" "SAF" "SAN"
## [243] "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SCK" "SDF" "SEA" "SFB"
## [254] "SFO" "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SMX" "SNA" "SPI"
## [265] "SPS" "SRQ" "SSI" "STC" "STL" "STS" "SUN" "SUX" "SWF" "SYR" "TEX"
## [276] "TLH" "TOL" "TPA" "TRI" "TSS" "TTN" "TUL" "TUP" "TUS" "TVC" "TWF"
## [287] "TXK" "TYR" "TYS" "UIN" "VGT" "VLD" "VPS" "WAS" "WYS" "XNA" "YKM"
## [298] "YNG" "YUM"
##
## $carrier_lg
## [1] "3M" "99" "9K" "AA" "AS" "AX" "AZ" "B6" "CO" "DH" "DL" "E9" "F9" "FL"
## [15] "G4" "GQ" "HP" "JI" "KS" "LH" "N7" "NJ" "NK" "NW" "OS" "PN" "RP" "SY"
## [29] "TW" "TZ" "U5" "UA" "US" "VX" "WN" "XP" "YX" "YY" "ZV"
##
## $carrier_low
## [1] "3M" "99" "9K" "AA" "AQ" "AS" "AX" "AZ" "B6" "BA" "CO" "DH" "DL" "E9"
## [15] "EY" "F9" "FL" "G4" "GQ" "HP" "IB" "JI" "KS" "LH" "N7" "NH" "NJ" "NK"
## [29] "NW" "OH" "OO" "OS" "PN" "RP" "SN" "SY" "TW" "TZ" "U5" "UA" "UP" "US"
## [43] "VX" "WN" "WS" "YV" "YX" "YY" "ZV"
Here is a summary of all of the variables in the dataset:
## Year quarter nsmiles airport_1
## Min. :2001 Min. :1.00 Min. : 67 DCA : 7730
## 1st Qu.:2005 1st Qu.:1.00 1st Qu.: 619 DFW : 7608
## Median :2010 Median :2.00 Median : 962 EWR : 7213
## Mean :2009 Mean :2.47 Mean :1095 IAD : 5596
## 3rd Qu.:2012 3rd Qu.:3.00 3rd Qu.:1476 IAH : 5110
## Max. :2014 Max. :4.00 Max. :2783 JFK : 5071
## (Other):121930
## airport_2 passengers fare carrier_lg
## TPA : 8271 Min. : 0.100 Min. : 17.0 DL :37638
## ORD : 7902 1st Qu.: 7.609 1st Qu.: 184.3 AA :24327
## LGA : 5838 Median : 27.283 Median : 233.2 UA :23099
## MDW : 5660 Mean : 186.968 Mean : 242.0 US :20919
## JFK : 5183 3rd Qu.: 148.152 3rd Qu.: 288.9 WN :19847
## IAD : 4719 Max. :7021.087 Max. :2566.9 CO :13325
## (Other):122685 (Other):21103
## large_ms fare_lg carrier_low lf_ms
## Min. :0.1000 Min. : 17.0 DL :34342 Min. :0.0100
## 1st Qu.:0.5083 1st Qu.: 181.6 AA :28543 1st Qu.:0.2171
## Median :0.6772 Median : 231.5 UA :19720 Median :0.4894
## Mean :0.6883 Mean : 241.8 US :19502 Mean :0.5293
## 3rd Qu.:0.8965 3rd Qu.: 289.4 WN :16374 3rd Qu.:0.8762
## Max. :1.0000 Max. :2566.9 CO :12084 Max. :1.0000
## (Other):29693
## fare_low haul
## Min. : 12.0 short : 27889
## 1st Qu.: 167.3 medium:132369
## Median : 212.9
## Mean : 223.4
## 3rd Qu.: 265.7
## Max. :2566.9
##
Across all of the airport pair markets:
What is the shortest airport pair distance?
## Source: local data frame [1 x 3]
##
## airport_1 airport_2 nsmiles
## (fctr) (fctr) (int)
## 1 MKE ORD 67
I’m curious which airport pairs had the most passengers traveling (on average) across all quarters:
## Source: local data frame [7,824 x 3]
##
## airport_1 airport_2 mean_passengers
## (fctr) (fctr) (dbl)
## 1 LAX SFO 4853.820
## 2 JFK LAX 4450.665
## 3 LGA ORD 4016.727
## 4 FLL LGA 3635.271
## 5 ATL LGA 3553.825
## 6 JFK SFO 3130.356
## 7 LAX ORD 2823.389
## 8 LAS LAX 2815.348
## 9 FLL JFK 2769.823
## 10 EWR MCO 2764.043
## .. ... ... ...
Now, I’m wanting to see what the most traveled airport pairs were for any given quarter were:
## Source: local data frame [7,824 x 3]
##
## airport_1 airport_2 max_passengers
## (fctr) (fctr) (dbl)
## 1 JFK LAX 7021.087
## 2 FLL LGA 6134.340
## 3 LAX SFO 5829.239
## 4 LGA ORD 5783.587
## 5 JFK SFO 5165.326
## 6 FLL JFK 4974.610
## 7 BOS LGA 4563.407
## 8 LAS LAX 4527.000
## 9 ATL LGA 4494.505
## 10 JFK MCO 4246.630
## .. ... ... ...
What year year and quarter was the most traveled airport pair from?
## Source: local data frame [1 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## (dbl) (dbl) (int) (fctr) (fctr) (dbl) (dbl) (fctr)
## 1 2014 4 2510 JFK LAX 7021.087 412.7212 DL
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
I’m wondering now how many unique airports there are in the dataset:
## Source: local data frame [338 x 1]
##
## ap
## (chr)
## 1 YUM
## 2 YNG
## 3 YKM
## 4 XNA
## 5 WYS
## 6 WAS
## 7 VLD
## 8 UIN
## 9 TYR
## 10 TXK
## .. ...
So, there are 338 unique airports that exist in various airport pairs.
What are the top ten airports based on the cumulative count of passengers that traveled through them for the time period covered in the dataset?
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## (chr) (dbl) (dbl) (dbl)
## 1 ORD 3715830 339.7486 217.0218
## 2 LGA 3043005 281.5511 233.9431
## 3 DFW 2782758 269.6210 226.9230
## 4 EWR 2414071 226.1213 264.6953
## 5 JFK 2209363 215.4635 234.3288
## 6 TPA 2094884 243.4780 189.0277
## 7 LAX 2054506 549.7743 263.5185
## 8 DCA 1955809 187.9321 249.1613
## 9 MDW 1780883 190.6522 194.8835
## 10 IAH 1721053 181.3353 240.0298
## .. ... ... ... ...
What are the top ten airports based on the average count of passengers that traveled through them across all quarters?
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## (chr) (dbl) (dbl) (dbl)
## 1 ATL 1201812.8 1074.9668 192.5350
## 2 MCO 1128439.9 1068.5984 171.9714
## 3 DEN 874282.2 785.5186 202.4002
## 4 LAS 1231757.4 734.0628 196.4701
## 5 PHX 771706.4 633.5849 218.0178
## 6 SEA 643998.6 581.7512 244.1213
## 7 LAX 2054506.4 549.7743 263.5185
## 8 DTW 595011.0 549.4100 209.9226
## 9 MSP 591489.3 536.2550 234.4933
## 10 SAN 519371.6 511.1926 247.7528
## .. ... ... ... ...
I wonder what the most expensive airports to travel from are:
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## (chr) (dbl) (dbl) (dbl)
## 1 TSS 4.590513 0.1995875 616.5870
## 2 JRB 1.524128 0.2177326 516.5652
## 3 IYK 2960.440939 38.4472849 480.1068
## 4 DIK 80.115942 11.4451346 456.3364
## 5 ISN 394.069703 13.1356568 452.4509
## 6 SMX 227.339377 7.8392889 424.4812
## 7 SUN 3965.633287 18.0256059 377.8083
## 8 OXR 220.545458 6.3012988 372.9604
## 9 IDA 16632.282650 41.7896549 361.7234
## 10 ASE 32450.613572 49.9240209 359.4056
## .. ... ... ... ...
TSS is the East 34th Street Heliport in NY, which technically isn’t an airport, but is included in the DOT fare data. According to Wikipedia, 72% of the flights are airtaxi.
Moving on to other variables…
Looking at fare
, which is the price paid for one-way travel between an airport pair:
## 99%
## 493
The plot looks to have a good normal curve, but with a long tail. The 99th quantile of the dataset is just under 500, so it makes sense to set a limit on the x-axis to better view the histogram of fares
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 184.3 233.2 242.0 288.9 2567.0
I’m curious what the outliers look like (anything above the 99th quantile):
## Source: local data frame [1,600 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare
## (dbl) (dbl) (int) (fctr) (fctr) (dbl) (dbl)
## 1 2013 2 2583 HPN MRY 0.1098901 2566.860
## 2 2008 1 278 DAL XNA 0.1098901 2156.020
## 3 2011 4 2508 HPN MFR 0.1086957 1824.990
## 4 2008 2 1746 EGE JFK 0.2197802 1813.935
## 5 2003 3 1377 BGM EFD 0.1000000 1645.000
## 6 2007 2 283 JFK PHF 0.1098901 1311.990
## 7 2012 1 1372 DAL SBP 0.2197802 1308.550
## 8 2002 4 840 EFD GSP 0.1000000 1292.000
## 9 2011 1 1848 BUR MGM 0.1111111 1255.980
## 10 2014 1 2292 BUR EYW 0.1111111 1248.000
## .. ... ... ... ... ... ... ...
## Variables not shown: carrier_lg (fctr), large_ms (dbl), fare_lg (dbl),
## carrier_low (fctr), lf_ms (dbl), fare_low (dbl), haul (fctr)
Looks like most large fares involve large distances and I’ll want to look at the relationship between fares and distance when I get to bivariate analysis. I also notice that the average passengers per day is under 1 for those city pairs. Another interesting thing to look at is the market share for those city pairs. The DAL<->XNA pair seemed odd because of the low mileage, so I looked on Kayak to see what current prices were and they were all above $500.
I am curious about what the average fare is for airport pairs that have less than 1 person / day on average to see if it is high:
## Source: local data frame [1 x 1]
##
## avgfare
## (dbl)
## 1 297.2354
Surprisingly, it isn’t as high as I thought it would be, but does fall beyond the third quartile.
Next, I want to see a breakdown of fare
by haul
to see the fare distribution for short and medium flights:
## Source: local data frame [2 x 2]
##
## haul mean(fare)
## (fctr) (dbl)
## 1 short 195.8801
## 2 medium 251.7311
I can see from this graph that there are more medium flights than short flights in the dataset. It’s understable that the average fare for short flights would cost less than medium flights.
Looking at nsmiles
, which is the one-way distance between airport pairs:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67 619 962 1095 1476 2783
Mileage between airport pairs looks somewhat multimodal with heteroskedasticity or possibly a skewed normal distribution.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
If I re-scale the x-axis (log10), then I can view the long tail better:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 7.609 27.280 187.000 148.200 7021.000
The mean is larger than the median in this right-skewed histogram.
Now, let’s see a breakdown of passengers
for small and medium flights to see how the feature is affected by flight length:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Source: local data frame [2 x 2]
##
## haul mean(passengers)
## (fctr) (dbl)
## 1 short 212.5549
## 2 medium 181.5771
I find it interesting to see that short haul trips have more passengers on average than medium trips (e.g. coast-to-coast travel).
Next, looking at fare_lg
, which is the largest fare carrier’s average fare for the quarter:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 181.6 231.5 241.8 289.4 2567.0
The largest fares seem to closely match the average fares, which makes me think that very few low fares are sold in quantity.
So, now let’s look at fare_low
, which are the lowest fare carrier’s average fare:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.0 167.4 212.9 223.4 265.7 2567.0
The distribution looks compressed, but similar to the large fare distribution, albeit with a lower mean and median.
Now, I’m curious about the difference between the average low and large fares:
## Source: local data frame [1 x 1]
##
## fare_diff
## (dbl)
## 1 18.37674
In terms of percentage of the average fare:
## Source: local data frame [1 x 1]
##
## mean(fare_diff_percentage)
## (dbl)
## 1 0.07012206
So, it looks like fares have a variance of 7% of the average fare between the average large and low fares.
Now, let me take a look at large_ms
, which is the large fare carrier’s market share:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1000 0.5083 0.6772 0.6883 0.8965 1.0000
Seems like there is a sizable count of carriers with the largest fare that have 100% market share. I’d like to compare that to the rest:
## Source: local data frame [2 x 2]
##
## large_ms == 1 n
## (lgl) (int)
## 1 FALSE 147608
## 2 TRUE 12650
## n
## 1 0.08569996
So, 8.5% of all of the fare data is from carriers that have 100% market share between two airport pairs.
What is the market share distribution for large fare carriers based on trip length?
The distribution for medium length trips closely resembles the overall distribution, which is indicative of a higher quantity of medium haul trips having taken place over the reporting period.
So, what percentage of trips are short haul trips overall?
## Source: local data frame [2 x 2]
##
## haul == "short" n
## (lgl) (int)
## 1 FALSE 132369
## 2 TRUE 27889
## n
## 1 0.2106913
I was expecting a much lower percentage than 21%, however, it does help round out the picture of the overall market. Knowing that 79% of airport pairs are medium haul distance (> 500 miles) is corroborated by the mean statistic of nsmiles
.
Now, let me take a look at lf_ms
, which is the low fare carrier’s market share:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2171 0.4894 0.5293 0.8762 1.0000
This last plot is interesting, since there is a spike at 10%, so I looked at the data to verify that it was good, which it is. The reason for this spike in the data is explained by the DOT:
Note that the “lowest fare carrier” is the carrier with the lowest average fare that has at least a 10 percent share of the traffic in the market, except for markets where only a single carrier has a 10 percent or greater share.
Let’s take a closer look at this plot and what the underlying data looks like:
## Source: local data frame [7,270 x 8]
##
## Year quarter nsmiles airport_1 airport_2 carrier_low lf_ms fare_low
## (dbl) (dbl) (int) (fctr) (fctr) (fctr) (dbl) (dbl)
## 1 2001 3 696 ABE MDW CO 0.0741 179
## 2 2001 3 655 ABE ORD DL 0.0144 198
## 3 2001 3 1270 ABQ ATL TW 0.0834 143
## 4 2001 3 744 ABQ IAH AA 0.0286 117
## 5 2001 3 677 ABQ LAX HP 0.0853 97
## 6 2001 3 719 ABQ MCI CO 0.0185 99
## 7 2001 3 1122 ABQ MDW AA 0.0152 146
## 8 2001 3 889 ABQ OAK HP 0.0725 119
## 9 2001 3 610 ABQ SAT AA 0.0714 138
## 10 2001 3 718 ALB MDW CO 0.0220 109
## .. ... ... ... ... ... ... ... ...
From the above data it looks like ABQ (Albuquerque International Sunport) is dominated by a single large carrier.
Since we looked at distribution for large fare carriers based on trip length, we might as well look at the same for low fare carriers:
Nothing really stands out from this plot as significant to me. The distributions look roughly the same for both short haul and medium haul flights.
Finally, I’m curious about what percentage of carriers are both the largest and lowest fare carrier:
## Source: local data frame [1 x 1]
##
## n()/nrow(af)
## (dbl)
## 1 0.5591921
So, 55.9% of carriers are both the largest and lowest fare carrier, which could mean many things. One possibility might be that a majority of carriers remain competitive by offering low fares in addition to being the largest fare carrier.
There are 160,258 records in the dataset with 13 features (Year, quarter, nsmiles, airport_1, airport_2, passengers, fare, carrier_lg, large_ms, fare_lg, carrier_low, lf_ms, and fare_low). The variables airport_1, airport_2, carrier_lg, and carrier_low are unordered factor variables.
Other observations:
The main features of interest to me in the dataset are fare, nsmiles, and passengers. I’d like to figure out what variables influence the fare. My guess is that distance between airports is a factor and that the activity between airports may have some weight.
Market share may have an impact on the average fare. My guess would be that areas that have high market share have less competition and may have higher fares.
At the time of my univariate analysis I had not created any variables. However, after performing bivariate analysis I thought it would be good to create an additional categorical variable, haul, for segmenting short and medium trips. A medium trip is anything over 500 miles. I then went back and performed additional univariate analysis.
Data cleaning was the bulk of the work during my investigation. I saw anomalies in market share, large and low fares, and erroneous data that needed to be marked as NA. Some variable naming changed across the years of the dataset and some fields were combined, so I had to fix those as well. I added the fixup code to the data loading functions. These changes were necessary in order to have consistent data to analyze.
Let’s start by taking a look at the correlation matrix across all features:
## Year quarter nsmiles passengers
## Year 1.00000000 -0.0387753123 0.127808813 0.0140409715
## quarter -0.03877531 1.0000000000 -0.003134517 0.0007013818
## nsmiles 0.12780881 -0.0031345172 1.000000000 -0.0144414386
## passengers 0.01404097 0.0007013818 -0.014441439 1.0000000000
## fare 0.31893655 -0.0370997782 0.457826614 -0.1829057014
## large_ms -0.02335850 -0.0056159138 -0.343139151 -0.0690767254
## fare_lg 0.29604732 -0.0357017595 0.431890715 -0.1551365111
## lf_ms 0.08091732 -0.0161899485 -0.236037321 -0.0919099302
## fare_low 0.32965613 -0.0215216567 0.376823089 -0.1788191433
## fare large_ms fare_lg lf_ms fare_low
## Year 0.31893655 -0.023358503 0.29604732 0.08091732 0.329656135
## quarter -0.03709978 -0.005615914 -0.03570176 -0.01618995 -0.021521657
## nsmiles 0.45782661 -0.343139151 0.43189072 -0.23603732 0.376823089
## passengers -0.18290570 -0.069076725 -0.15513651 -0.09190993 -0.178819143
## fare 1.00000000 -0.140722104 0.96032828 -0.07386059 0.905739991
## large_ms -0.14072210 1.000000000 -0.13615080 0.69085587 0.004520862
## fare_lg 0.96032828 -0.136150796 1.00000000 -0.11410075 0.873874571
## lf_ms -0.07386059 0.690855870 -0.11410075 1.00000000 0.081461978
## fare_low 0.90573999 0.004520862 0.87387457 0.08146198 1.000000000
Based on the correlation matrix it seems that it’s worth looking into mileage, year, and fare further:
##
## Pearson's product-moment correlation
##
## data: af$nsmiles and af$fare
## t = 206.15, df = 160260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4539482 0.4616877
## sample estimates:
## cor
## 0.4578266
##
## Pearson's product-moment correlation
##
## data: af$Year and af$fare
## t = 134.71, df = 160260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3145317 0.3233276
## sample estimates:
## cor
## 0.3189366
As predicted, there is a strong positive correlation between mileage and fare, but passenger count has a negligible relationship to price. Year has a moderate positive correlation with fare.
Let me perform a plot matrix of Year
, quarter
, fare
, nsmiles
, passengers
, carrier_lg
, large_ms
to get a sense of what the data looks like visually:
Based on the correlation values between nsmiles
and fare
, let’s take a look at that plot:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
The trend line (red), shows a positive correlation between mileage and fare. In other words, fare increases as mileage increases, which makes sense intuitively. The blue line shows the smoothing of the point data and the green line shows the low fare carrier’s average fare (smoothed as well).
Let’s also take a look at fare
as categorized by haul
(trip length):
## Source: local data frame [2 x 4]
##
## haul mean(fare) median(fare) IQR(fare)
## (fctr) (dbl) (dbl) (dbl)
## 1 short 195.8801 186.556 100.1681
## 2 medium 251.7311 241.890 101.9789
Interestingly enough, the interquartile range (IQR) is roughly the same between short and medium haul trips.
Let’s take a look at Year
and fare
:
From the look of the fare data plotted it looks like there was an increase in air travel for 2011-2013, which has a darker region. There’s also a average trend upward (dotted line) of fare increasing over time.
Now, let me take a look at nsmiles
and large_ms
:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
With the concentration of 100% market share carriers, it makes sense to separate out the 100% market share carriers and view a distribution:
So, 100% market share carriers seem to have bi-modal distribution primarily around 200 and 900 miles when looking at the density smoothing.
Continuing with a scatterplot of the rest of the large fare carriers not having 100% market share:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
It is interesting to see that carriers with the largest fare lose market share as the distance increases. My guess is that with the correlation between distance and fare that there is more competition across longer flights because there is more revenue to be gained.
Now, let’s do the same with low fare carriers by looking at nsmiles
and lf_ms
:
The low fare 100% market share carriers also seem to have bi-modal distribution around 200 and 900 miles.
Continuing with a scatterplot of the rest of the low fare carriers not having 100% market share:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
There is also a similar trend here where carriers lose market share as the flight distance increases, however, a little more significantly than large fare carriers.
Continuing with passengers
, what’s the trend across all of the years in the dataset?
Looks like 2012 was the best year for average passengers traveling, a drop in 2013, and a return in 2014.
I wonder what we can see if we plot the fare variances across all the large fare carriers:
It’s a busy plot, however I can see that United Airlines (UA) has the largest median fare. Let me double check this is the case by computing median statistics of the large fare for all large fare carriers (sorted high-to-low):
## Source: local data frame [38 x 2]
##
## carrier_lg median_fare_lg
## (fctr) (dbl)
## 1 UA 275.080
## 2 VX 266.665
## 3 9K 261.735
## 4 DL 251.130
## 5 AA 238.210
## 6 US 233.260
## 7 RP 232.330
## 8 HP 231.270
## 9 CO 219.090
## 10 NW 218.700
## .. ... ...
Might as well view the plot of the top ten large carriers (based on median fare):
Let me do the same for low fare carriers (skipping the large plot):
Cape Air (9K) has the largest fare among the low fare carriers. I’ll double check by similarly computing and sorting the median statistic for low fares across all low fare carriers:
## Source: local data frame [48 x 2]
##
## carrier_low median_fare_low
## (fctr) (dbl)
## 1 9K 253.240
## 2 UA 244.050
## 3 WS 239.930
## 4 RP 228.030
## 5 AA 227.290
## 6 DL 225.460
## 7 US 225.150
## 8 VX 209.785
## 9 AS 206.194
## 10 HP 203.910
## .. ... ...
After taking a look at the airlines, I’m curious what I might find if I plot the fares across the most traveled airports:
The dotted line is the overall median of fare
across all airport pairs. It’s interesting to see the average fares among the most traveled airports and see which ones are higher or lower than the median fare among the group.
So, which of the airports in the plot above have a median fare above the overall median fare?
## Source: local data frame [6 x 3]
##
## larger_ap median_fare diff
## (chr) (dbl) (dbl)
## 1 LAX 275.4786 42.290852
## 2 EWR 261.3781 28.190415
## 3 VPS 244.9723 11.784548
## 4 DCA 242.0185 8.830777
## 5 XNA 238.5907 5.402982
## 6 IAH 237.1551 3.967411
Well, it looks like if you’re in any of the cities above that you’re paying more than the median fare across all airports, but it also depends on where you’re going.Also, it’s worth mentioning that these airports serve plenty of markets, which could be a factor.
Finally, let’s take a first step towards building a linear model based on fare
and nsmiles
:
##
## Call:
## lm(formula = fare ~ nsmiles, data = af)
##
## Residuals:
## Min 1Q Median 3Q Max
## -252.76 -50.20 -9.17 38.68 2232.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.741e+02 3.798e-01 458.4 <2e-16 ***
## nsmiles 6.200e-02 3.008e-04 206.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.7 on 160256 degrees of freedom
## Multiple R-squared: 0.2096, Adjusted R-squared: 0.2096
## F-statistic: 4.25e+04 on 1 and 160256 DF, p-value: < 2.2e-16
Looks like we can account for 20.96% of the fare variance with this model.
Fare correlates strongly with the distance that is being traveled. There has also been a slow trend towards increasing prices since 2005, but it might be good to compare this to inflation to see if actual costs have increased. Passengers had a negligible relationship to fare.
Based on the \(R^2\) value, distance accounts for 20.96% of the variance in fare. Other features can be incorporated into the model to better explain the variance in price, which I will build on in multivariate analysis.
Yes, market share seems to drop when distance traveled increases. I think that is because there is more competition among longer distance flights, since there is more revenue to be gained.
The strongest relationship I found was between fare and distance.
To ease into our multivariate analysis, let’s first take a look at the ratio of fare
to nsmiles
(bivariate), which could be considered fare value or cost per mile for fares (split by trip length):
## Source: local data frame [1 x 1]
##
## short_haul_value
## (dbl)
## 1 0.6775934
## Source: local data frame [1 x 1]
##
## med_haul_value
## (dbl)
## 1 0.2301141
If you’re traveling under 500 miles, then you can expect to pay $0.68 per mile on average. If you’re traveling over 500 miles, then you can can expect to pay roughly $0.23 per mile on average.
Now, I want to see a better breakdown of fare value per distance traveled:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
##
## Formula: fare/nsmiles ~ A/sqrt(nsmiles)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## A 9.6914 0.0125 775.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1875 on 160257 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 6.06e-11
As you can see there is a lot of variance in cost per mile for travel under 500 miles. However, as a consumer if you travel larger distances you will eventually enjoy much better costs per mile.
In the plot above, the individual data points (colored in red or blueish-green) are overlayed with a smoothed conditional mean of the data in blue. The green and yellow lines are piece-wise linear fits for short and medium haul trips. Finally, the black line is a rough approximation of a non-linear fit that is essentially an inverse square root function.
With the above plot of fare value value being non-linear, it makes me want to revisit fare vs miles traveled (bivariate analysis) with different scales:
I now have a much better linear fit here. Let’s revisit the linear model and see if it improves:
##
## Call:
## lm(formula = I(log10(fare)) ~ I(sqrt(nsmiles)), data = af)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14930 -0.08221 0.00236 0.08421 1.08870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.119e+00 1.148e-03 1846.2 <2e-16 ***
## I(sqrt(nsmiles)) 7.545e-03 3.468e-05 217.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.133 on 160256 degrees of freedom
## Multiple R-squared: 0.2279, Adjusted R-squared: 0.2279
## F-statistic: 4.731e+04 on 1 and 160256 DF, p-value: < 2.2e-16
\(R^2\) went up by 1.83% with this change, so we can now better account for the relationship between fares and distance between airports.
Continuing with the concept of fare value, let’s take a look at the fare value across the largest airports (by activity):
## Source: local data frame [10 x 2]
##
## larger_ap median_value
## (chr) (dbl)
## 1 LAX 5.768249
## 2 TPA 5.688935
## 3 JFK 4.678094
## 4 LGA 4.536045
## 5 IAH 4.080295
## 6 MDW 4.064503
## 7 EWR 4.023544
## 8 DCA 3.921803
## 9 DFW 3.913508
## 10 ORD 3.585313
Here, we can see that LAX (Los Angeles Airport) fares enjoy the most bang for the buck (i.e. miles per dollar) among the larger airports. ORD (Chicago Airport) is among the weakest in terms of miles per dollar. You could consider this if you were moving to either of these two cities and planning to travel quite a bit.
I now want to see what the fare value is across large fare carriers:
While this is a fairly busy plot, there is one airline that sticks out among the rest: National Airlines (N7). Let’s look at some of the data for the airline:
## Source: local data frame [12 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare
## (dbl) (dbl) (int) (fctr) (fctr) (dbl) (dbl)
## 1 2002 1 1515 LAS ORD 2074.440 142.7000
## 2 2002 2 1515 LAS ORD 2441.640 139.4100
## 3 2002 3 2066 IAD LAS 320.970 209.5000
## 4 2002 3 1515 LAS ORD 2437.500 127.9900
## 5 2001 2 2248 JFK LAS 2223.626 169.5230
## 6 2001 2 1521 LAS MDW 1936.044 127.0379
## 7 2001 3 1055 DFW LAS 1474.000 138.0000
## 8 2001 3 2248 JFK LAS 2227.000 166.0000
## 9 2001 3 1521 LAS MDW 1628.000 123.0000
## 10 2001 3 2176 LAS MIA 684.000 158.0000
## 11 2001 3 2177 LAS PHL 1297.000 156.0000
## 12 2001 3 414 LAS SFO 2340.000 77.0000
## Variables not shown: carrier_lg (fctr), large_ms (dbl), fare_lg (dbl),
## carrier_low (fctr), lf_ms (dbl), fare_low (dbl), haul (fctr)
National Airlines gave the best mileage per dollar fares between 2001-2002, but it might explain why they are no longer around, too.
Let’s zoom in on the fare value among the most popular carriers:
## Source: local data frame [7 x 2]
##
## carrier_lg median_value
## (fctr) (dbl)
## 1 VX 7.246457
## 2 B6 6.754951
## 3 WN 5.078585
## 4 AA 4.472823
## 5 UA 4.424600
## 6 DL 3.969786
## 7 US 3.059363
Historically, it looks like Virgin America and JetBlue Airways are the two airlines that provide the best value for mileage per dollar. If you’re traveling long distances across country, it’s likely you’ll find good fares with those two airlines.
Let’s take a look at the fare value over the years (bivariate):
Over the years we have seen an increasing cost per mile traveled, but currently we’re experience a downward trend in cost per mile.
So, revisiting our linear model of log10(fare) ~ sqrt(nsmiles) + additional variables, let’s take a look at residuals:
## R-squared for model: 0.556
## [1] 0.1008139
## Don't know how to automatically pick scale for object of type AsIs. Defaulting to continuous
The individual residual data is plotted as points. The blue line is the standard deviation of all residuals, which is 0.1. The black line plot is the standard deviation at each reference mileage (or rather sqrt of mileage).
With an understanding of the strong relationship between distance and fare it made sense to me to look at other variables from that perspective. After seeing the relationship of fare value (cost per mile) to distance I thought to improve on the linear model that I had started in bivariate analysis.
The cost per mile graph was surprising in that the cost can go from $1 / mi at around a 250 mile distance to less than $0.40 / mi at around a 500 mile distance. The additional analysis into worst and best value airports and carriers was interesting to me.
Yes, I created a linear model using the log10 of fare and the square root of nsmiles. Other variables included are Year and airport pair (i.e. airport_1 and airport_2).
The variables of the linear model account for 55.6% of the variance in airline fares.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: position_dodge requires constant width: output may be incorrect
One interesting aspect of travel in general is the cost per mile for a transport mode. The cost per mile for air travel has fluctated over time and was the worst for a 14 year span (2001-2014) in 2008. Currently, we are experiencing a downward trend in cost per mile traveled as of 2014 Q4 with an inflection point in 2013. The blue line depicts the average cost per mile on a quarterly interval time series, which you can see fluctuates from quarter to quarter. The orange line is a smoothing of the same data to show trends more easily. The boxplot underneath shows the variance of cost per mile segmented yearly with the box representing the interquartile range (IQR) and the median by the horizontal bar within the plot. The median cost per mile was also the highest in 2008.
Mileage per dollar spent can be a metric for value. If you consider the mileage per dollar for airports, then you might have a sense of which airports offer good value for flying to or from, especially if you have a business where travel is frequent and you would like to minimize costs. If you consider the mileage per dollar for popular airlines, then you might have a sense of which airlines generally are going to have the best prices and ultimately, which frequent flyer programs or credit cards to join / acquire.
The plot of mileage per dollar across the ten largest airports (by activity) shows that Los Angeles International Airport (LAX) and Tampa International Airport (TPA) have the best median mileage per dollar spent. O’Hare International Airport (ORD) has the worst median mileage per dollar.
The plot of mileage per dollar across the most popular airlines shows that Virgin America and JetBlue Airways provide the best value in terms of how far you can travel per dollar. US Airways (now owned by American Airlines) is the worst in terms of mileage per dollar.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Another way of considering value with air travel is how much the cost per mile is for a given trip length. If you consider the value based on traveled miles, then you might travel farther to get the most travel for your money.
The plot of fare value (i.e. cost per mile) with one-way distance between airports roughly follows an inverse square root function. The cost per mile has high variance for travel up to 500 miles as you can see values can vary (roughly) between $1.25 (and beyond) to $0.38 after smoothing. For travel above 500 miles, the variance roughly sits between $0.38 and $0.12 after smoothing. It’s interesting to see that the distinction between short and medium haul trips by the air travel industry at 500 miles matches a “corner turn” in terms of fare value as well.
The DOT Domestic Airline Consumer Airfare Report has quarterly airfare data for 1996-2014. Unfortunately, only 2001-2014 had complete and usable data, so that is what was used by me to amass the 160,258 airport pair records. I started by simply cleaning and aggregating the data. Then, I explored single, double, and multiple variables in various ways to better understand the data. I was surprised that passenger activity did not have much weight on the predictive model of fare, but indirectly that may have been captured in the airport pair. The only strong positive correlation with fare was distance between airports, which was the basis of the linear model created. A secondary influence was Year, since there has been a general trend of increasing prices over time. The linear model improved once I realized that converting a non-linear relationship to a linear relationship (log10(fare) ~ sqrt(nsmiles) would better account for variance. Some limitations with this model are that it is averaged data instead of individual fares and that there are missing factors, such as how many days before the flight that the fare was purchased. If I had powerful computing at my disposal I would want to perform a similar analysis to this one on individual fare data.
Some interesting observations that I made while performing this analysis: