Airfare Exploratory Data Analysis (EDA)

by Amir Ebrahimi

The analysis that follows make use of the the DOT’s quarterly airfare data from the Domestic Airline Consumer Airfare Report. The reports and airfare data can be found in original form at:

http://www.transportation.gov/policy/aviation-policy/domestic-airline-consumer-airfare-report

My goals for this analysis are primarily to have an informed perspective of air travel from actual data. Secondarily, I’m hoping that some practical facts emerge that any traveler could benefit from. I will perform univariate, bivariate, and multivariate analysis in order to achieve these goals. Finally, I will attempt to build a linear model that can account for relationships among the data.

Some records were missing the information for the low and large fare carriers. Others had erroneous fare data. Incomplete records are not that useful to me in the analysis and since the NA data is less than 1% of the overall data I am deciding to exclude these records from the analysis. The summary below shows the individual NA counts per variable.

##       Year         quarter         nsmiles         airport_1  
##  Min.   :2001   Min.   :1.000   Min.   : 101.0   DAL    :174  
##  1st Qu.:2005   1st Qu.:1.000   1st Qu.: 840.2   MDW    :120  
##  Median :2009   Median :2.000   Median :1194.0   JFK    : 64  
##  Mean   :2008   Mean   :2.431   Mean   :1181.6   EWR    : 56  
##  3rd Qu.:2012   3rd Qu.:3.000   3rd Qu.:1617.0   HOU    : 51  
##  Max.   :2014   Max.   :4.000   Max.   :2620.0   LGB    : 49  
##                                 NA's   :5        (Other):971  
##    airport_2     passengers            fare           carrier_lg  
##  MDW    :180   Min.   :  0.1000   Min.   :  63.95   99     :   4  
##  LGB    : 75   1st Qu.:  0.1099   1st Qu.: 216.05   CO     :   3  
##  JFK    : 66   Median :  0.2174   Median : 298.00   G4     :   2  
##  HOU    : 64   Mean   :  0.7028   Mean   : 344.06   AA     :   1  
##  SWF    : 64   3rd Qu.:  0.3297   3rd Qu.: 409.22   US     :   1  
##  PIE    : 61   Max.   :137.0000   Max.   :2133.00   (Other):   0  
##  (Other):975                                        NA's   :1474  
##     large_ms         fare_lg        carrier_low       lf_ms     
##  Min.   :0.0454   Min.   : 95.76   99     :   4   Min.   :1     
##  1st Qu.:0.0670   1st Qu.:121.76   G4     :   2   1st Qu.:1     
##  Median :0.0800   Median :196.00   AA     :   1   Median :1     
##  Mean   :0.4656   Mean   :256.97   3M     :   0   Mean   :1     
##  3rd Qu.:1.0000   3rd Qu.:322.75   9K     :   0   3rd Qu.:1     
##  Max.   :1.0000   Max.   :618.00   (Other):   0   Max.   :1     
##  NA's   :1478     NA's   :1478     NA's   :1478   NA's   :1482  
##     fare_low     
##  Min.   : 95.76  
##  1st Qu.:102.64  
##  Median :109.52  
##  Mean   :113.09  
##  3rd Qu.:121.76  
##  Max.   :134.00  
##  NA's   :1482
## [1] "Omitting 1485 NA records (0.92% of all records)"

Univariate Plots Section

The variables in the dataset are:

##  [1] "Year"        "quarter"     "nsmiles"     "airport_1"   "airport_2"  
##  [6] "passengers"  "fare"        "carrier_lg"  "large_ms"    "fare_lg"    
## [11] "carrier_low" "lf_ms"       "fare_low"    "haul"

The structure of the dataset and some sample values are:

## Classes 'tbl_df' and 'data.frame':   160258 obs. of  14 variables:
##  $ Year       : num  2001 2001 2001 2001 2001 ...
##  $ quarter    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ nsmiles    : int  1300 151 1308 1347 1351 160 1339 646 654 963 ...
##  $ airport_1  : Factor w/ 295 levels "ABE","ABI","ABQ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ airport_2  : Factor w/ 299 levels "ATL","AUS","AVL",..: 48 51 54 64 106 113 115 160 195 278 ...
##  $ passengers : num  0.889 9.444 41.667 0.222 2.222 ...
##  $ fare       : num  130 298 355 175 322 ...
##  $ carrier_lg : Factor w/ 39 levels "3M","99","9K",..: 9 33 33 9 11 32 33 24 32 33 ...
##  $ large_ms   : num  0.75 0.965 0.421 1 0.7 ...
##  $ fare_lg    : num  149 304 337 175 358 ...
##  $ carrier_low: Factor w/ 49 levels "3M","99","9K",..: 13 42 42 11 29 40 42 29 40 42 ...
##  $ lf_ms      : num  0.25 0.965 0.421 1 0.2 ...
##  $ fare_low   : num  74.5 304.1 337.4 174.5 129.3 ...
##  $ haul       : Factor w/ 2 levels "short","medium": 2 1 2 2 2 1 2 2 2 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1485] 33 95 194 266 270 427 430 432 441 579 ...
##   .. ..- attr(*, "names")= chr [1:1485] "33" "95" "194" "266" ...

Airports and carriers are factored variables in the dataset. The full names for the airport and carriers can be looked up by code at IATA.

## $airport_1
##   [1] "ABE" "ABI" "ABQ" "ABY" "ACK" "ACT" "ACV" "ACY" "AEX" "AGS" "ALB"
##  [12] "ALO" "ALW" "AMA" "APF" "ART" "ASE" "ATL" "ATW" "AUS" "AUW" "AVL"
##  [23] "AVP" "AZA" "AZO" "BDL" "BFL" "BGM" "BGR" "BHB" "BHM" "BIL" "BIS"
##  [34] "BJI" "BKG" "BLI" "BLV" "BMI" "BNA" "BOI" "BOS" "BPT" "BQK" "BRO"
##  [45] "BTM" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CEC" "CHA"
##  [56] "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL" "CLT" "CMH" "CMI"
##  [67] "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG" "CVG" "CWA" "CYS"
##  [78] "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN" "DIK" "DLH"
##  [89] "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "EKO" "ELM" "ELP"
## [100] "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA" "FLG"
## [111] "FLL" "FLO" "FNL" "FNT" "FOE" "FSD" "FSM" "FWA" "FYV" "GCC" "GEG"
## [122] "GFK" "GJT" "GNV" "GPT" "GRB" "GRK" "GRR" "GSO" "GSP" "GTF" "GTR"
## [133] "GUC" "GYY" "HDN" "HFD" "HGR" "HHH" "HLN" "HOB" "HOU" "HPN" "HRL"
## [144] "HSV" "HTS" "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "ILE" "ILG"
## [155] "ILM" "IND" "INL" "IPL" "IPT" "ISN" "ISP" "ITH" "JAC" "JAN" "JAX"
## [166] "JFK" "JLN" "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LCH" "LCK" "LEB"
## [177] "LEX" "LFT" "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWS" "LYH"
## [188] "MAF" "MBS" "MCI" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [199] "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH" "MOB"
## [210] "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY" "MYR"
## [221] "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL" "OTH"
## [232] "PAH" "PBG" "PBI" "PDX" "PFN" "PGV" "PHF" "PHL" "PHX" "PIA" "PIE"
## [243] "PIH" "PIT" "PLN" "PNS" "PSC" "PSM" "PSP" "PUW" "PVD" "PWM" "RAP"
## [254] "RDD" "RDM" "RDU" "RFD" "RIC" "RNO" "ROA" "ROC" "RST" "RSW" "SAF"
## [265] "SAN" "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SDF" "SEA" "SFO"
## [276] "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SNA" "SPI" "SRQ" "STL"
## [287] "SUN" "SWF" "SYR" "TLH" "TOL" "TPA" "TRI" "TYS" "VPS"
## 
## $airport_2
##   [1] "ATL" "AUS" "AVL" "AVP" "AZA" "BDL" "BFL" "BHM" "BIL" "BKG" "BLI"
##  [12] "BMI" "BNA" "BOI" "BOS" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE"
##  [23] "CAK" "CEC" "CHA" "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL"
##  [34] "CLT" "CMH" "CMI" "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG"
##  [45] "CVG" "CWA" "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN"
##  [56] "DIK" "DLH" "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "ELM"
##  [67] "ELP" "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA"
##  [78] "FLG" "FLL" "FLO" "FNT" "FSD" "FSM" "FWA" "FYV" "GCC" "GCK" "GEG"
##  [89] "GFK" "GJT" "GNV" "GPT" "GRB" "GRI" "GRK" "GRR" "GSO" "GSP" "GTF"
## [100] "GTR" "GUC" "HDN" "HFD" "HHH" "HLN" "HOU" "HPN" "HRL" "HSV" "HTS"
## [111] "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "IFP" "ILE" "ILG" "ILM"
## [122] "IND" "IPT" "ISN" "ISP" "ITH" "IYK" "JAC" "JAN" "JAX" "JFK" "JLN"
## [133] "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LBE" "LCH" "LEB" "LEX" "LFT"
## [144] "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWB" "LWS" "LYH" "MAF"
## [155] "MBS" "MCI" "MCN" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [166] "MGW" "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH"
## [177] "MOB" "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY"
## [188] "MYR" "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL"
## [199] "OTH" "OXR" "PAH" "PBG" "PBI" "PDX" "PFN" "PGD" "PGV" "PHF" "PHL"
## [210] "PHX" "PIA" "PIE" "PIH" "PIT" "PLN" "PNS" "PQI" "PSC" "PSM" "PSP"
## [221] "PUW" "PVC" "PVD" "PVU" "PWM" "RAP" "RDD" "RDM" "RDU" "RFD" "RHI"
## [232] "RIC" "RKD" "RKS" "RNO" "ROA" "ROC" "ROW" "RST" "RSW" "SAF" "SAN"
## [243] "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SCK" "SDF" "SEA" "SFB"
## [254] "SFO" "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SMX" "SNA" "SPI"
## [265] "SPS" "SRQ" "SSI" "STC" "STL" "STS" "SUN" "SUX" "SWF" "SYR" "TEX"
## [276] "TLH" "TOL" "TPA" "TRI" "TSS" "TTN" "TUL" "TUP" "TUS" "TVC" "TWF"
## [287] "TXK" "TYR" "TYS" "UIN" "VGT" "VLD" "VPS" "WAS" "WYS" "XNA" "YKM"
## [298] "YNG" "YUM"
## 
## $carrier_lg
##  [1] "3M" "99" "9K" "AA" "AS" "AX" "AZ" "B6" "CO" "DH" "DL" "E9" "F9" "FL"
## [15] "G4" "GQ" "HP" "JI" "KS" "LH" "N7" "NJ" "NK" "NW" "OS" "PN" "RP" "SY"
## [29] "TW" "TZ" "U5" "UA" "US" "VX" "WN" "XP" "YX" "YY" "ZV"
## 
## $carrier_low
##  [1] "3M" "99" "9K" "AA" "AQ" "AS" "AX" "AZ" "B6" "BA" "CO" "DH" "DL" "E9"
## [15] "EY" "F9" "FL" "G4" "GQ" "HP" "IB" "JI" "KS" "LH" "N7" "NH" "NJ" "NK"
## [29] "NW" "OH" "OO" "OS" "PN" "RP" "SN" "SY" "TW" "TZ" "U5" "UA" "UP" "US"
## [43] "VX" "WN" "WS" "YV" "YX" "YY" "ZV"

Here is a summary of all of the variables in the dataset:

##       Year         quarter        nsmiles       airport_1     
##  Min.   :2001   Min.   :1.00   Min.   :  67   DCA    :  7730  
##  1st Qu.:2005   1st Qu.:1.00   1st Qu.: 619   DFW    :  7608  
##  Median :2010   Median :2.00   Median : 962   EWR    :  7213  
##  Mean   :2009   Mean   :2.47   Mean   :1095   IAD    :  5596  
##  3rd Qu.:2012   3rd Qu.:3.00   3rd Qu.:1476   IAH    :  5110  
##  Max.   :2014   Max.   :4.00   Max.   :2783   JFK    :  5071  
##                                               (Other):121930  
##    airport_2        passengers            fare          carrier_lg   
##  TPA    :  8271   Min.   :   0.100   Min.   :  17.0   DL     :37638  
##  ORD    :  7902   1st Qu.:   7.609   1st Qu.: 184.3   AA     :24327  
##  LGA    :  5838   Median :  27.283   Median : 233.2   UA     :23099  
##  MDW    :  5660   Mean   : 186.968   Mean   : 242.0   US     :20919  
##  JFK    :  5183   3rd Qu.: 148.152   3rd Qu.: 288.9   WN     :19847  
##  IAD    :  4719   Max.   :7021.087   Max.   :2566.9   CO     :13325  
##  (Other):122685                                       (Other):21103  
##     large_ms         fare_lg        carrier_low        lf_ms       
##  Min.   :0.1000   Min.   :  17.0   DL     :34342   Min.   :0.0100  
##  1st Qu.:0.5083   1st Qu.: 181.6   AA     :28543   1st Qu.:0.2171  
##  Median :0.6772   Median : 231.5   UA     :19720   Median :0.4894  
##  Mean   :0.6883   Mean   : 241.8   US     :19502   Mean   :0.5293  
##  3rd Qu.:0.8965   3rd Qu.: 289.4   WN     :16374   3rd Qu.:0.8762  
##  Max.   :1.0000   Max.   :2566.9   CO     :12084   Max.   :1.0000  
##                                    (Other):29693                   
##     fare_low          haul       
##  Min.   :  12.0   short : 27889  
##  1st Qu.: 167.3   medium:132369  
##  Median : 212.9                  
##  Mean   : 223.4                  
##  3rd Qu.: 265.7                  
##  Max.   :2566.9                  
## 

Across all of the airport pair markets:

What is the shortest airport pair distance?

## Source: local data frame [1 x 3]
## 
##   airport_1 airport_2 nsmiles
##      (fctr)    (fctr)   (int)
## 1       MKE       ORD      67

I’m curious which airport pairs had the most passengers traveling (on average) across all quarters:

## Source: local data frame [7,824 x 3]
## 
##    airport_1 airport_2 mean_passengers
##       (fctr)    (fctr)           (dbl)
## 1        LAX       SFO        4853.820
## 2        JFK       LAX        4450.665
## 3        LGA       ORD        4016.727
## 4        FLL       LGA        3635.271
## 5        ATL       LGA        3553.825
## 6        JFK       SFO        3130.356
## 7        LAX       ORD        2823.389
## 8        LAS       LAX        2815.348
## 9        FLL       JFK        2769.823
## 10       EWR       MCO        2764.043
## ..       ...       ...             ...

Now, I’m wanting to see what the most traveled airport pairs were for any given quarter were:

## Source: local data frame [7,824 x 3]
## 
##    airport_1 airport_2 max_passengers
##       (fctr)    (fctr)          (dbl)
## 1        JFK       LAX       7021.087
## 2        FLL       LGA       6134.340
## 3        LAX       SFO       5829.239
## 4        LGA       ORD       5783.587
## 5        JFK       SFO       5165.326
## 6        FLL       JFK       4974.610
## 7        BOS       LGA       4563.407
## 8        LAS       LAX       4527.000
## 9        ATL       LGA       4494.505
## 10       JFK       MCO       4246.630
## ..       ...       ...            ...

What year year and quarter was the most traveled airport pair from?

## Source: local data frame [1 x 14]
## 
##    Year quarter nsmiles airport_1 airport_2 passengers     fare carrier_lg
##   (dbl)   (dbl)   (int)    (fctr)    (fctr)      (dbl)    (dbl)     (fctr)
## 1  2014       4    2510       JFK       LAX   7021.087 412.7212         DL
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
##   lf_ms (dbl), fare_low (dbl), haul (fctr)

I’m wondering now how many unique airports there are in the dataset:

## Source: local data frame [338 x 1]
## 
##       ap
##    (chr)
## 1    YUM
## 2    YNG
## 3    YKM
## 4    XNA
## 5    WYS
## 6    WAS
## 7    VLD
## 8    UIN
## 9    TYR
## 10   TXK
## ..   ...

So, there are 338 unique airports that exist in various airport pairs.

What are the top ten airports based on the cumulative count of passengers that traveled through them for the time period covered in the dataset?

## Source: local data frame [338 x 4]
## 
##       ap sum_passengers avg_passengers avg_fare
##    (chr)          (dbl)          (dbl)    (dbl)
## 1    ORD        3715830       339.7486 217.0218
## 2    LGA        3043005       281.5511 233.9431
## 3    DFW        2782758       269.6210 226.9230
## 4    EWR        2414071       226.1213 264.6953
## 5    JFK        2209363       215.4635 234.3288
## 6    TPA        2094884       243.4780 189.0277
## 7    LAX        2054506       549.7743 263.5185
## 8    DCA        1955809       187.9321 249.1613
## 9    MDW        1780883       190.6522 194.8835
## 10   IAH        1721053       181.3353 240.0298
## ..   ...            ...            ...      ...

What are the top ten airports based on the average count of passengers that traveled through them across all quarters?

## Source: local data frame [338 x 4]
## 
##       ap sum_passengers avg_passengers avg_fare
##    (chr)          (dbl)          (dbl)    (dbl)
## 1    ATL      1201812.8      1074.9668 192.5350
## 2    MCO      1128439.9      1068.5984 171.9714
## 3    DEN       874282.2       785.5186 202.4002
## 4    LAS      1231757.4       734.0628 196.4701
## 5    PHX       771706.4       633.5849 218.0178
## 6    SEA       643998.6       581.7512 244.1213
## 7    LAX      2054506.4       549.7743 263.5185
## 8    DTW       595011.0       549.4100 209.9226
## 9    MSP       591489.3       536.2550 234.4933
## 10   SAN       519371.6       511.1926 247.7528
## ..   ...            ...            ...      ...

I wonder what the most expensive airports to travel from are:

## Source: local data frame [338 x 4]
## 
##       ap sum_passengers avg_passengers avg_fare
##    (chr)          (dbl)          (dbl)    (dbl)
## 1    TSS       4.590513      0.1995875 616.5870
## 2    JRB       1.524128      0.2177326 516.5652
## 3    IYK    2960.440939     38.4472849 480.1068
## 4    DIK      80.115942     11.4451346 456.3364
## 5    ISN     394.069703     13.1356568 452.4509
## 6    SMX     227.339377      7.8392889 424.4812
## 7    SUN    3965.633287     18.0256059 377.8083
## 8    OXR     220.545458      6.3012988 372.9604
## 9    IDA   16632.282650     41.7896549 361.7234
## 10   ASE   32450.613572     49.9240209 359.4056
## ..   ...            ...            ...      ...

TSS is the East 34th Street Heliport in NY, which technically isn’t an airport, but is included in the DOT fare data. According to Wikipedia, 72% of the flights are airtaxi.

Moving on to other variables…

Looking at fare, which is the price paid for one-way travel between an airport pair:

## 99% 
## 493

The plot looks to have a good normal curve, but with a long tail. The 99th quantile of the dataset is just under 500, so it makes sense to set a limit on the x-axis to better view the histogram of fares

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0   184.3   233.2   242.0   288.9  2567.0

I’m curious what the outliers look like (anything above the 99th quantile):

## Source: local data frame [1,600 x 14]
## 
##     Year quarter nsmiles airport_1 airport_2 passengers     fare
##    (dbl)   (dbl)   (int)    (fctr)    (fctr)      (dbl)    (dbl)
## 1   2013       2    2583       HPN       MRY  0.1098901 2566.860
## 2   2008       1     278       DAL       XNA  0.1098901 2156.020
## 3   2011       4    2508       HPN       MFR  0.1086957 1824.990
## 4   2008       2    1746       EGE       JFK  0.2197802 1813.935
## 5   2003       3    1377       BGM       EFD  0.1000000 1645.000
## 6   2007       2     283       JFK       PHF  0.1098901 1311.990
## 7   2012       1    1372       DAL       SBP  0.2197802 1308.550
## 8   2002       4     840       EFD       GSP  0.1000000 1292.000
## 9   2011       1    1848       BUR       MGM  0.1111111 1255.980
## 10  2014       1    2292       BUR       EYW  0.1111111 1248.000
## ..   ...     ...     ...       ...       ...        ...      ...
## Variables not shown: carrier_lg (fctr), large_ms (dbl), fare_lg (dbl),
##   carrier_low (fctr), lf_ms (dbl), fare_low (dbl), haul (fctr)

Looks like most large fares involve large distances and I’ll want to look at the relationship between fares and distance when I get to bivariate analysis. I also notice that the average passengers per day is under 1 for those city pairs. Another interesting thing to look at is the market share for those city pairs. The DAL<->XNA pair seemed odd because of the low mileage, so I looked on Kayak to see what current prices were and they were all above $500.

I am curious about what the average fare is for airport pairs that have less than 1 person / day on average to see if it is high:

## Source: local data frame [1 x 1]
## 
##    avgfare
##      (dbl)
## 1 297.2354

Surprisingly, it isn’t as high as I thought it would be, but does fall beyond the third quartile.

Next, I want to see a breakdown of fare by haul to see the fare distribution for short and medium flights:

## Source: local data frame [2 x 2]
## 
##     haul mean(fare)
##   (fctr)      (dbl)
## 1  short   195.8801
## 2 medium   251.7311

I can see from this graph that there are more medium flights than short flights in the dataset. It’s understable that the average fare for short flights would cost less than medium flights.

Looking at nsmiles, which is the one-way distance between airport pairs:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      67     619     962    1095    1476    2783

Mileage between airport pairs looks somewhat multimodal with heteroskedasticity or possibly a skewed normal distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

If I re-scale the x-axis (log10), then I can view the long tail better:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.100    7.609   27.280  187.000  148.200 7021.000

The mean is larger than the median in this right-skewed histogram.

Now, let’s see a breakdown of passengers for small and medium flights to see how the feature is affected by flight length:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Source: local data frame [2 x 2]
## 
##     haul mean(passengers)
##   (fctr)            (dbl)
## 1  short         212.5549
## 2 medium         181.5771

I find it interesting to see that short haul trips have more passengers on average than medium trips (e.g. coast-to-coast travel).

Next, looking at fare_lg, which is the largest fare carrier’s average fare for the quarter:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0   181.6   231.5   241.8   289.4  2567.0

The largest fares seem to closely match the average fares, which makes me think that very few low fares are sold in quantity.

So, now let’s look at fare_low, which are the lowest fare carrier’s average fare:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    12.0   167.4   212.9   223.4   265.7  2567.0

The distribution looks compressed, but similar to the large fare distribution, albeit with a lower mean and median.

Now, I’m curious about the difference between the average low and large fares:

## Source: local data frame [1 x 1]
## 
##   fare_diff
##       (dbl)
## 1  18.37674

In terms of percentage of the average fare:

## Source: local data frame [1 x 1]
## 
##   mean(fare_diff_percentage)
##                        (dbl)
## 1                 0.07012206

So, it looks like fares have a variance of 7% of the average fare between the average large and low fares.

Now, let me take a look at large_ms, which is the large fare carrier’s market share:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1000  0.5083  0.6772  0.6883  0.8965  1.0000

Seems like there is a sizable count of carriers with the largest fare that have 100% market share. I’d like to compare that to the rest:

## Source: local data frame [2 x 2]
## 
##   large_ms == 1      n
##           (lgl)  (int)
## 1         FALSE 147608
## 2          TRUE  12650
##            n
## 1 0.08569996

So, 8.5% of all of the fare data is from carriers that have 100% market share between two airport pairs.

What is the market share distribution for large fare carriers based on trip length?

The distribution for medium length trips closely resembles the overall distribution, which is indicative of a higher quantity of medium haul trips having taken place over the reporting period.

So, what percentage of trips are short haul trips overall?

## Source: local data frame [2 x 2]
## 
##   haul == "short"      n
##             (lgl)  (int)
## 1           FALSE 132369
## 2            TRUE  27889
##           n
## 1 0.2106913

I was expecting a much lower percentage than 21%, however, it does help round out the picture of the overall market. Knowing that 79% of airport pairs are medium haul distance (> 500 miles) is corroborated by the mean statistic of nsmiles.

Now, let me take a look at lf_ms, which is the low fare carrier’s market share:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2171  0.4894  0.5293  0.8762  1.0000

This last plot is interesting, since there is a spike at 10%, so I looked at the data to verify that it was good, which it is. The reason for this spike in the data is explained by the DOT:

Note that the “lowest fare carrier” is the carrier with the lowest average fare that has at least a 10 percent share of the traffic in the market, except for markets where only a single carrier has a 10 percent or greater share.

Let’s take a closer look at this plot and what the underlying data looks like:

## Source: local data frame [7,270 x 8]
## 
##     Year quarter nsmiles airport_1 airport_2 carrier_low  lf_ms fare_low
##    (dbl)   (dbl)   (int)    (fctr)    (fctr)      (fctr)  (dbl)    (dbl)
## 1   2001       3     696       ABE       MDW          CO 0.0741      179
## 2   2001       3     655       ABE       ORD          DL 0.0144      198
## 3   2001       3    1270       ABQ       ATL          TW 0.0834      143
## 4   2001       3     744       ABQ       IAH          AA 0.0286      117
## 5   2001       3     677       ABQ       LAX          HP 0.0853       97
## 6   2001       3     719       ABQ       MCI          CO 0.0185       99
## 7   2001       3    1122       ABQ       MDW          AA 0.0152      146
## 8   2001       3     889       ABQ       OAK          HP 0.0725      119
## 9   2001       3     610       ABQ       SAT          AA 0.0714      138
## 10  2001       3     718       ALB       MDW          CO 0.0220      109
## ..   ...     ...     ...       ...       ...         ...    ...      ...

From the above data it looks like ABQ (Albuquerque International Sunport) is dominated by a single large carrier.

Since we looked at distribution for large fare carriers based on trip length, we might as well look at the same for low fare carriers:

Nothing really stands out from this plot as significant to me. The distributions look roughly the same for both short haul and medium haul flights.

Finally, I’m curious about what percentage of carriers are both the largest and lowest fare carrier:

## Source: local data frame [1 x 1]
## 
##   n()/nrow(af)
##          (dbl)
## 1    0.5591921

So, 55.9% of carriers are both the largest and lowest fare carrier, which could mean many things. One possibility might be that a majority of carriers remain competitive by offering low fares in addition to being the largest fare carrier.

Univariate Analysis

What is the structure of your dataset?

There are 160,258 records in the dataset with 13 features (Year, quarter, nsmiles, airport_1, airport_2, passengers, fare, carrier_lg, large_ms, fare_lg, carrier_low, lf_ms, and fare_low). The variables airport_1, airport_2, carrier_lg, and carrier_low are unordered factor variables.

Other observations:

  • the average miles traveled is 1095
  • the average passengers per day is 187
  • the median passengers per day is 27
  • the average of average quarterly fares is $242 and the max is $2567
  • Delta is the most common carrier for the largest and lowest fares

What is/are the main feature(s) of interest in your dataset?

The main features of interest to me in the dataset are fare, nsmiles, and passengers. I’d like to figure out what variables influence the fare. My guess is that distance between airports is a factor and that the activity between airports may have some weight.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Market share may have an impact on the average fare. My guess would be that areas that have high market share have less competition and may have higher fares.

Did you create any new variables from existing variables in the dataset?

At the time of my univariate analysis I had not created any variables. However, after performing bivariate analysis I thought it would be good to create an additional categorical variable, haul, for segmenting short and medium trips. A medium trip is anything over 500 miles. I then went back and performed additional univariate analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Data cleaning was the bulk of the work during my investigation. I saw anomalies in market share, large and low fares, and erroneous data that needed to be marked as NA. Some variable naming changed across the years of the dataset and some fields were combined, so I had to fix those as well. I added the fixup code to the data loading functions. These changes were necessary in order to have consistent data to analyze.

Bivariate Plots Section

Let’s start by taking a look at the correlation matrix across all features:

##                   Year       quarter      nsmiles    passengers
## Year        1.00000000 -0.0387753123  0.127808813  0.0140409715
## quarter    -0.03877531  1.0000000000 -0.003134517  0.0007013818
## nsmiles     0.12780881 -0.0031345172  1.000000000 -0.0144414386
## passengers  0.01404097  0.0007013818 -0.014441439  1.0000000000
## fare        0.31893655 -0.0370997782  0.457826614 -0.1829057014
## large_ms   -0.02335850 -0.0056159138 -0.343139151 -0.0690767254
## fare_lg     0.29604732 -0.0357017595  0.431890715 -0.1551365111
## lf_ms       0.08091732 -0.0161899485 -0.236037321 -0.0919099302
## fare_low    0.32965613 -0.0215216567  0.376823089 -0.1788191433
##                   fare     large_ms     fare_lg       lf_ms     fare_low
## Year        0.31893655 -0.023358503  0.29604732  0.08091732  0.329656135
## quarter    -0.03709978 -0.005615914 -0.03570176 -0.01618995 -0.021521657
## nsmiles     0.45782661 -0.343139151  0.43189072 -0.23603732  0.376823089
## passengers -0.18290570 -0.069076725 -0.15513651 -0.09190993 -0.178819143
## fare        1.00000000 -0.140722104  0.96032828 -0.07386059  0.905739991
## large_ms   -0.14072210  1.000000000 -0.13615080  0.69085587  0.004520862
## fare_lg     0.96032828 -0.136150796  1.00000000 -0.11410075  0.873874571
## lf_ms      -0.07386059  0.690855870 -0.11410075  1.00000000  0.081461978
## fare_low    0.90573999  0.004520862  0.87387457  0.08146198  1.000000000

Based on the correlation matrix it seems that it’s worth looking into mileage, year, and fare further:

## 
##  Pearson's product-moment correlation
## 
## data:  af$nsmiles and af$fare
## t = 206.15, df = 160260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4539482 0.4616877
## sample estimates:
##       cor 
## 0.4578266
## 
##  Pearson's product-moment correlation
## 
## data:  af$Year and af$fare
## t = 134.71, df = 160260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3145317 0.3233276
## sample estimates:
##       cor 
## 0.3189366

As predicted, there is a strong positive correlation between mileage and fare, but passenger count has a negligible relationship to price. Year has a moderate positive correlation with fare.

Let me perform a plot matrix of Year, quarter, fare, nsmiles, passengers, carrier_lg, large_ms to get a sense of what the data looks like visually:

Based on the correlation values between nsmiles and fare, let’s take a look at that plot:

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

The trend line (red), shows a positive correlation between mileage and fare. In other words, fare increases as mileage increases, which makes sense intuitively. The blue line shows the smoothing of the point data and the green line shows the low fare carrier’s average fare (smoothed as well).

Let’s also take a look at fare as categorized by haul (trip length):

## Source: local data frame [2 x 4]
## 
##     haul mean(fare) median(fare) IQR(fare)
##   (fctr)      (dbl)        (dbl)     (dbl)
## 1  short   195.8801      186.556  100.1681
## 2 medium   251.7311      241.890  101.9789

Interestingly enough, the interquartile range (IQR) is roughly the same between short and medium haul trips.

Let’s take a look at Year and fare:

From the look of the fare data plotted it looks like there was an increase in air travel for 2011-2013, which has a darker region. There’s also a average trend upward (dotted line) of fare increasing over time.

Now, let me take a look at nsmiles and large_ms:

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

With the concentration of 100% market share carriers, it makes sense to separate out the 100% market share carriers and view a distribution:

So, 100% market share carriers seem to have bi-modal distribution primarily around 200 and 900 miles when looking at the density smoothing.

Continuing with a scatterplot of the rest of the large fare carriers not having 100% market share:

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

It is interesting to see that carriers with the largest fare lose market share as the distance increases. My guess is that with the correlation between distance and fare that there is more competition across longer flights because there is more revenue to be gained.

Now, let’s do the same with low fare carriers by looking at nsmiles and lf_ms:

The low fare 100% market share carriers also seem to have bi-modal distribution around 200 and 900 miles.

Continuing with a scatterplot of the rest of the low fare carriers not having 100% market share:

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

There is also a similar trend here where carriers lose market share as the flight distance increases, however, a little more significantly than large fare carriers.

Continuing with passengers, what’s the trend across all of the years in the dataset?

Looks like 2012 was the best year for average passengers traveling, a drop in 2013, and a return in 2014.

I wonder what we can see if we plot the fare variances across all the large fare carriers:

It’s a busy plot, however I can see that United Airlines (UA) has the largest median fare. Let me double check this is the case by computing median statistics of the large fare for all large fare carriers (sorted high-to-low):

## Source: local data frame [38 x 2]
## 
##    carrier_lg median_fare_lg
##        (fctr)          (dbl)
## 1          UA        275.080
## 2          VX        266.665
## 3          9K        261.735
## 4          DL        251.130
## 5          AA        238.210
## 6          US        233.260
## 7          RP        232.330
## 8          HP        231.270
## 9          CO        219.090
## 10         NW        218.700
## ..        ...            ...

Might as well view the plot of the top ten large carriers (based on median fare):

Let me do the same for low fare carriers (skipping the large plot):

Cape Air (9K) has the largest fare among the low fare carriers. I’ll double check by similarly computing and sorting the median statistic for low fares across all low fare carriers:

## Source: local data frame [48 x 2]
## 
##    carrier_low median_fare_low
##         (fctr)           (dbl)
## 1           9K         253.240
## 2           UA         244.050
## 3           WS         239.930
## 4           RP         228.030
## 5           AA         227.290
## 6           DL         225.460
## 7           US         225.150
## 8           VX         209.785
## 9           AS         206.194
## 10          HP         203.910
## ..         ...             ...

After taking a look at the airlines, I’m curious what I might find if I plot the fares across the most traveled airports:

The dotted line is the overall median of fare across all airport pairs. It’s interesting to see the average fares among the most traveled airports and see which ones are higher or lower than the median fare among the group.

So, which of the airports in the plot above have a median fare above the overall median fare?

## Source: local data frame [6 x 3]
## 
##   larger_ap median_fare      diff
##       (chr)       (dbl)     (dbl)
## 1       LAX    275.4786 42.290852
## 2       EWR    261.3781 28.190415
## 3       VPS    244.9723 11.784548
## 4       DCA    242.0185  8.830777
## 5       XNA    238.5907  5.402982
## 6       IAH    237.1551  3.967411

Well, it looks like if you’re in any of the cities above that you’re paying more than the median fare across all airports, but it also depends on where you’re going.Also, it’s worth mentioning that these airports serve plenty of markets, which could be a factor.

Finally, let’s take a first step towards building a linear model based on fare and nsmiles:

## 
## Call:
## lm(formula = fare ~ nsmiles, data = af)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -252.76  -50.20   -9.17   38.68 2232.60 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.741e+02  3.798e-01   458.4   <2e-16 ***
## nsmiles     6.200e-02  3.008e-04   206.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.7 on 160256 degrees of freedom
## Multiple R-squared:  0.2096, Adjusted R-squared:  0.2096 
## F-statistic: 4.25e+04 on 1 and 160256 DF,  p-value: < 2.2e-16

Looks like we can account for 20.96% of the fare variance with this model.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Fare correlates strongly with the distance that is being traveled. There has also been a slow trend towards increasing prices since 2005, but it might be good to compare this to inflation to see if actual costs have increased. Passengers had a negligible relationship to fare.

Based on the \(R^2\) value, distance accounts for 20.96% of the variance in fare. Other features can be incorporated into the model to better explain the variance in price, which I will build on in multivariate analysis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, market share seems to drop when distance traveled increases. I think that is because there is more competition among longer distance flights, since there is more revenue to be gained.

What was the strongest relationship you found?

The strongest relationship I found was between fare and distance.

Multivariate Plots Section

To ease into our multivariate analysis, let’s first take a look at the ratio of fare to nsmiles (bivariate), which could be considered fare value or cost per mile for fares (split by trip length):

## Source: local data frame [1 x 1]
## 
##   short_haul_value
##              (dbl)
## 1        0.6775934
## Source: local data frame [1 x 1]
## 
##   med_haul_value
##            (dbl)
## 1      0.2301141

If you’re traveling under 500 miles, then you can expect to pay $0.68 per mile on average. If you’re traveling over 500 miles, then you can can expect to pay roughly $0.23 per mile on average.

Now, I want to see a better breakdown of fare value per distance traveled:

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

## 
## Formula: fare/nsmiles ~ A/sqrt(nsmiles)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## A   9.6914     0.0125   775.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1875 on 160257 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 6.06e-11

As you can see there is a lot of variance in cost per mile for travel under 500 miles. However, as a consumer if you travel larger distances you will eventually enjoy much better costs per mile.

In the plot above, the individual data points (colored in red or blueish-green) are overlayed with a smoothed conditional mean of the data in blue. The green and yellow lines are piece-wise linear fits for short and medium haul trips. Finally, the black line is a rough approximation of a non-linear fit that is essentially an inverse square root function.

With the above plot of fare value value being non-linear, it makes me want to revisit fare vs miles traveled (bivariate analysis) with different scales:

I now have a much better linear fit here. Let’s revisit the linear model and see if it improves:

## 
## Call:
## lm(formula = I(log10(fare)) ~ I(sqrt(nsmiles)), data = af)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14930 -0.08221  0.00236  0.08421  1.08870 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.119e+00  1.148e-03  1846.2   <2e-16 ***
## I(sqrt(nsmiles)) 7.545e-03  3.468e-05   217.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.133 on 160256 degrees of freedom
## Multiple R-squared:  0.2279, Adjusted R-squared:  0.2279 
## F-statistic: 4.731e+04 on 1 and 160256 DF,  p-value: < 2.2e-16

\(R^2\) went up by 1.83% with this change, so we can now better account for the relationship between fares and distance between airports.

Continuing with the concept of fare value, let’s take a look at the fare value across the largest airports (by activity):

## Source: local data frame [10 x 2]
## 
##    larger_ap median_value
##        (chr)        (dbl)
## 1        LAX     5.768249
## 2        TPA     5.688935
## 3        JFK     4.678094
## 4        LGA     4.536045
## 5        IAH     4.080295
## 6        MDW     4.064503
## 7        EWR     4.023544
## 8        DCA     3.921803
## 9        DFW     3.913508
## 10       ORD     3.585313

Here, we can see that LAX (Los Angeles Airport) fares enjoy the most bang for the buck (i.e. miles per dollar) among the larger airports. ORD (Chicago Airport) is among the weakest in terms of miles per dollar. You could consider this if you were moving to either of these two cities and planning to travel quite a bit.

I now want to see what the fare value is across large fare carriers:

While this is a fairly busy plot, there is one airline that sticks out among the rest: National Airlines (N7). Let’s look at some of the data for the airline:

## Source: local data frame [12 x 14]
## 
##     Year quarter nsmiles airport_1 airport_2 passengers     fare
##    (dbl)   (dbl)   (int)    (fctr)    (fctr)      (dbl)    (dbl)
## 1   2002       1    1515       LAS       ORD   2074.440 142.7000
## 2   2002       2    1515       LAS       ORD   2441.640 139.4100
## 3   2002       3    2066       IAD       LAS    320.970 209.5000
## 4   2002       3    1515       LAS       ORD   2437.500 127.9900
## 5   2001       2    2248       JFK       LAS   2223.626 169.5230
## 6   2001       2    1521       LAS       MDW   1936.044 127.0379
## 7   2001       3    1055       DFW       LAS   1474.000 138.0000
## 8   2001       3    2248       JFK       LAS   2227.000 166.0000
## 9   2001       3    1521       LAS       MDW   1628.000 123.0000
## 10  2001       3    2176       LAS       MIA    684.000 158.0000
## 11  2001       3    2177       LAS       PHL   1297.000 156.0000
## 12  2001       3     414       LAS       SFO   2340.000  77.0000
## Variables not shown: carrier_lg (fctr), large_ms (dbl), fare_lg (dbl),
##   carrier_low (fctr), lf_ms (dbl), fare_low (dbl), haul (fctr)

National Airlines gave the best mileage per dollar fares between 2001-2002, but it might explain why they are no longer around, too.

Let’s zoom in on the fare value among the most popular carriers:

## Source: local data frame [7 x 2]
## 
##   carrier_lg median_value
##       (fctr)        (dbl)
## 1         VX     7.246457
## 2         B6     6.754951
## 3         WN     5.078585
## 4         AA     4.472823
## 5         UA     4.424600
## 6         DL     3.969786
## 7         US     3.059363

Historically, it looks like Virgin America and JetBlue Airways are the two airlines that provide the best value for mileage per dollar. If you’re traveling long distances across country, it’s likely you’ll find good fares with those two airlines.

Let’s take a look at the fare value over the years (bivariate):

Over the years we have seen an increasing cost per mile traveled, but currently we’re experience a downward trend in cost per mile.

So, revisiting our linear model of log10(fare) ~ sqrt(nsmiles) + additional variables, let’s take a look at residuals:

## R-squared for model: 0.556
## [1] 0.1008139
## Don't know how to automatically pick scale for object of type AsIs. Defaulting to continuous

The individual residual data is plotted as points. The blue line is the standard deviation of all residuals, which is 0.1. The black line plot is the standard deviation at each reference mileage (or rather sqrt of mileage).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

With an understanding of the strong relationship between distance and fare it made sense to me to look at other variables from that perspective. After seeing the relationship of fare value (cost per mile) to distance I thought to improve on the linear model that I had started in bivariate analysis.

Were there any interesting or surprising interactions between features?

The cost per mile graph was surprising in that the cost can go from $1 / mi at around a 250 mile distance to less than $0.40 / mi at around a 500 mile distance. The additional analysis into worst and best value airports and carriers was interesting to me.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model using the log10 of fare and the square root of nsmiles. Other variables included are Year and airport pair (i.e. airport_1 and airport_2).

The variables of the linear model account for 55.6% of the variance in airline fares.


Final Plots and Summary

Plot One

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: position_dodge requires constant width: output may be incorrect

Description One

One interesting aspect of travel in general is the cost per mile for a transport mode. The cost per mile for air travel has fluctated over time and was the worst for a 14 year span (2001-2014) in 2008. Currently, we are experiencing a downward trend in cost per mile traveled as of 2014 Q4 with an inflection point in 2013. The blue line depicts the average cost per mile on a quarterly interval time series, which you can see fluctuates from quarter to quarter. The orange line is a smoothing of the same data to show trends more easily. The boxplot underneath shows the variance of cost per mile segmented yearly with the box representing the interquartile range (IQR) and the median by the horizontal bar within the plot. The median cost per mile was also the highest in 2008.

Plot Two

Description Two

Mileage per dollar spent can be a metric for value. If you consider the mileage per dollar for airports, then you might have a sense of which airports offer good value for flying to or from, especially if you have a business where travel is frequent and you would like to minimize costs. If you consider the mileage per dollar for popular airlines, then you might have a sense of which airlines generally are going to have the best prices and ultimately, which frequent flyer programs or credit cards to join / acquire.

The plot of mileage per dollar across the ten largest airports (by activity) shows that Los Angeles International Airport (LAX) and Tampa International Airport (TPA) have the best median mileage per dollar spent. O’Hare International Airport (ORD) has the worst median mileage per dollar.

The plot of mileage per dollar across the most popular airlines shows that Virgin America and JetBlue Airways provide the best value in terms of how far you can travel per dollar. US Airways (now owned by American Airlines) is the worst in terms of mileage per dollar.

Plot Three

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Description Three

Another way of considering value with air travel is how much the cost per mile is for a given trip length. If you consider the value based on traveled miles, then you might travel farther to get the most travel for your money.

The plot of fare value (i.e. cost per mile) with one-way distance between airports roughly follows an inverse square root function. The cost per mile has high variance for travel up to 500 miles as you can see values can vary (roughly) between $1.25 (and beyond) to $0.38 after smoothing. For travel above 500 miles, the variance roughly sits between $0.38 and $0.12 after smoothing. It’s interesting to see that the distinction between short and medium haul trips by the air travel industry at 500 miles matches a “corner turn” in terms of fare value as well.


Reflection

The DOT Domestic Airline Consumer Airfare Report has quarterly airfare data for 1996-2014. Unfortunately, only 2001-2014 had complete and usable data, so that is what was used by me to amass the 160,258 airport pair records. I started by simply cleaning and aggregating the data. Then, I explored single, double, and multiple variables in various ways to better understand the data. I was surprised that passenger activity did not have much weight on the predictive model of fare, but indirectly that may have been captured in the airport pair. The only strong positive correlation with fare was distance between airports, which was the basis of the linear model created. A secondary influence was Year, since there has been a general trend of increasing prices over time. The linear model improved once I realized that converting a non-linear relationship to a linear relationship (log10(fare) ~ sqrt(nsmiles) would better account for variance. Some limitations with this model are that it is averaged data instead of individual fares and that there are missing factors, such as how many days before the flight that the fare was purchased. If I had powerful computing at my disposal I would want to perform a similar analysis to this one on individual fare data.

Some interesting observations that I made while performing this analysis: