Overview

This document guides you through a set of data analysis tasks, with a focus on fitting and tuning tree-based models. We’ll also revisit some earlier topics along the way.

While this is in some sense a stand-alone analysis, I assume that you have worked through the Data Analysis exercise and are familiar with the dataset and everything we discovered during the cleaning process. We’ll use the same dataset here but focus on a different outcome. Other than that, you work through this exercise the same way as the Data Analysis exercise, namely by writing/completing the missing code.

Project setup

We need a variety of packages, which are loaded here. Install them as needed. For this analysis, we’ll again use the caret package. If you use other packages, load them here as well.
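As a sketch, the setup chunk might look like this; the exact package list depends on the models you end up using:

```r
# Load packages; install.packages() any that are missing.
library(tidyverse)    # data wrangling and plotting
library(caret)        # model training and tuning wrapper
library(rpart)        # single classification trees
library(rpart.plot)   # nicer tree plots (prp)
library(ranger)       # fast random forests
library(gbm)          # boosted trees
library(mda)          # penalized discriminant analysis (pda)
library(doParallel)   # parallel processing
```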

Data loading and cleaning

We will again use the Norovirus dataset.
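The loading step is a one-liner; the file name norodata.csv is taken from the parsing messages below:

```r
# Read the raw data; read_csv() prints the column specification and
# reports the two parsing failures shown below.
norodata_raw <- readr::read_csv("norodata.csv")
```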

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Author = col_character(),
##   EpiCurve = col_character(),
##   TDComment = col_character(),
##   AHComment = col_character(),
##   Trans1 = col_character(),
##   Trans2 = col_character(),
##   Trans2_O = col_character(),
##   Trans3 = col_character(),
##   Trans3_O = col_character(),
##   Vehicle_1 = col_character(),
##   Veh1 = col_character(),
##   Veh1_D_1 = col_character(),
##   Veh2 = col_character(),
##   Veh2_D_1 = col_character(),
##   Veh3 = col_character(),
##   Veh3_D_1 = col_character(),
##   PCRSect = col_character(),
##   OBYear = col_character(),
##   Hemisphere = col_character(),
##   season = col_character()
##   # ... with 44 more columns
## )
## See spec(...) for full column specifications.
## Warning: 2 parsing failures.
##  row col expected     actual           file
## 1022 CD  a double GGIIb      'norodata.csv'
## 1022 gge a double Sindlesham 'norodata.csv'

Looking at the outcome

For this analysis, we consider the season of the outbreak as our main outcome of interest. This is a categorical outcome with more than 2 categories, something we haven’t looked at before. Because there are more than 2 categories, a basic logistic model won’t work. Fortunately, tree-based models can handle multiple categories.

##  chr [1:1022] "Fall" "Fall" "Fall" "Fall" "Fall" "Fall" "Fall" "Fall" ...
## Warning: Ignoring unknown parameters: binwidth, bins, pad

We already knew from previous explorations that some entries do not have a season. We could either code them as “other” and keep them in the model, or remove them. Since it’s hard to see any scientific reason why an “other” season should correlate with some variable, we’ll remove those observations here.
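A minimal sketch of that step, assuming the entries without a season are coded as NA (adjust the filter if they are coded differently):

```r
# Keep only outbreaks with a recorded season.
d <- norodata_raw %>% dplyr::filter(!is.na(season))
```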

We also notice that while winter is dominant (which makes sense; we know norovirus is more common in winter), we have a decent number of outbreaks in each season, so we shouldn’t have a problem with (un)balanced data.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Selecting predictors

We will pick similar variables as previously, but with some changes. Keep the following variables: Action1, CasesAll, Country, Deaths, EndMonth, GG2C4, Hemisphere, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, OBYear, Path1, RateAll, RiskAll, Season, Setting, StartMonth, State, Trans1, Vomit.
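In code, using the spellings found in the raw data (gg2c4, season, Setting_1):

```r
# Keep only the variables listed above.
d <- d %>%
  dplyr::select(Action1, CasesAll, Country, Deaths, EndMonth, gg2c4,
                Hemisphere, Hospitalizations, MeanD1, MeanI1, MedianD1,
                MedianI1, OBYear, Path1, RateAll, RiskAll, season,
                Setting_1, StartMonth, State, Trans1, Vomit)
```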

Cleaning predictors

We’ll have to perform the usual cleaning steps. You might have realized by now that even for the same dataset, cleaning steps can differ based on the outcome (and as you see below, the model).

Let’s first check for missing values.

##  chr [1:955] "Yes" NA "Yes" NA "Yes" NA NA NA "Yes" NA NA NA NA NA NA ...

It looks like none of the new variables we included has a ton of missing values, so we would probably be ok just removing any observation with missing data. However, tree-based models can deal with missing data in predictors, so we’ll keep those observations for now. We’ll later compare how the model does or does not change if we remove them.

Let’s make sure everything has the right format (numeric/integer/factor). Adjust/recode variables as needed. You will likely find that as you convert OBYear to numeric, something doesn’t quite work. Take a look. Fix it by removing the observation with the troublesome entry, then convert to numeric. Finally, remove the observations that have 0 as OBYear; there is more than one now.
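A sketch of the OBYear fix; the troublesome “2002-2007” entry shows up in the output below:

```r
unique(d$OBYear)                            # inspect; output shown below
d <- d %>%
  dplyr::filter(OBYear != "2002-2007") %>%  # drop the troublesome entry
  dplyr::mutate(OBYear = as.numeric(OBYear)) %>%
  dplyr::filter(OBYear != 0)                # drop the 0-year observations
```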

##  [1] "1999"      "1998"      "2006"      "2004"      "1993"     
##  [6] "2002"      "2005"      "1997"      "2003"      "1994"     
## [11] "2008"      "2000"      "2001"      "1995"      "1996"     
## [16] "2007"      "2009"      "1990"      "0"         "1983"     
## [21] "2010"      "1992"      "2002-2007"

Look at the data to see what else we need to do.

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 953 obs. of  22 variables:
##  $ Action1         : chr  "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
##  $ CasesAll        : num  15 65 27 4 15 6 40 10 116 45 ...
##  $ Country         : chr  "Japan" "USA" "Other" "Other" ...
##  $ Deaths          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EndMonth        : num  12 9 0 0 0 0 0 0 11 11 ...
##  $ gg2c4           : chr  "Yes" "No" "Yes" "No" ...
##  $ Hemisphere      : chr  "Northern" "Northern" "Northern" "Northern" ...
##  $ Hospitalizations: num  0 0 0 0 0 0 0 0 5 10 ...
##  $ MeanD1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MeanI1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MedianD1        : num  0 36 0 0 0 0 0 0 0 48 ...
##  $ MedianI1        : num  0 37 0 0 0 0 0 0 0 31 ...
##  $ OBYear          : num  1999 1998 2006 2006 2006 ...
##  $ Path1           : chr  "No" "No" "Unspecified" "Unspecified" ...
##  $ RateAll         : num  0 39.8 20.8 100 60 ...
##  $ RiskAll         : num  0 108 130 4 25 ...
##  $ season          : chr  "Fall" "Fall" "Fall" "Fall" ...
##  $ Setting_1       : chr  "Daycare Center" "Boxed lunch, football game" "buffet" "restaurant" ...
##  $ StartMonth      : num  11 9 9 10 11 11 11 11 11 11 ...
##  $ State           : chr  "0" "NC, FL" "0" "0" ...
##  $ Trans1          : chr  "Unspecified" "Foodborne" "Foodborne" "Foodborne" ...
##  $ Vomit           : num  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  5 variables:
##   ..$ row     : int  1022 1022
##   ..$ col     : chr  "CD" "gge"
##   ..$ expected: chr  "a double" "a double"
##   ..$ actual  : chr  "GGIIb" "Sindlesham"
##   ..$ file    : chr  "'norodata.csv'" "'norodata.csv'"
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   Author = col_character(),
##   ..   Pub_Year = col_double(),
##   ..   pubmedid = col_double(),
##   ..   EpiCurve = col_character(),
##   ..   TDComment = col_character(),
##   ..   AHComment = col_character(),
##   ..   Trans1 = col_character(),
##   ..   Trans1_O = col_double(),
##   ..   Trans2 = col_character(),
##   ..   Trans2_O = col_character(),
##   ..   Trans3 = col_character(),
##   ..   Trans3_O = col_character(),
##   ..   Risk1 = col_double(),
##   ..   Risk2 = col_double(),
##   ..   RiskAll = col_double(),
##   ..   Cases1 = col_double(),
##   ..   Cases2 = col_double(),
##   ..   CasesAll = col_double(),
##   ..   Rate1 = col_double(),
##   ..   Rate2 = col_double(),
##   ..   RateAll = col_double(),
##   ..   Hospitalizations = col_double(),
##   ..   Deaths = col_double(),
##   ..   Vehicle_1 = col_character(),
##   ..   Veh1 = col_character(),
##   ..   Veh1_D_1 = col_character(),
##   ..   Veh2 = col_character(),
##   ..   Veh2_D_1 = col_character(),
##   ..   Veh3 = col_character(),
##   ..   Veh3_D_1 = col_character(),
##   ..   PCRSect = col_character(),
##   ..   OBYear = col_character(),
##   ..   Hemisphere = col_character(),
##   ..   season = col_character(),
##   ..   MeanI1 = col_double(),
##   ..   MedianI1 = col_double(),
##   ..   Range_S_I1 = col_double(),
##   ..   Range_L_I1 = col_double(),
##   ..   MeanD1 = col_double(),
##   ..   MedianD1 = col_double(),
##   ..   Range_S_D1 = col_double(),
##   ..   Range_L_D1 = col_double(),
##   ..   MeanA1 = col_double(),
##   ..   MedianA1 = col_double(),
##   ..   Range_Y_A1 = col_character(),
##   ..   Range_O_A1 = col_double(),
##   ..   Action1 = col_character(),
##   ..   Action2_1 = col_character(),
##   ..   Secondary = col_character(),
##   ..   MeanI2 = col_double(),
##   ..   MedianI2 = col_double(),
##   ..   Range_S_I2 = col_double(),
##   ..   Range_L_I2 = col_double(),
##   ..   MeanD2 = col_double(),
##   ..   MedianD2 = col_double(),
##   ..   Range_S_D2 = col_double(),
##   ..   Range_L_D2 = col_double(),
##   ..   `Mea 2` = col_double(),
##   ..   `Media 2` = col_double(),
##   ..   Range_Y_A2 = col_double(),
##   ..   Range_O_A2 = col_double(),
##   ..   Comments_1 = col_character(),
##   ..   Path1 = col_character(),
##   ..   Path2_1 = col_character(),
##   ..   Country = col_character(),
##   ..   Category = col_character(),
##   ..   State = col_character(),
##   ..   Setting_1 = col_character(),
##   ..   StartMonth = col_double(),
##   ..   EndMonth = col_double(),
##   ..   GGA = col_double(),
##   ..   CA = col_double(),
##   ..   SA = col_character(),
##   ..   new_GGA = col_double(),
##   ..   new_CA = col_double(),
##   ..   new_SA = col_character(),
##   ..   SA_resolved_from = col_character(),
##   ..   GGB = col_double(),
##   ..   CB = col_character(),
##   ..   SB = col_character(),
##   ..   new_GGB = col_double(),
##   ..   new_CB = col_double(),
##   ..   new_SB = col_character(),
##   ..   SB_resolved_from = col_character(),
##   ..   GGC = col_double(),
##   ..   CC = col_double(),
##   ..   SC = col_character(),
##   ..   new_ggc = col_double(),
##   ..   new_cc = col_double(),
##   ..   new_sc = col_character(),
##   ..   SC_resolved_from = col_character(),
##   ..   GGD = col_double(),
##   ..   CD = col_double(),
##   ..   SD = col_character(),
##   ..   new_ggd = col_double(),
##   ..   new_cd = col_double(),
##   ..   new_sd = col_double(),
##   ..   SD_resolved_from = col_logical(),
##   ..   StrainOther = col_character(),
##   ..   strainother_rc = col_character(),
##   ..   gge = col_double(),
##   ..   ce = col_double(),
##   ..   se = col_character(),
##   ..   SE_resolved_from = col_character(),
##   ..   ggf = col_double(),
##   ..   cf = col_double(),
##   ..   sf = col_character(),
##   ..   ggg = col_double(),
##   ..   cg = col_double(),
##   ..   sg = col_character(),
##   ..   ggh = col_double(),
##   ..   ch = col_double(),
##   ..   sh = col_character(),
##   ..   ggi = col_double(),
##   ..   ci = col_double(),
##   ..   si = col_character(),
##   ..   ggj = col_double(),
##   ..   cj = col_double(),
##   ..   sj = col_character(),
##   ..   Country2 = col_character(),
##   ..   Veh1_D_2 = col_character(),
##   ..   Veh2_D_2 = col_character(),
##   ..   Veh3_D_2 = col_character(),
##   ..   Action2_2 = col_character(),
##   ..   Comments_2 = col_character(),
##   ..   Path2_2 = col_character(),
##   ..   Setting_2 = col_character(),
##   ..   category1 = col_character(),
##   ..   strainothergg2c4 = col_double(),
##   ..   gg2c4 = col_character(),
##   ..   Vomit = col_double(),
##   ..   IncInd = col_double(),
##   ..   SymInd = col_double(),
##   ..   PooledLat = col_double(),
##   ..   PooledSym = col_double(),
##   ..   PooledAge = col_double(),
##   ..   IndividualLatent = col_logical(),
##   ..   IndividualSymptomatic = col_character()
##   .. )
##    Action1             CasesAll       Country              Deaths       
##  Length:953         Min.   :    1   Length:953         Min.   :0.00000  
##  Class :character   1st Qu.:    9   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median :   25   Mode  :character   Median :0.00000  
##                     Mean   :  129                      Mean   :0.05379  
##                     3rd Qu.:   64                      3rd Qu.:0.00000  
##                     Max.   :32150                      Max.   :9.00000  
##                     NA's   :5                          NA's   :42       
##     EndMonth         gg2c4            Hemisphere        Hospitalizations  
##  Min.   : 0.000   Length:953         Length:953         Min.   :  0.0000  
##  1st Qu.: 0.000   Class :character   Class :character   1st Qu.:  0.0000  
##  Median : 0.000   Mode  :character   Mode  :character   Median :  0.0000  
##  Mean   : 2.559                                         Mean   :  0.7113  
##  3rd Qu.: 4.000                                         3rd Qu.:  0.0000  
##  Max.   :12.000                                         Max.   :125.0000  
##                                                         NA's   :42        
##      MeanD1            MeanI1           MedianD1          MedianI1     
##  Min.   :  0.000   Min.   : 0.0000   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:  0.000   1st Qu.: 0.000  
##  Median :  0.000   Median : 0.0000   Median :  0.000   Median : 0.000  
##  Mean   :  1.558   Mean   : 0.7125   Mean   :  2.611   Mean   : 1.703  
##  3rd Qu.:  0.000   3rd Qu.: 0.0000   3rd Qu.:  0.000   3rd Qu.: 0.000  
##  Max.   :273.600   Max.   :48.0000   Max.   :235.200   Max.   :65.000  
##                                                                        
##      OBYear        Path1              RateAll          RiskAll       
##  Min.   :1983   Length:953         Min.   :  0.00   Min.   :    0.0  
##  1st Qu.:2000   Class :character   1st Qu.:  0.00   1st Qu.:    0.0  
##  Median :2003   Mode  :character   Median : 16.50   Median :   20.5  
##  Mean   :2002                      Mean   : 27.04   Mean   :  399.7  
##  3rd Qu.:2005                      3rd Qu.: 49.00   3rd Qu.:  110.8  
##  Max.   :2010                      Max.   :105.00   Max.   :35000.0  
##                                    NA's   :103      NA's   :115      
##     season           Setting_1           StartMonth        State          
##  Length:953         Length:953         Min.   : 0.000   Length:953        
##  Class :character   Class :character   1st Qu.: 2.000   Class :character  
##  Mode  :character   Mode  :character   Median : 5.000   Mode  :character  
##                                        Mean   : 5.894                     
##                                        3rd Qu.:10.000                     
##                                        Max.   :12.000                     
##                                                                           
##     Trans1              Vomit       
##  Length:953         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :1.0000  
##                     Mean   :0.5074  
##                     3rd Qu.:1.0000  
##                     Max.   :1.0000  
##                     NA's   :1
##  [1] "0"                                                                 
##  [2] "NC, FL"                                                            
##  [3] "FL"                                                                
##  [4] "LA, MD, MS, NC"                                                    
##  [5] "AK"                                                                
##  [6] "WI"                                                                
##  [7] "WA, FL"                                                            
##  [8] "TX"                                                                
##  [9] "WV, MD, FL, NY, PA, VA"                                            
## [10] "MD, TX, NC, PA, MS"                                                
## [11] "NC"                                                                
## [12] "LA, MS, MD, NC"                                                    
## [13] "OR"                                                                
## [14] "VA"                                                                
## [15] "WY"                                                                
## [16] "GA"                                                                
## [17] "UT"                                                                
## [18] "PA"                                                                
## [19] "CA"                                                                
## [20] "MI"                                                                
## [21] "1"                                                                 
## [22] "NY"                                                                
## [23] "CO"                                                                
## [24] "MA"                                                                
## [25] "HI"                                                                
## [26] "AZ"                                                                
## [27] "MD"                                                                
## [28] "KY"                                                                
## [29] "OH"                                                                
## [30] "14 states: CA, UT, KS, WI, IL, IN, OH, GA, FL, NC, VA, WV, NY, PA,"
## [31] "LA"                                                                
## [32] "SC"                                                                
## [33] "VT"                                                                
## [34] "DC"                                                                
## [35] "WA"                                                                
## [36] "WV"

Some issues we noted previously: we need to remove the Unspecified entry in Hemisphere and recode Action1 and Path1 as described in the Data Analysis exercise, i.e., from Unknown to Unspecified. Also, we want to group the Setting_1 variable into just Restaurant and Other. Again, remember that there are both restaurant and Restaurant values, so you need to fix that too.

We’ll also recode the blank gg2c4 entries to “No”. We further note that Action1 has a single No entry; let’s remove that observation to prevent potential problems during cross-validation.

Let’s also lump Country into 3 categories: Japan, USA, and Other.

As discussed previously, it makes sense to merge the Waterborne category into Environmental in the Trans1 variable. It also turns out that most outbreaks have no information for State, so it’s best to drop the State variable.

Finally, move the outcome into the first column.
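Here is one way all of these steps could be implemented. The specific helpers (recode, fct_other, grepl) are my choices, not necessarily the original code; the level checks printed below come from the exercise itself:

```r
d <- d %>%
  dplyr::filter(Hemisphere != "Unspecified") %>%
  dplyr::mutate(
    Action1 = dplyr::recode(Action1, Unknown = "Unspecified"),
    Path1   = dplyr::recode(Path1, Unknown = "Unspecified"),
    gg2c4   = dplyr::if_else(is.na(gg2c4) | gg2c4 == "", "No", gg2c4),
    # grepl("estaurant", ...) catches both "restaurant" and "Restaurant"
    Setting = dplyr::if_else(grepl("estaurant", Setting_1), "Restaurant", "Other"),
    Country = forcats::fct_other(factor(Country), keep = c("Japan", "USA")),
    Trans1  = dplyr::recode(Trans1, Waterborne = "Environmental")
  ) %>%
  dplyr::filter(Action1 != "No") %>%        # drop the single "No" entry
  dplyr::select(-State, -Setting_1) %>%     # State is mostly uninformative
  dplyr::mutate(dplyr::across(where(is.character), factor)) %>%
  dplyr::relocate(season)                   # outcome into the first column
```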

## [1] "Northern" "Southern"
## [1] "Northern" "Southern"
## [1] "Unspecified" "Yes"
## [1] "No"          "Unknown"     "Unspecified" "Yes"
## [1] "No"          "Unspecified" "Yes"
##  Factor w/ 2 levels "Other","Restaurant": 1 1 1 2 1 2 1 2 1 1 ...
##  [1] "Australia"   "Ca da"       "Chi"         "Croatia"     "Denmark"    
##  [6] "France"      "Iraq"        "Israel"      "Italy"       "Japan"      
## [11] "Multiple"    "New Zealand" "Norway"      "Other"       "Scotland"   
## [16] "Spain"       "UK"          "Unspecified" "USA"
## [1] "Environmental"    "Foodborne"        "Person to Person"
## [4] "Unknown"          "Unspecified"      "Waterborne"
## [1] "Environmental"    "Foodborne"        "Person to Person"
## [4] "Unspecified"
## 
##                                                                  0 
##                                                                841 
##                                                                  1 
##                                                                  1 
## 14 states: CA, UT, KS, WI, IL, IN, OH, GA, FL, NC, VA, WV, NY, PA, 
##                                                                  1 
##                                                                 AK 
##                                                                  9 
##                                                                 AZ 
##                                                                  2 
##                                                                 CA 
##                                                                  2 
##                                                                 CO 
##                                                                  1 
##                                                                 DC 
##                                                                  1 
##                                                                 FL 
##                                                                  7 
##                                                                 GA 
##                                                                  2 
##                                                                 HI 
##                                                                  2 
##                                                                 KY 
##                                                                  1 
##                                                                 LA 
##                                                                  1 
##                                                     LA, MD, MS, NC 
##                                                                  1 
##                                                     LA, MS, MD, NC 
##                                                                  1 
##                                                                 MA 
##                                                                  3 
##                                                                 MD 
##                                                                  2 
##                                                 MD, TX, NC, PA, MS 
##                                                                  1 
##                                                                 MI 
##                                                                  6 
##                                                                 NC 
##                                                                 18 
##                                                             NC, FL 
##                                                                  1 
##                                                                 NY 
##                                                                 10 
##                                                                 OH 
##                                                                  3 
##                                                                 OR 
##                                                                  4 
##                                                                 PA 
##                                                                  3 
##                                                                 SC 
##                                                                  1 
##                                                                 TX 
##                                                                  5 
##                                                                 UT 
##                                                                  1 
##                                                                 VA 
##                                                                  6 
##                                                                 VT 
##                                                                  2 
##                                                                 WA 
##                                                                  1 
##                                                             WA, FL 
##                                                                  1 
##                                                                 WI 
##                                                                  5 
##                                                                 WV 
##                                                                  1 
##                                             WV, MD, FL, NY, PA, VA 
##                                                                  1 
##                                                                 WY 
##                                                                  4

Data visualization

We know our data fairly well by now, so we might not discover much new in any plots, but it’s always good to do them anyway. Let’s create a few plots showing the outcome and the predictors. We’ll start with the continuous predictors. I suggest scatter/box/violin plots with the outcome on the x-axis.
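One way to make these plots, assuming the cleaned data frame is called d:

```r
# Violin plots of each continuous predictor, outcome on the x-axis.
d %>%
  dplyr::select(season, where(is.numeric)) %>%
  tidyr::gather(key = "variable", value = "value", -season) %>%
  ggplot2::ggplot(ggplot2::aes(x = season, y = value)) +
  ggplot2::geom_violin() +
  ggplot2::facet_wrap(~ variable, scales = "free_y")
```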

## Warning: Removed 305 rows containing non-finite values (stat_ydensity).

Things look ok, apart from the skew in the predictors we discussed previously.

Next, let’s create plots for the categorical variables. You can use, for instance, geom_count for it, or some other representation. If you prefer lots of tables, that’s ok too.
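A sketch using geom_count; gather() drops the differing factor attributes, which is what triggers the warning below:

```r
# Counts of each categorical predictor level by season.
d %>%
  dplyr::select(season, where(is.factor)) %>%
  tidyr::gather(key = "variable", value = "value", -season) %>%
  ggplot2::ggplot(ggplot2::aes(x = value, y = season)) +
  ggplot2::geom_count() +
  ggplot2::facet_wrap(~ variable, scales = "free_x")
```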

## Warning: attributes are not identical across measure variables;
## they will be dropped

Looks ok. Notice the NA for vomiting. We are ok with that for now.

## [1] 952  21
## Classes 'tbl_df', 'tbl' and 'data.frame':    952 obs. of  21 variables:
##  $ season          : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Action1         : Factor w/ 2 levels "Unspecified",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ CasesAll        : num  15 65 27 4 15 6 40 10 116 45 ...
##  $ Country         : Factor w/ 3 levels "Other","Japan",..: 2 3 1 1 1 1 1 1 1 3 ...
##  $ Deaths          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EndMonth        : int  12 9 0 0 0 0 0 0 11 11 ...
##  $ gg2c4           : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 1 1 2 1 ...
##  $ Hemisphere      : Factor w/ 2 levels "Northern","Southern": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hospitalizations: num  0 0 0 0 0 0 0 0 5 10 ...
##  $ MeanD1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MeanI1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MedianD1        : num  0 36 0 0 0 0 0 0 0 48 ...
##  $ MedianI1        : num  0 37 0 0 0 0 0 0 0 31 ...
##  $ OBYear          : num  1999 1998 2006 2006 2006 ...
##  $ Path1           : Factor w/ 3 levels "No","Unspecified",..: 1 1 2 2 2 2 2 2 1 2 ...
##  $ RateAll         : num  0 39.8 20.8 100 60 ...
##  $ RiskAll         : num  0 108 130 4 25 ...
##  $ StartMonth      : int  11 9 9 10 11 11 11 11 11 11 ...
##  $ Trans1          : Factor w/ 4 levels "Environmental",..: 4 2 2 2 2 2 2 2 4 2 ...
##  $ Vomit           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Setting         : Factor w/ 2 levels "Other","Restaurant": 1 1 1 2 1 2 1 2 1 1 ...

At this step, you should have a data frame containing 952 observations and 21 variables: 1 outcome, 12 numeric/integer predictors, and 8 factor variables. All variables should have values that are ready for analysis. The outcome should be in the first column.

Parallelization

Similar to mlr, caret allows you to use multiple processors at the same time. It is easy to set up; the few lines below do the trick.
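The standard doParallel setup looks like this (a sketch; the core count is up to you):

```r
# Register a parallel backend; caret then distributes the
# cross-validation folds across workers via foreach automatically.
library(doParallel)
ncores <- parallel::detectCores() - 1   # leave one core free
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
# ... model fitting goes here ...
# stopCluster(cl)   # shut the workers down when done
```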

Model fitting

We’ll now explore fitting and training several tree-based models. We’ll also explore how including/excluding missing values and centering/scaling or not might affect the results.

A null model

To define a null model, we need to determine what performance measure we want to track. Since we now have a categorical outcome with more than 2 categories, the regular 2x2 table/confusion matrix, and measurements that rely on it don’t quite work, though many of them have versions that go beyond the 2x2 table. To keep things simple, we will use accuracy, which is simply the fraction of correct predictions. It is easy to compute, no matter how many categories we have. The null model still predicts the most frequent category. We can use that as baseline performance.
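In code, assuming the data have already been split into data_train and data_test (the split itself happens in code not shown here):

```r
# Null model: always predict the most frequent category.
table(data_train$season)                  # category counts, shown below
n_train <- nrow(data_train)
n_train
max(table(data_train$season)) / n_train   # accuracy of always predicting Winter
```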

## 
##   Fall Spring Summer Winter 
##    115    156     88    309
## [1] 668
## [1] 0.4625749

You should find that the null model has an accuracy of around 0.46.

Single predictor models

Now let’s consider single-predictor models, i.e., we’ll fit the outcome to each predictor one at a time to get an idea of the importance of individual predictors. Here, our model will be a tree. Admittedly, a “tree” with only one predictor seems a bit silly, but we can try it anyway; it’s the analogue of a 1-predictor GLM.

We’ll also do some parameter tuning here. Looking at the caret documentation, we find that the tuning parameter for the rpart model (the tree algorithm) is called cp. We could also find that using modelLookup("rpart"). We could either specify a grid of values to try for cp (we’ll use a grid further below), or, since there is only a single tuning parameter, set tuneLength and let caret pick the values automatically. We’ll do the latter here.
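A sketch of that loop; the seed and the 5-fold cross-validation are my choices:

```r
set.seed(1111)                                     # hypothetical seed
fitControl <- caret::trainControl(method = "cv", number = 5)
predictors <- names(data_train)[-1]                # outcome sits in column 1
acc <- vapply(predictors, function(v) {
  fit <- caret::train(stats::reformulate(v, response = "season"),
                      data = data_train, method = "rpart",
                      trControl = fitControl, na.action = na.pass,
                      tuneLength = 10)
  max(fit$results$Accuracy)                        # best CV accuracy for v
}, numeric(1))
data.frame(Variable = predictors, Accuracy = acc)
```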

##            Variable  Accuracy
## 1           Action1 0.4625783
## 2          CasesAll 0.4371258
## 3           Country 0.4523908
## 4            Deaths 0.4625782
## 5          EndMonth 0.6078051
## 6             gg2c4 0.4625783
## 7        Hemisphere 0.4562893
## 8  Hospitalizations 0.4607826
## 9            MeanD1 0.4631575
## 10           MeanI1 0.4649637
## 11         MedianD1 0.4577710
## 12         MedianI1 0.4548254
## 13           OBYear 0.4787562
## 14            Path1 0.4625805
## 15          RateAll 0.4574609
## 16          RiskAll 0.4661742
## 17       StartMonth 0.8921997
## 18           Trans1 0.4556872
## 19            Vomit 0.4625847
## 20          Setting 0.4625804

So it looks like most of the single-predictor models don’t have accuracy much better than the null model. The 2 exceptions are StartMonth and EndMonth. Well, we would expect the outbreak season and the month at which the outbreak started (and ended) to be strongly correlated. I kept those variables here to see if that would happen and to get some reassurance that our model works ok. Of course, in a real analysis, keeping them would be silly; we wouldn’t learn much from it (other than that data entry appears to have been done ok). Note, though, that data entry used 0 for some of the months in both variables.

Full model

Anyway, now let’s fit a tree to the full model with all predictors.
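Something along these lines, again with caret picking the cp values (the name fit1 and the settings are my choices):

```r
set.seed(1111)
fit1 <- caret::train(season ~ ., data = data_train, method = "rpart",
                     trControl = caret::trainControl(method = "cv", number = 5),
                     na.action = na.pass, tuneLength = 10)
fit1$results   # performance for each cp value, shown below
```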

##            cp  Accuracy     Kappa AccuracySD    KappaSD
## 1  0.00000000 0.9733450 0.9608083 0.01419000 0.02099063
## 2  0.02135562 0.9727480 0.9598555 0.01463699 0.02178684
## 3  0.04271123 0.9248201 0.8900274 0.02820307 0.04147341
## 4  0.06406685 0.8921723 0.8424778 0.02326732 0.03397198
## 5  0.08542247 0.8921723 0.8424778 0.02326732 0.03397198
## 6  0.10677809 0.8921723 0.8424778 0.02326732 0.03397198
## 7  0.12813370 0.8921723 0.8424778 0.02326732 0.03397198
## 8  0.14948932 0.8921723 0.8424778 0.02326732 0.03397198
## 9  0.17084494 0.8921723 0.8424778 0.02326732 0.03397198
## 10 0.19220056 0.6630268 0.4042284 0.19975184 0.40034135

This printout shows model performance for different values of the tuning parameter, cp. It seems that for a low cp, we get a close-to-perfect model. Let’s take a look at that model. We could use the regular plot function, but the resulting tree looks ugly. The prp function from the rpart.plot package makes a much nicer tree (other packages for plotting nice trees exist too).
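A sketch of the plotting call, using the fit1 name from the sketch above:

```r
# Plot the final tree; extra = 1 adds per-class counts at each node.
library(rpart.plot)
prp(fit1$finalModel, extra = 1)
```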


So the model splits on month, repeatedly. And it also splits on the hemisphere, which makes sense since the seasons are switched in the Southern hemisphere. This is also likely the reason why the model with month only didn’t produce a perfect fit. It could get one hemisphere right, but not both. With both bits of information, we can get almost perfect predictions.

Note two distinguishing features of this tree model: Even though we fed the algorithm the full model including all predictor variables, some - in fact, almost all - of them are not part of the final model. This is a good illustration of the feature/variable selection property that a tree does automatically. If it doesn’t find a certain predictor variable useful, it leaves it out.

Also, note that some predictor variables are used more than once. In fact, in this example, only 2 variables are used, and they are used repeatedly.

You can also see at each node the predictions for each outcome category at that step. It is always possible to build a tree that predicts perfectly, but that would lead to overfitting. That’s why we use cross-validation and tune the cp parameter, both of which reduce the risk of overfitting.

Finally, note that by default, levels are ordered alphabetically (Fall/Spring/Summer/Winter). We could re-order the factor manually in the more logical Spring/Summer/Fall/Winter order. I would do that for a ‘real’ analysis where I want to show a nice final result, but here we can leave it as is, as long as we pay attention to this fact.

Re-fitting the tree

So the above produces an almost perfect fit, and thus more powerful models are not really needed. However, using start (and end) month information seems a bit like cheating. Of course, if we give the model that kind of information, it will do well. Let’s make it harder and remove those 2 variables from both the training and test sets.

Also, further below I had problems fitting some of the models; the NA values caused trouble. The issue is often that while the underlying algorithm can handle missing values (as you saw above), things break down when using wrappers like caret. I could either skip caret and access the different models directly, or drop the observations with missing values. For now, I decided on the latter. To be able to compare the different models below, I’m dropping those NA observations here. Since the data changed, we also need to re-do the null model computation to be able to compare properly.
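A sketch of both steps:

```r
# Drop the two month variables and all observations with missing values,
# then re-compute the null-model accuracy on the reduced training data.
data_train <- data_train %>%
  dplyr::select(-StartMonth, -EndMonth) %>%
  tidyr::drop_na()
data_test <- data_test %>%
  dplyr::select(-StartMonth, -EndMonth) %>%
  tidyr::drop_na()
max(table(data_train$season)) / nrow(data_train)
```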

## [1] 0.4594595

You should find a very similar null-model accuracy, around 0.46.

Now, let’s re-do the fit above fitting a single tree. I’ll increase the tuneLength value a bit so the algorithm can try a few more parameter values.
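For example (tuneLength = 20 matches the 20 cp values in the table below):

```r
# Re-fit the tree on the reduced data with a finer cp search.
set.seed(1111)
fit1 <- caret::train(season ~ ., data = data_train, method = "rpart",
                     trControl = caret::trainControl(method = "cv", number = 5),
                     tuneLength = 20)
fit1$results
```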

##            Variable Accuracy
## 1           Action1 0.973345
## 2          CasesAll 0.973345
## 3           Country 0.973345
## 4            Deaths 0.973345
## 5             gg2c4 0.973345
## 6        Hemisphere 0.973345
## 7  Hospitalizations 0.973345
## 8            MeanD1 0.973345
## 9            MeanI1 0.973345
## 10         MedianD1 0.973345
## 11         MedianI1 0.973345
## 12           OBYear 0.973345
## 13            Path1 0.973345
## 14          RateAll 0.973345
## 15          RiskAll 0.973345
## 16           Trans1 0.973345
## 17            Vomit 0.973345
## 18          Setting 0.973345
##             cp  Accuracy      Kappa AccuracySD    KappaSD
## 1  0.000000000 0.4349627 0.14956438 0.04807618 0.07191776
## 2  0.001286550 0.4356835 0.14991145 0.04746441 0.07171892
## 3  0.002573099 0.4443585 0.15723013 0.04900668 0.07321439
## 4  0.003859649 0.4515689 0.16392256 0.04696232 0.06916031
## 5  0.005146199 0.4601951 0.16942460 0.04717645 0.06641151
## 6  0.006432749 0.4695162 0.17066011 0.04200015 0.05953566
## 7  0.007719298 0.4763374 0.17016788 0.03717521 0.05259268
## 8  0.009005848 0.4795676 0.16626058 0.03885965 0.05283925
## 9  0.010292398 0.4820901 0.16681684 0.03850084 0.05092273
## 10 0.011578947 0.4828598 0.15790524 0.03389696 0.05164283
## 11 0.012865497 0.4857592 0.14753931 0.03144745 0.05295581
## 12 0.014152047 0.4846942 0.14631266 0.02989002 0.04826591
## 13 0.015438596 0.4825221 0.13794529 0.02972887 0.05587380
## 14 0.016725146 0.4810643 0.12423535 0.02766639 0.06382933
## 15 0.018011696 0.4821357 0.12420558 0.02845738 0.06357778
## 16 0.019298246 0.4807041 0.11557474 0.02787746 0.07155800
## 17 0.020584795 0.4796327 0.10953188 0.02833090 0.07474337
## 18 0.021871345 0.4746200 0.09045855 0.02537425 0.07109261
## 19 0.023157895 0.4731785 0.08490650 0.02291050 0.07080341
## 20 0.024444444 0.4674353 0.06044875 0.01713106 0.06811923


With those variables removed, the tree model doesn’t perform very well; accuracy is similar to the null model’s. You can see in the tree figure that the final nodes (leaves) contain a lot of mis-predictions. Note again that only a few variables are used to build the tree, and that OBYear shows up more than once.

Let’s see if we can get improved performance with more complicated models. That is, of course, not guaranteed. If there is no “signal” in the data, it doesn’t matter how complicated the model is. We won’t get anything predictive.

Random forest

Let’s try a random forest. We’ll use the ranger algorithm for this (it’s generally faster than the rf algorithm). This model has parameters that can and should be tuned. To do the tuning, we set up a grid of parameter values and search over that grid. More efficient approaches exist, e.g., optimization with something like a genetic algorithm. The mlr package lets you do that; caret doesn’t have those features out of the box, so you would need to write your own code.

Note that running the code below (and several models that follow) might take a few seconds or minutes, depending on the speed of your computer.
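A sketch of the grid search; the grid values and the number of trees are illustrative choices, not taken from the original:

```r
# Random forest via ranger; caret tunes mtry, splitrule, and min.node.size.
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8),
                       splitrule = "gini",
                       min.node.size = c(2, 5, 8, 11))
set.seed(1111)
fit2 <- caret::train(season ~ ., data = data_train, method = "ranger",
                     trControl = caret::trainControl(method = "cv", number = 5),
                     tuneGrid = rf_grid, num.trees = 300)
plot(fit2)   # performance as a function of the tuning parameters
```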

We can’t plot a nice final tree anymore since the model is now a combination of trees. We can look at a plot of model performance as a function of the model tuning parameters.

This plot suggests that around 4 randomly selected predictors and a minimum node size of 5 or 6 give the best model; the exact optimum can shift slightly between runs. Note that there isn’t much difference in model performance across the different tuning-parameter values, and the overall model isn’t that great either, with an accuracy just shy of 0.5.

Boosted tree ensemble

Let’s now try a boosted regression tree ensemble. This method also has several parameters that need tuning, which are specified in gbmGrid.
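A sketch; the name gbmGrid comes from the text above, but the grid values themselves are illustrative:

```r
gbmGrid <- expand.grid(n.trees = c(100, 300, 500),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(1111)
fit3 <- caret::train(season ~ ., data = data_train, method = "gbm",
                     trControl = caret::trainControl(method = "cv", number = 5),
                     tuneGrid = gbmGrid, verbose = FALSE)
plot(fit3)
```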

We can again look at diagnostic fits and variable importance plots for this model, as well as check performance.

It doesn’t look like the boosted tree model performs any better. It may be that no information in the predictors strongly correlates with the season.

But let’s not give up yet; we’ll try a bit more. Let’s see if pre-processing helps.
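The simplest route is caret’s preProcess argument, here applied to the random forest from above (a sketch):

```r
# Same random forest, but with predictors centered and scaled first.
set.seed(1111)
fit4 <- caret::train(season ~ ., data = data_train, method = "ranger",
                     trControl = caret::trainControl(method = "cv", number = 5),
                     tuneGrid = rf_grid, num.trees = 300,
                     preProcess = c("center", "scale"))
```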

Discriminant analysis

Ok, let’s try one more. Since the trees don’t seem to work too well, and a logistic model can’t handle an outcome with more than 2 categories, let’s use another model that can, namely a discriminant analysis. We’ve already seen one of those previously; let’s pick a different one. The caret package website lists a lot of algorithms; I don’t know exactly how they all differ. Let’s try the one called penalized discriminant analysis (pda). (Penalized is another way of saying regularized, and we learned that regularization can sometimes help.)
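A sketch of the pda fit; caret tunes its lambda penalty:

```r
set.seed(1111)
fit5 <- caret::train(season ~ ., data = data_train, method = "pda",
                     trControl = caret::trainControl(method = "cv", number = 5),
                     tuneLength = 10)
fit5
```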

Ok, that doesn’t look very good either. Alright, enough models, there just doesn’t seem to be much information in the data that could help predict season. So let’s start to wrap up.

Comparing model performance

To recap: after we removed the obvious predictors, we fitted 1 tree, 1 random forest, 1 gradient-boosted tree ensemble, 1 random forest with pre-processed predictors, and 1 discriminant analysis. Let’s compare their performance.
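caret’s resamples() makes this comparison easy, provided all models used the same resampling folds (here ensured by setting the same seed before each fit); the fit names are the ones from my sketches:

```r
resamps <- caret::resamples(list(tree = fit1, rf = fit2, gbm = fit3,
                                 rf_pp = fit4, pda = fit5))
summary(resamps)
lattice::bwplot(resamps)   # box plots of accuracy and kappa per model
```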

This plot suggests that while none of the models are great, the random forest ones seem overall slightly better.

Evaluating the final model

Even though we didn’t really find any good model, and further evaluation therefore seems somewhat pointless, for this exercise we’ll look at our “best” model anyway. We’ll declare the first random forest model (saved as fit2) to be the best; I’m not really seeing a difference between my two random forests.

Model uncertainty in the predictions is shown in the previous plot.

For categorical outcomes, we can’t plot residuals. But we can look at the final “2x2” (in this case, a 4x4) table (the confusion matrix) and compare true versus predicted outcomes and see if it shows any pattern.
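In code, using the fit2 model named above:

```r
# Confusion matrix on the training data.
train_pred <- predict(fit2, newdata = data_train)
caret::confusionMatrix(data = train_pred, reference = data_train$season)
```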

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fall Spring Summer Winter
##     Fall     79      3      3      1
##     Spring    1    112      1      0
##     Summer    0      0     64      0
##     Winter   14     18      5    254
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9171         
##                  95% CI : (0.891, 0.9387)
##     No Information Rate : 0.4595         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.876          
##                                          
##  Mcnemar's Test P-Value : 6.346e-07      
## 
## Statistics by Class:
## 
##                      Class: Fall Class: Spring Class: Summer Class: Winter
## Sensitivity               0.8404        0.8421        0.8767        0.9961
## Specificity               0.9848        0.9953        1.0000        0.8767
## Pos Pred Value            0.9186        0.9825        1.0000        0.8729
## Neg Pred Value            0.9680        0.9524        0.9817        0.9962
## Prevalence                0.1694        0.2396        0.1315        0.4595
## Detection Rate            0.1423        0.2018        0.1153        0.4577
## Detection Prevalence      0.1550        0.2054        0.1153        0.5243
## Balanced Accuracy         0.9126        0.9187        0.9384        0.9364

You should see that the model gets a lot right (high numbers on the diagonal), but predicts winter more often than it should (last row). Accuracy in the training data is pretty good.

But we know that performance on the training set is not that meaningful; especially with a complex model, we almost always get fairly good performance on the data used to build it. What matters is how the model performs on new data, so let’s check that. We can look at the same confusion matrix for the test set, which gives a better estimate of the true performance.
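The same computation on the test set:

```r
# Confusion matrix on the test data, the honest performance estimate.
test_pred <- predict(fit2, newdata = data_test)
caret::confusionMatrix(data = test_pred, reference = data_test$season)
```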

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fall Spring Summer Winter
##     Fall     12     10      5     10
##     Spring    5      7      6     11
##     Summer    1      2      5      3
##     Winter   23     35     14     91
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4792          
##                  95% CI : (0.4145, 0.5444)
##     No Information Rate : 0.4792          
##     P-Value [Acc > NIR] : 0.5254          
##                                           
##                   Kappa : 0.1535          
##                                           
##  Mcnemar's Test P-Value : 2.432e-05       
## 
## Statistics by Class:
## 
##                      Class: Fall Class: Spring Class: Summer Class: Winter
## Sensitivity               0.2927       0.12963       0.16667        0.7913
## Specificity               0.8744       0.88172       0.97143        0.4240
## Pos Pred Value            0.3243       0.24138       0.45455        0.5583
## Neg Pred Value            0.8571       0.77725       0.89083        0.6883
## Prevalence                0.1708       0.22500       0.12500        0.4792
## Detection Rate            0.0500       0.02917       0.02083        0.3792
## Detection Prevalence      0.1542       0.12083       0.04583        0.6792
## Balanced Accuracy         0.5835       0.50568       0.56905        0.6077

Here, the confusion matrix doesn’t look good at all. Off-diagonal entries are large, accuracy is about as good as the null model’s, and this agrees with the cross-validated results. That last point is good; it means our cross-validation routine properly estimates the performance on “new” data. Of course, even this estimate on test data is a bit optimistic, since truly new data are usually collected under slightly different circumstances and with different protocols, making reduced model performance likely.

Wrapping up

Overall, this suggests there is just no “signal” in the variables we are trying to use to predict the outcome; the model fits “noise”, thus doing ok on the training data but failing in the test/cross-validated performance evaluation.

We can conclude that this is an analysis where we walk away “empty handed”. If this were a real analysis and you had data and did a ‘let’s see if something is going on’ analysis, this would be somewhat disappointing. If you had a specific, plausible hypothesis before you started your analysis (e.g., that outbreak size is correlated with season), finding that your biologically reasonable hypothesis doesn’t seem to be right is useful. Unfortunately, null results are still hard to publish.

Note that we didn’t use p-values here, but I bet we would have found several that are statistically significant when evaluated on the training data (as they usually are). They would be meaningless, though, essentially “fitting the noise”. I contend that a lot of what’s published out there is exactly that: spurious p-values because models are evaluated on the training data.

End notes

Above, I kept the search grids relatively small and fixed other settings (e.g., the number of trees) to make sure things run reasonably fast. For a real project, I would make my searches and tuning more exhaustive, which might mean waiting a few hours for the code to run. If you want to, try increasing the search space or different models to see if you can get a better-performing model. It doesn’t seem to me that the data contain any meaningful signal that would lead to a model with much-improved performance, but you are welcome to see if you can find one. If you do get a model that performs well (accuracy > 0.6), let me know. I’d be curious to see it.