This document will guide you through some data analysis tasks, with a focus on fitting and tuning tree-based models. We’ll also (re)visit some other topics.
While this is in some sense a stand-alone analysis, I assume that you have worked through the Data Analysis exercise and are familiar with the dataset and everything we discovered during the cleaning process. We’ll use the same dataset here but focus on a different outcome. Other than that, you work through this exercise the same way as the Data Analysis exercise, namely by writing/completing the missing code.
We need a variety of different packages, which are loaded here. Install them as needed. For this analysis, we’ll again use the caret package. If you use others, load them here.
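The loading code itself isn’t shown; a minimal sketch covering the packages used later in this document (the exact set in the original setup chunk may differ) would be:
#load packages used in this exercise; install any missing ones with install.packages()
library(tidyverse) #readr, dplyr, ggplot2, forcats, stringr, tidyr
library(caret) #model training and tuning wrapper
library(rpart) #single classification trees
library(rpart.plot) #nicer tree plots via prp()
library(ranger) #fast random forest implementation
library(gbm) #gradient boosted trees
library(mda) #penalized discriminant analysis (pda)
library(visdat) #vis_dat()/vis_cor() overview plots
library(doParallel) #parallel backend for caret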
We will again use the Norovirus dataset.
#Write code that loads the dataset
#You can of course re-use code you wrote in the other file.
d <- read_csv("norodata.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Author = col_character(),
## EpiCurve = col_character(),
## TDComment = col_character(),
## AHComment = col_character(),
## Trans1 = col_character(),
## Trans2 = col_character(),
## Trans2_O = col_character(),
## Trans3 = col_character(),
## Trans3_O = col_character(),
## Vehicle_1 = col_character(),
## Veh1 = col_character(),
## Veh1_D_1 = col_character(),
## Veh2 = col_character(),
## Veh2_D_1 = col_character(),
## Veh3 = col_character(),
## Veh3_D_1 = col_character(),
## PCRSect = col_character(),
## OBYear = col_character(),
## Hemisphere = col_character(),
## season = col_character()
## # ... with 44 more columns
## )
## See spec(...) for full column specifications.
## Warning: 2 parsing failures.
## row col expected actual file
## 1022 CD a double GGIIb 'norodata.csv'
## 1022 gge a double Sindlesham 'norodata.csv'
For this analysis, we consider as our main outcome of interest the season. This is a categorical outcome with more than 2 categories, something we haven’t looked at before. Because it’s more than 2 categories, a basic logistic model won’t work. Fortunately, tree-based models can deal with multiple categories.
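The quick look at the outcome isn’t shown as code; the output and warning below presumably come from something along these lines:
str(d$season) #quick look at the outcome variable
d %>% ggplot(aes(x = season)) + geom_histogram(stat = "count") #outbreak counts per season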
## chr [1:1022] "Fall" "Fall" "Fall" "Fall" "Fall" "Fall" "Fall" "Fall" ...
## Warning: Ignoring unknown parameters: binwidth, bins, pad
We already knew from previous explorations that some entries do not have a season. We could either code them as “other” and keep them in the model, or remove them. Since it’s hard to see any scientific reason why an “other” season should correlate with any of the predictors, we’ll remove those observations here.
We also notice that while winter is dominant (makes sense, we know that norovirus is more common in winter), we got a decent number of outbreaks for each season, so we shouldn’t have a problem with (un)balanced data.
#write code that removes all observations that have an empty/missing value for season
#then drop the empty level and check that you have 4 categories for season left
d <- d %>%
filter(!is.na(season))
d %>%
ggplot(aes(season)) +
geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
We will pick similar variables as previously, but with some changes. Keep the following variables: Action1, CasesAll, Country, Deaths, EndMonth, GG2C4, Hemisphere, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, OBYear, Path1, RateAll, RiskAll, Season, Setting, StartMonth, State, Trans1, Vomit.
We’ll have to perform the usual cleaning steps. You might have realized by now that even for the same dataset, cleaning steps can differ based on the outcome (and as you see below, the model).
Let’s first check for missing values.
## chr [1:955] "Yes" NA "Yes" NA "Yes" NA NA NA "Yes" NA NA NA NA NA NA ...
Looks like none of the new variables we included have a ton of missing values, so we could probably just remove any observation that has missing data. However, tree-based models can deal with missing data in predictors, so we’ll keep those observations for now. We’ll later compare how the model does or does not change if we remove them.
Let’s make sure everything has the right format (numeric/integer/factor). Adjust/recode variables as needed. You will likely find that as you convert OBYear to numeric, something doesn’t quite work. Take a look. Fix it by removing the observation with the troublesome entry, then convert to numeric. Finally, remove the observations that have 0 as OBYear - there is more than one of those now.
#write code that cleans OBYear, convert it to numeric. Remove observations with OBYear = 0.
#also convert any other variables as needed
unique(d$OBYear)
## [1] "1999" "1998" "2006" "2004" "1993"
## [6] "2002" "2005" "1997" "2003" "1994"
## [11] "2008" "2000" "2001" "1995" "1996"
## [16] "2007" "2009" "1990" "0" "1983"
## [21] "2010" "1992" "2002-2007"
d <- d %>%
filter(OBYear != "2002-2007", OBYear != "0")
d$OBYear <- as.numeric(d$OBYear)
Look at the data to see what else we need to do.
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 953 obs. of 22 variables:
## $ Action1 : chr "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
## $ CasesAll : num 15 65 27 4 15 6 40 10 116 45 ...
## $ Country : chr "Japan" "USA" "Other" "Other" ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EndMonth : num 12 9 0 0 0 0 0 0 11 11 ...
## $ gg2c4 : chr "Yes" "No" "Yes" "No" ...
## $ Hemisphere : chr "Northern" "Northern" "Northern" "Northern" ...
## $ Hospitalizations: num 0 0 0 0 0 0 0 0 5 10 ...
## $ MeanD1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MeanI1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MedianD1 : num 0 36 0 0 0 0 0 0 0 48 ...
## $ MedianI1 : num 0 37 0 0 0 0 0 0 0 31 ...
## $ OBYear : num 1999 1998 2006 2006 2006 ...
## $ Path1 : chr "No" "No" "Unspecified" "Unspecified" ...
## $ RateAll : num 0 39.8 20.8 100 60 ...
## $ RiskAll : num 0 108 130 4 25 ...
## $ season : chr "Fall" "Fall" "Fall" "Fall" ...
## $ Setting_1 : chr "Daycare Center" "Boxed lunch, football game" "buffet" "restaurant" ...
## $ StartMonth : num 11 9 9 10 11 11 11 11 11 11 ...
## $ State : chr "0" "NC, FL" "0" "0" ...
## $ Trans1 : chr "Unspecified" "Foodborne" "Foodborne" "Foodborne" ...
## $ Vomit : num 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 5 variables:
## ..$ row : int 1022 1022
## ..$ col : chr "CD" "gge"
## ..$ expected: chr "a double" "a double"
## ..$ actual : chr "GGIIb" "Sindlesham"
## ..$ file : chr "'norodata.csv'" "'norodata.csv'"
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. Author = col_character(),
## .. Pub_Year = col_double(),
## .. pubmedid = col_double(),
## .. EpiCurve = col_character(),
## .. TDComment = col_character(),
## .. AHComment = col_character(),
## .. Trans1 = col_character(),
## .. Trans1_O = col_double(),
## .. Trans2 = col_character(),
## .. Trans2_O = col_character(),
## .. Trans3 = col_character(),
## .. Trans3_O = col_character(),
## .. Risk1 = col_double(),
## .. Risk2 = col_double(),
## .. RiskAll = col_double(),
## .. Cases1 = col_double(),
## .. Cases2 = col_double(),
## .. CasesAll = col_double(),
## .. Rate1 = col_double(),
## .. Rate2 = col_double(),
## .. RateAll = col_double(),
## .. Hospitalizations = col_double(),
## .. Deaths = col_double(),
## .. Vehicle_1 = col_character(),
## .. Veh1 = col_character(),
## .. Veh1_D_1 = col_character(),
## .. Veh2 = col_character(),
## .. Veh2_D_1 = col_character(),
## .. Veh3 = col_character(),
## .. Veh3_D_1 = col_character(),
## .. PCRSect = col_character(),
## .. OBYear = col_character(),
## .. Hemisphere = col_character(),
## .. season = col_character(),
## .. MeanI1 = col_double(),
## .. MedianI1 = col_double(),
## .. Range_S_I1 = col_double(),
## .. Range_L_I1 = col_double(),
## .. MeanD1 = col_double(),
## .. MedianD1 = col_double(),
## .. Range_S_D1 = col_double(),
## .. Range_L_D1 = col_double(),
## .. MeanA1 = col_double(),
## .. MedianA1 = col_double(),
## .. Range_Y_A1 = col_character(),
## .. Range_O_A1 = col_double(),
## .. Action1 = col_character(),
## .. Action2_1 = col_character(),
## .. Secondary = col_character(),
## .. MeanI2 = col_double(),
## .. MedianI2 = col_double(),
## .. Range_S_I2 = col_double(),
## .. Range_L_I2 = col_double(),
## .. MeanD2 = col_double(),
## .. MedianD2 = col_double(),
## .. Range_S_D2 = col_double(),
## .. Range_L_D2 = col_double(),
## .. `Mea 2` = col_double(),
## .. `Media 2` = col_double(),
## .. Range_Y_A2 = col_double(),
## .. Range_O_A2 = col_double(),
## .. Comments_1 = col_character(),
## .. Path1 = col_character(),
## .. Path2_1 = col_character(),
## .. Country = col_character(),
## .. Category = col_character(),
## .. State = col_character(),
## .. Setting_1 = col_character(),
## .. StartMonth = col_double(),
## .. EndMonth = col_double(),
## .. GGA = col_double(),
## .. CA = col_double(),
## .. SA = col_character(),
## .. new_GGA = col_double(),
## .. new_CA = col_double(),
## .. new_SA = col_character(),
## .. SA_resolved_from = col_character(),
## .. GGB = col_double(),
## .. CB = col_character(),
## .. SB = col_character(),
## .. new_GGB = col_double(),
## .. new_CB = col_double(),
## .. new_SB = col_character(),
## .. SB_resolved_from = col_character(),
## .. GGC = col_double(),
## .. CC = col_double(),
## .. SC = col_character(),
## .. new_ggc = col_double(),
## .. new_cc = col_double(),
## .. new_sc = col_character(),
## .. SC_resolved_from = col_character(),
## .. GGD = col_double(),
## .. CD = col_double(),
## .. SD = col_character(),
## .. new_ggd = col_double(),
## .. new_cd = col_double(),
## .. new_sd = col_double(),
## .. SD_resolved_from = col_logical(),
## .. StrainOther = col_character(),
## .. strainother_rc = col_character(),
## .. gge = col_double(),
## .. ce = col_double(),
## .. se = col_character(),
## .. SE_resolved_from = col_character(),
## .. ggf = col_double(),
## .. cf = col_double(),
## .. sf = col_character(),
## .. ggg = col_double(),
## .. cg = col_double(),
## .. sg = col_character(),
## .. ggh = col_double(),
## .. ch = col_double(),
## .. sh = col_character(),
## .. ggi = col_double(),
## .. ci = col_double(),
## .. si = col_character(),
## .. ggj = col_double(),
## .. cj = col_double(),
## .. sj = col_character(),
## .. Country2 = col_character(),
## .. Veh1_D_2 = col_character(),
## .. Veh2_D_2 = col_character(),
## .. Veh3_D_2 = col_character(),
## .. Action2_2 = col_character(),
## .. Comments_2 = col_character(),
## .. Path2_2 = col_character(),
## .. Setting_2 = col_character(),
## .. category1 = col_character(),
## .. strainothergg2c4 = col_double(),
## .. gg2c4 = col_character(),
## .. Vomit = col_double(),
## .. IncInd = col_double(),
## .. SymInd = col_double(),
## .. PooledLat = col_double(),
## .. PooledSym = col_double(),
## .. PooledAge = col_double(),
## .. IndividualLatent = col_logical(),
## .. IndividualSymptomatic = col_character()
## .. )
## Action1 CasesAll Country Deaths
## Length:953 Min. : 1 Length:953 Min. :0.00000
## Class :character 1st Qu.: 9 Class :character 1st Qu.:0.00000
## Mode :character Median : 25 Mode :character Median :0.00000
## Mean : 129 Mean :0.05379
## 3rd Qu.: 64 3rd Qu.:0.00000
## Max. :32150 Max. :9.00000
## NA's :5 NA's :42
## EndMonth gg2c4 Hemisphere Hospitalizations
## Min. : 0.000 Length:953 Length:953 Min. : 0.0000
## 1st Qu.: 0.000 Class :character Class :character 1st Qu.: 0.0000
## Median : 0.000 Mode :character Mode :character Median : 0.0000
## Mean : 2.559 Mean : 0.7113
## 3rd Qu.: 4.000 3rd Qu.: 0.0000
## Max. :12.000 Max. :125.0000
## NA's :42
## MeanD1 MeanI1 MedianD1 MedianI1
## Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.0000 Median : 0.000 Median : 0.000
## Mean : 1.558 Mean : 0.7125 Mean : 2.611 Mean : 1.703
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :273.600 Max. :48.0000 Max. :235.200 Max. :65.000
##
## OBYear Path1 RateAll RiskAll
## Min. :1983 Length:953 Min. : 0.00 Min. : 0.0
## 1st Qu.:2000 Class :character 1st Qu.: 0.00 1st Qu.: 0.0
## Median :2003 Mode :character Median : 16.50 Median : 20.5
## Mean :2002 Mean : 27.04 Mean : 399.7
## 3rd Qu.:2005 3rd Qu.: 49.00 3rd Qu.: 110.8
## Max. :2010 Max. :105.00 Max. :35000.0
## NA's :103 NA's :115
## season Setting_1 StartMonth State
## Length:953 Length:953 Min. : 0.000 Length:953
## Class :character Class :character 1st Qu.: 2.000 Class :character
## Mode :character Mode :character Median : 5.000 Mode :character
## Mean : 5.894
## 3rd Qu.:10.000
## Max. :12.000
##
## Trans1 Vomit
## Length:953 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :1.0000
## Mean :0.5074
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :1
## [1] "0"
## [2] "NC, FL"
## [3] "FL"
## [4] "LA, MD, MS, NC"
## [5] "AK"
## [6] "WI"
## [7] "WA, FL"
## [8] "TX"
## [9] "WV, MD, FL, NY, PA, VA"
## [10] "MD, TX, NC, PA, MS"
## [11] "NC"
## [12] "LA, MS, MD, NC"
## [13] "OR"
## [14] "VA"
## [15] "WY"
## [16] "GA"
## [17] "UT"
## [18] "PA"
## [19] "CA"
## [20] "MI"
## [21] "1"
## [22] "NY"
## [23] "CO"
## [24] "MA"
## [25] "HI"
## [26] "AZ"
## [27] "MD"
## [28] "KY"
## [29] "OH"
## [30] "14 states: CA, UT, KS, WI, IL, IN, OH, GA, FL, NC, VA, WV, NY, PA,"
## [31] "LA"
## [32] "SC"
## [33] "VT"
## [34] "DC"
## [35] "WA"
## [36] "WV"
Some issues we noted previously: we need to remove the Unspecified entry in Hemisphere, and recode Action1 and Path1 as described in the Data Analysis exercise, i.e., from Unknown to Unspecified. Also, we want to group the Setting_1 variable into just Restaurant and Other. Again, remember that there are both restaurant and Restaurant values, so you need to fix that too.
We’ll also recode the blank gg2c4 entries to “No”. We further note that Action1 has a single No entry; let’s remove that observation to prevent potential problems during cross-validation.
Let’s also lump Country together into 3 categories: Japan, USA, and Other.
As discussed previously, it makes sense to move Waterborne into Environmental in the Trans1 variable. It also turns out that most outbreaks have no information on state, so it’s best to drop the State variable.
Finally, move the outcome into the first column.
# write code that performs the actions described above
#remove the Unspecified entry in Hemisphere. There was no Unspecified entry, so I just ended up changing Hemisphere to a factor.
unique(d$Hemisphere)
d$Hemisphere <- as.factor(d$Hemisphere)
levels(d$Hemisphere)
## [1] "Northern" "Southern"
## [1] "Northern" "Southern"
#recode Action1 and Path1:
d$Action1 <- as.factor(d$Action1)
d$Action1 <- fct_collapse(d$Action1, Unspecified = c("Unknown", "Unspecified"), Yes = "Yes", No = "No")
d <- d %>%
filter(Action1 != "No")
d$Action1 <- droplevels(d$Action1)
levels(d$Action1)
## [1] "Unspecified" "Yes"
## [1] "No" "Unknown" "Unspecified" "Yes"
d$Path1 <- fct_collapse(d$Path1, Unspecified = c("Unknown", "Unspecified"), Yes = "Yes", No = "No")
levels(d$Path1)
## [1] "No" "Unspecified" "Yes"
#Group setting_1 into only two factors
d <- d %>%
mutate(Setting = ifelse(stringr::str_detect(Setting_1, "[R|r]est*"), "Restaurant", "Other")) %>%
select(-Setting_1)
d$Setting <- as.factor(d$Setting)
str(d$Setting)
## Factor w/ 2 levels "Other","Restaurant": 1 1 1 2 1 2 1 2 1 1 ...
## [1] "Australia" "Ca da" "Chi" "Croatia" "Denmark"
## [6] "France" "Iraq" "Israel" "Italy" "Japan"
## [11] "Multiple" "New Zealand" "Norway" "Other" "Scotland"
## [16] "Spain" "UK" "Unspecified" "USA"
d$Country <- fct_collapse(d$Country, Other = c("Australia", "Ca da", "Chi", "Croatia", "Denmark", "France", "Iraq", "Israel", "Italy", "Multiple", "New Zealand", "Norway", "Other", "Scotland", "Spain", "UK", "Unspecified"), USA = "USA", Japan = "Japan")
#Combine Environmental and Waterborne transmission --- I also moved Unknown and Unspecified into the same level of Unspecified
d$Trans1 <- as.factor(d$Trans1)
levels(d$Trans1)
## [1] "Environmental" "Foodborne" "Person to Person"
## [4] "Unknown" "Unspecified" "Waterborne"
d$Trans1 <- fct_collapse(d$Trans1, Environmental = c("Environmental", "Waterborne"), Unspecified = c("Unknown", "Unspecified"))
levels(d$Trans1)
## [1] "Environmental" "Foodborne" "Person to Person"
## [4] "Unspecified"
##
## 0
## 841
## 1
## 1
## 14 states: CA, UT, KS, WI, IL, IN, OH, GA, FL, NC, VA, WV, NY, PA,
## 1
## AK
## 9
## AZ
## 2
## CA
## 2
## CO
## 1
## DC
## 1
## FL
## 7
## GA
## 2
## HI
## 2
## KY
## 1
## LA
## 1
## LA, MD, MS, NC
## 1
## LA, MS, MD, NC
## 1
## MA
## 3
## MD
## 2
## MD, TX, NC, PA, MS
## 1
## MI
## 6
## NC
## 18
## NC, FL
## 1
## NY
## 10
## OH
## 3
## OR
## 4
## PA
## 3
## SC
## 1
## TX
## 5
## UT
## 1
## VA
## 6
## VT
## 2
## WA
## 1
## WA, FL
## 1
## WI
## 5
## WV
## 1
## WV, MD, FL, NY, PA, VA
## 1
## WY
## 4
d <- d %>%
select(-State)
#change the remaining character variables into factor variables
d$gg2c4 <- as.factor(d$gg2c4)
d$season <- as.factor(d$season)
d$Vomit <- as.factor(d$Vomit)
#keep EndMonth and StartMonth as integers (a value of 0 means the month was not recorded)
d$EndMonth <- as.integer(d$EndMonth)
d$StartMonth <- as.integer(d$StartMonth)
#Move the outcome, season, to the first column
d <- d %>% select(season, everything())
#At the end drop empty levels.
d <- droplevels(d)
visdat::vis_dat(d)
numeric_vars <- d %>%
select(c(CasesAll, Deaths, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, RiskAll, RateAll, OBYear, StartMonth, EndMonth))
visdat::vis_cor(numeric_vars)
We know our data fairly well by now, so we might not discover much new in any plots, but it’s always good to do them anyway. Let’s create a few plots showing the outcome and the predictors. We’ll start with the continuous predictors. I suggest scatter/box/violin plots with the outcome on the x-axis.
#write code that produces plots showing our outcome of interest on the x-axis and each numeric predictor on the y-axis.
#you can use the facet_wrap functionality in ggplot for it, or do it some other way.
d %>%
gather(c(CasesAll, Deaths, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, RiskAll, RateAll, OBYear, StartMonth, EndMonth), key = "var", value = "value") %>%
ggplot(aes(x = season, y = value)) +
geom_violin() +
facet_wrap( ~ var, scales = "free_y")
## Warning: Removed 305 rows containing non-finite values (stat_ydensity).
Things look ok, apart from the skew in the predictors we discussed previously.
Next, let’s create plots for the categorical variables. You can use, for instance, geom_count for this, or some other representation. If you prefer lots of tables, that’s ok too.
#write code that produces plots or tables showing our outcome of interest and each categorical predictor.
d %>%
gather(c(Action1, Country, gg2c4, Hemisphere, Path1, Trans1, Vomit, Setting), key = "var", value = "value") %>%
ggplot(aes(x = season, y = value)) +
geom_count() +
facet_wrap( ~ var, scales = "free")
## Warning: attributes are not identical across measure variables;
## they will be dropped
Looks ok. Notice the NA for vomiting. We are ok with that for now.
## [1] 952 21
## Classes 'tbl_df', 'tbl' and 'data.frame': 952 obs. of 21 variables:
## $ season : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Action1 : Factor w/ 2 levels "Unspecified",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ CasesAll : num 15 65 27 4 15 6 40 10 116 45 ...
## $ Country : Factor w/ 3 levels "Other","Japan",..: 2 3 1 1 1 1 1 1 1 3 ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EndMonth : int 12 9 0 0 0 0 0 0 11 11 ...
## $ gg2c4 : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 1 1 2 1 ...
## $ Hemisphere : Factor w/ 2 levels "Northern","Southern": 1 1 1 1 1 1 1 1 1 1 ...
## $ Hospitalizations: num 0 0 0 0 0 0 0 0 5 10 ...
## $ MeanD1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MeanI1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MedianD1 : num 0 36 0 0 0 0 0 0 0 48 ...
## $ MedianI1 : num 0 37 0 0 0 0 0 0 0 31 ...
## $ OBYear : num 1999 1998 2006 2006 2006 ...
## $ Path1 : Factor w/ 3 levels "No","Unspecified",..: 1 1 2 2 2 2 2 2 1 2 ...
## $ RateAll : num 0 39.8 20.8 100 60 ...
## $ RiskAll : num 0 108 130 4 25 ...
## $ StartMonth : int 11 9 9 10 11 11 11 11 11 11 ...
## $ Trans1 : Factor w/ 4 levels "Environmental",..: 4 2 2 2 2 2 2 2 4 2 ...
## $ Vomit : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Setting : Factor w/ 2 levels "Other","Restaurant": 1 1 1 2 1 2 1 2 1 1 ...
At this step, you should have a dataframe containing 952 observations, and 21 variables: 1 outcome, 12 numeric/integer predictors, and 8 factor variables. All variables should have values that are ready for analysis. The outcome should be in the 1st slot.
Let’s do data splitting again. Use caret functions (or any other way you like) to split the data into 70/30 train/test portions.
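The splitting code itself isn’t shown; a minimal sketch using caret’s createDataPartition (the specific seed is my choice, not from the original) could look like this:
set.seed(123) #seed value is an arbitrary choice for reproducibility
train_idx <- caret::createDataPartition(d$season, p = 0.7, list = FALSE) #stratified 70% sample
data_train <- d[train_idx, ]
data_test <- d[-train_idx, ]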
Similar to mlr, caret allows you to use multiple processors at the same time. It is easy to set up, and the few lines below do the trick.
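Those setup lines aren’t shown; a sketch using doParallel (the cluster object is named cl to match the stopCluster(cl) call at the end of the analysis; the core count is my own choice) would be:
library(doParallel)
n_cores <- parallel::detectCores() - 1 #leave one core free; this choice is an assumption
cl <- makeCluster(n_cores)
registerDoParallel(cl) #caret will now run resampling in parallel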
We’ll now explore fitting and training several tree-based models. We’ll also explore how including/excluding missing values and centering/scaling or not might affect the results.
To define a null model, we need to determine what performance measure we want to track. Since we now have a categorical outcome with more than 2 categories, the regular 2x2 table/confusion matrix, and measurements that rely on it don’t quite work, though many of them have versions that go beyond the 2x2 table. To keep things simple, we will use accuracy, which is simply the fraction of correct predictions. It is easy to compute, no matter how many categories we have. The null model still predicts the most frequent category. We can use that as baseline performance.
#write code that computes accuracy for a null model that always predicts the most common category
table(data_train$season) #winter is the most common category with 309 observations
nrow(data_train) #total number of training observations
max(table(data_train$season)) / nrow(data_train) #null model accuracy: always predict the most common season (Winter)
##
## Fall Spring Summer Winter
## 115 156 88 309
## [1] 668
## [1] 0.4625749
You should find that the null model has an accuracy of around 0.46.
Now let’s consider single predictor models, i.e., we’ll fit the outcome to each predictor one at a time to get an idea of the importance of individual predictors. Here, our model will be a tree. I’m actually not so sure if this makes a lot of sense since a “tree” with only one predictor seems a bit silly. But I guess we can try. It’s similar to a 1-predictor GLM.
We’ll also do some parameter tuning here. Looking at the caret documentation, we find that the tuning parameter for the rpart model (which is the tree algorithm) is called cp. We could also find that using modelLookup("rpart"). We could either specify a grid of values to try for cp (we’ll use a grid below), or, for a single tuning parameter, caret allows one to set the number of values to try and picks those values automatically. We’ll do the latter approach here.
#There is probably a nicer tidyverse way of doing this. I just couldn't think of it, so did it this way.
set.seed(1111) #makes each code block reproducible
outcomename = "season"
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5) #setting CV method for caret
Npred <- ncol(data_train)-1 # number of predictors
resultmat <- data.frame(Variable = names(data_train)[-1], Accuracy = rep(0,Npred)) #store performance for each variable
for (n in 2:ncol(data_train)) #loop over each predictor. For this to work, outcome must be in 1st column
{
fit1 <- train( as.formula(paste(outcomename, "~",names(data_train)[n])) , data = data_train, method = "rpart", trControl = fitControl, na.action = na.pass, tuneLength = 10)
resultmat[n-1,2]= max(fit1$results$Accuracy)
}
print(resultmat)
## Variable Accuracy
## 1 Action1 0.4625783
## 2 CasesAll 0.4371258
## 3 Country 0.4523908
## 4 Deaths 0.4625782
## 5 EndMonth 0.6078051
## 6 gg2c4 0.4625783
## 7 Hemisphere 0.4562893
## 8 Hospitalizations 0.4607826
## 9 MeanD1 0.4631575
## 10 MeanI1 0.4649637
## 11 MedianD1 0.4577710
## 12 MedianI1 0.4548254
## 13 OBYear 0.4787562
## 14 Path1 0.4625805
## 15 RateAll 0.4574609
## 16 RiskAll 0.4661742
## 17 StartMonth 0.8921997
## 18 Trans1 0.4556872
## 19 Vomit 0.4625847
## 20 Setting 0.4625804
So it looks like most of the single-predictor models don’t have accuracy much better than the null model. The two exceptions are StartMonth and EndMonth. Well, we would expect the outbreak season and the month in which the outbreak started (and ended) to be strongly correlated. I kept those variables here to see if that would happen and to get some reassurance that our model works ok. Of course, in a real analysis, keeping them seems silly; we wouldn’t learn much from it (other than that data entry appears to have been done ok). Note that the data entry used 0 for some of the months in both variables.
Anyway, now let’s fit a tree to the full model with all predictors.
set.seed(1111) #makes each code block reproducible
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)
fit1 = train(season ~ ., data=data_train, method="rpart", trControl = fitControl, na.action = na.pass, tuneLength = 10)
print(fit1$results)
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.00000000 0.9733450 0.9608083 0.01419000 0.02099063
## 2 0.02135562 0.9727480 0.9598555 0.01463699 0.02178684
## 3 0.04271123 0.9248201 0.8900274 0.02820307 0.04147341
## 4 0.06406685 0.8921723 0.8424778 0.02326732 0.03397198
## 5 0.08542247 0.8921723 0.8424778 0.02326732 0.03397198
## 6 0.10677809 0.8921723 0.8424778 0.02326732 0.03397198
## 7 0.12813370 0.8921723 0.8424778 0.02326732 0.03397198
## 8 0.14948932 0.8921723 0.8424778 0.02326732 0.03397198
## 9 0.17084494 0.8921723 0.8424778 0.02326732 0.03397198
## 10 0.19220056 0.6630268 0.4042284 0.19975184 0.40034135
This printout shows us model performance for different values of the tuning parameter cp. It seems that for a small cp, we get a close-to-perfect model. Let’s take a look at this model. We could use the regular plot function, but the resulting tree looks ugly. The prp function from the rpart.plot package makes a much nicer tree (other packages that plot nice trees exist).
rpart.plot::prp(fit1$finalModel) #one way to plot the final tree with prp; the exact call used originally may differ
ww=17.8/2.54; wh=ww; #for saving plot
dev.print(device=png,width=ww,height=wh,units="in",res=600,file="rparttree.png") #save tree to file
## quartz_off_screen
## 2
So the model splits on month, repeatedly. And it also splits on the hemisphere, which makes sense since the seasons are switched in the Southern hemisphere. This is also likely the reason why the model with month only didn’t produce a perfect fit. It could get one hemisphere right, but not both. With both bits of information, we can get almost perfect predictions.
Note two distinguishing features of this tree model: Even though we fed the algorithm the full model including all predictor variables, some - in fact, almost all - of them are not part of the final model. This is a good illustration of the feature/variable selection property that a tree does automatically. If it doesn’t find a certain predictor variable useful, it leaves it out.
Also, note that some predictor variables are used more than once. In fact, in this example, only 2 variables are used, and they are used repeatedly.
You can also see on each node the predictions for each outcome category at that step. It is always possible to build a tree that predicts perfectly, but that would lead to overfitting. That’s why we use cross-validation and tuning of the cp parameter, which reduces the risk of overfitting.
Finally, note that by default, levels are ordered alphabetically (Fall/Spring/Summer/Winter). We could re-order the factor manually in the more logical Spring/Summer/Fall/Winter order. I would do that for a ‘real’ analysis where I want to show a nice final result, but here we can leave it as is, as long as we pay attention to this fact.
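If you did want to re-order the levels, a forcats one-liner (not run here) would do it:
d$season <- forcats::fct_relevel(d$season, "Spring", "Summer", "Fall", "Winter") #logical season order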
So the above produces an almost perfect fit, and thus more powerful models are not really needed. However, using start (and end) month information seems a bit like cheating. Of course, if we give the model that kind of information, it will do well. Let’s make it harder and remove those 2 variables from both the training and test sets.
Also, further below I had problems fitting some of the models; the NA values caused trouble. The issue often is that while the underlying algorithm can handle missing values (as you saw above), things break down when using wrappers like caret. I could either skip caret and access the different models directly, or drop the observations with missing values. For now, I decided to go the latter route and drop those NA observations here, so that the different models below can be compared on the same data. Since the data changed, we also need to re-do the null model computation to be able to compare properly.
#write code that removes StartMonth and EndMonth from both training and test sets
# then drop all observations with missing values in both train and test sets
# copy and paste the code from above that computes performance of a null model
data_train <- data_train %>%
select(-(c(StartMonth, EndMonth))) %>%
na.omit()
data_test <- data_test %>%
select(-c(StartMonth, EndMonth)) %>%
na.omit()
season_counts <- table(data_train$season) #counts per season in the reduced training data (avoid naming this "train", which masks caret::train)
null_model_2 <- max(season_counts) / nrow(data_train) #null model: always predict the most common season
null_model_2
## [1] 0.4594595
You should find a very similar null-model accuracy, around 0.46.
Now, let’s re-do the fit above fitting a single tree. I’ll increase the tuneLength value a bit so the algorithm can try a few more parameter values.
#copy and paste the code from above that fits the single tree. set tuneLength to 20.
# look at model performance for different cp values. Also plot the tree.
set.seed(1111) #makes each code block reproducible
outcomename = "season"
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5) #setting CV method for caret
Npred <- ncol(data_train)-1 # number of predictors
resultmat <- data.frame(Variable = names(data_train)[-1], Accuracy = rep(0,Npred)) #store performance for each variable
for (n in 2:ncol(data_train)) #loop over each predictor. For this to work, outcome must be in 1st column
{
fit1_clean <- train( as.formula(paste(outcomename, "~",names(data_train)[n])) , data = data_train, method = "rpart", trControl = fitControl, na.action = na.pass, tuneLength = 20)
resultmat[n-1,2]= max(fit1_clean$results$Accuracy)
}
print(resultmat)
## Variable Accuracy
## 1 Action1 0.973345
## 2 CasesAll 0.973345
## 3 Country 0.973345
## 4 Deaths 0.973345
## 5 gg2c4 0.973345
## 6 Hemisphere 0.973345
## 7 Hospitalizations 0.973345
## 8 MeanD1 0.973345
## 9 MeanI1 0.973345
## 10 MedianD1 0.973345
## 11 MedianI1 0.973345
## 12 OBYear 0.973345
## 13 Path1 0.973345
## 14 RateAll 0.973345
## 15 RiskAll 0.973345
## 16 Trans1 0.973345
## 17 Vomit 0.973345
## 18 Setting 0.973345
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)
fit1_clean = train(season ~ ., data=data_train, method="rpart", trControl = fitControl, na.action = na.pass, tuneLength = 20)
print(fit1_clean$results)
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.000000000 0.4349627 0.14956438 0.04807618 0.07191776
## 2 0.001286550 0.4356835 0.14991145 0.04746441 0.07171892
## 3 0.002573099 0.4443585 0.15723013 0.04900668 0.07321439
## 4 0.003859649 0.4515689 0.16392256 0.04696232 0.06916031
## 5 0.005146199 0.4601951 0.16942460 0.04717645 0.06641151
## 6 0.006432749 0.4695162 0.17066011 0.04200015 0.05953566
## 7 0.007719298 0.4763374 0.17016788 0.03717521 0.05259268
## 8 0.009005848 0.4795676 0.16626058 0.03885965 0.05283925
## 9 0.010292398 0.4820901 0.16681684 0.03850084 0.05092273
## 10 0.011578947 0.4828598 0.15790524 0.03389696 0.05164283
## 11 0.012865497 0.4857592 0.14753931 0.03144745 0.05295581
## 12 0.014152047 0.4846942 0.14631266 0.02989002 0.04826591
## 13 0.015438596 0.4825221 0.13794529 0.02972887 0.05587380
## 14 0.016725146 0.4810643 0.12423535 0.02766639 0.06382933
## 15 0.018011696 0.4821357 0.12420558 0.02845738 0.06357778
## 16 0.019298246 0.4807041 0.11557474 0.02787746 0.07155800
## 17 0.020584795 0.4796327 0.10953188 0.02833090 0.07474337
## 18 0.021871345 0.4746200 0.09045855 0.02537425 0.07109261
## 19 0.023157895 0.4731785 0.08490650 0.02291050 0.07080341
## 20 0.024444444 0.4674353 0.06044875 0.01713106 0.06811923
rpart.plot::prp(fit1_clean$finalModel) #one way to plot the final tree with prp; the exact call used originally may differ
ww=17.8/2.54; wh=ww; #for saving plot
dev.print(device=png,width=ww,height=wh,units="in",res=600,file="rparttree.png") #save tree to file
## quartz_off_screen
## 2
With those variables removed, the tree model doesn’t perform very well. Accuracy is similar to the null model. You can see that in the tree figure, the final nodes (leaves) show a lot of mis-predictions. Note again that only a few variables are used to build the tree, and OBYear shows up more than once.
Let’s see if we can get improved performance with more complicated models. That is, of course, not guaranteed. If there is no “signal” in the data, it doesn’t matter how complicated the model is. We won’t get anything predictive.
Let’s try a random forest. We’ll use the ranger algorithm for this (it’s generally faster than the rf algorithm). This model has parameters that can and should be tuned. To do the tuning, we set up a grid of parameters and search over that grid. More efficient ways exist, e.g., doing an optimization using something like a genetic algorithm. The mlr package allows you to do that; caret doesn’t have those features out of the box, so you would need to write your own code for that.
Note that running the code below (and several models that follow) might take a few seconds or minutes, depending on the speed of your computer.
set.seed(1111) #makes each code block reproducible
tuning_grid <- expand.grid( .mtry = seq(1,7,by=1), .splitrule = "gini", .min.node.size = seq(2,8,by=1) )
fit2 = train(season ~ ., data=data_train, method="ranger", trControl = fitControl, tuneGrid = tuning_grid, na.action = na.pass)
We can’t plot a nice final tree anymore since the model is now a combination of trees. We can look at a plot of model performance as a function of the model tuning parameters.
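The plotting call isn’t shown; caret’s default plot method for the fitted object is one way to produce that figure:
plot(fit2) #cross-validated accuracy as a function of mtry and min.node.size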
This plot suggests that around 4 randomly selected predictors (mtry) and a minimum node size of 5 or 6 give the best model. Note that there isn’t much difference in model performance across the different values of the tuning parameters, and the overall model isn’t that great either, with an accuracy just shy of 0.5. (In my run, the best model was at 4 randomly selected predictors with a minimum node size of 5.)
Let’s now try a boosted regression tree ensemble. This method also has several parameters that need tuning, which are specified in gbmGrid.
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2), n.trees = 300, shrinkage = c(0.1, 0.01), n.minobsinnode = c(2,4,6))
fit3 = train(season ~ ., data=data_train, method="gbm", trControl = fitControl, verbose=FALSE, tuneGrid = gbmGrid)
We can again look at diagnostic fits and variable importance plots for this model, as well as check performance.
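The diagnostic code isn’t shown; a minimal sketch of those checks, using caret’s plot and varImp methods on fit3, would be:
plot(fit3) #accuracy across the gbm tuning grid
plot(varImp(fit3)) #variable importance for the boosted ensemble
max(fit3$results$Accuracy) #best cross-validated accuracy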
It doesn’t look like the boosted tree model performs any better. It may be that no information in the predictors strongly correlates with the season.
But let’s not give up yet, we’ll try a bit more. Let’s see if pre-processing helps.
Here, we’ll fit another random forest model but now use centering and scaling for the predictors.
# copy the random forest code from above. Add a statement to the train() function that centers and scales predictors.
# save the result as fit4. plot model results.
set.seed(1111) #makes each code block reproducible
tuning_grid <- expand.grid( .mtry = seq(1,7,by=1), .splitrule = "gini", .min.node.size = seq(2,8,by=1) )
fit4 = train(season ~ ., data=data_train, method="ranger", trControl = fitControl, tuneGrid = tuning_grid, na.action = na.pass, preProcess = c("center", "scale"))
plot(fit4)
It doesn’t look like it helped a lot.
Ok, let’s try one more. Since the trees don’t seem to work too well, and a logistic model doesn’t work for multiple categorical outcomes, let’s use another model that can do that, namely a discriminant analysis. We’ve already seen one of those previously. Let’s pick a different one. The caret package website lists a lot of algorithms. I don’t know how exactly they all differ. Let’s try the one called penalized discriminant analysis (pda) (penalized is another way of saying regularized, and we learned that regularization might sometimes help.)
#write code that trains a pda model, use tuneLength 20. Save as fit5 and plot.
set.seed(1111)
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)
fit5 = train(season ~ ., data=data_train, method="pda", trControl = fitControl, na.action = na.pass, tuneLength = 20)
plot(fit5)
Ok, that doesn’t look very good either. Alright, enough models, there just doesn’t seem to be much information in the data that could help predict season. So let’s start to wrap up.
To recap: After we removed the obvious predictors, we fitted 1 tree, 1 random forest, 1 gradient boosted tree, 1 random forest with processed predictors, and 1 discriminant analysis. Let’s compare the performance.
resamps <- resamples(list(tree = fit1,
RF1 = fit2,
GBM = fit3,
RF2 = fit4,
PDA = fit5))
bwplot(resamps)
This plot suggests that while none of the models are great, the random forest ones seem overall slightly better.
Even though we didn’t really have any good model, and thus further evaluating it seems in some sense pointless, for this exercise we’ll look at our “best” model anyway. We’ll declare the 1st random forest model (saved as fit2) to be the best, even though I’m not really seeing a meaningful difference between the two random forest models.
Model uncertainty in the predictions is shown in the previous plot.
For categorical outcomes, we can’t plot residuals. But we can look at the final “2x2” (in this case, a 4x4) table (the confusion matrix) and compare true versus predicted outcomes and see if it shows any pattern.
#Write code to get model predictions for the outcome on the training data.
#use predicted and actual outcomes, make a table and compute accuracy.
#fit2 is the RF1 model
prediction <- predict(fit2, newdata = data_train) #predicted classes; predict() already returns a factor for classification
data_train$prediction <- prediction
confusionMatrix(data = data_train$prediction, reference = data_train$season)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fall Spring Summer Winter
## Fall 79 3 3 1
## Spring 1 112 1 0
## Summer 0 0 64 0
## Winter 14 18 5 254
##
## Overall Statistics
##
## Accuracy : 0.9171
## 95% CI : (0.891, 0.9387)
## No Information Rate : 0.4595
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.876
##
## Mcnemar's Test P-Value : 6.346e-07
##
## Statistics by Class:
##
## Class: Fall Class: Spring Class: Summer Class: Winter
## Sensitivity 0.8404 0.8421 0.8767 0.9961
## Specificity 0.9848 0.9953 1.0000 0.8767
## Pos Pred Value 0.9186 0.9825 1.0000 0.8729
## Neg Pred Value 0.9680 0.9524 0.9817 0.9962
## Prevalence 0.1694 0.2396 0.1315 0.4595
## Detection Rate 0.1423 0.2018 0.1153 0.4577
## Detection Prevalence 0.1550 0.2054 0.1153 0.5243
## Balanced Accuracy 0.9126 0.9187 0.9384 0.9364
You should see that the model gets a lot right (high numbers on the diagonal), but predicts winter more often than it should (last row). Accuracy in the training data is pretty good.
But we know that performance on the training set is not that meaningful; with a complex model, we almost always get fairly good performance on the data used to build it. What matters is how the model performs on new data, so let’s check that. We can look at the same confusion matrix for the test set, which gives a better estimate of the true performance.
#copy and paste the code from above, but now do it for the test set.
prediction <- predict(fit2, newdata = data_test) #predicted classes for the test set
data_test$prediction <- prediction
confusionMatrix(data = data_test$prediction, reference = data_test$season)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fall Spring Summer Winter
## Fall 12 10 5 10
## Spring 5 7 6 11
## Summer 1 2 5 3
## Winter 23 35 14 91
##
## Overall Statistics
##
## Accuracy : 0.4792
## 95% CI : (0.4145, 0.5444)
## No Information Rate : 0.4792
## P-Value [Acc > NIR] : 0.5254
##
## Kappa : 0.1535
##
## Mcnemar's Test P-Value : 2.432e-05
##
## Statistics by Class:
##
## Class: Fall Class: Spring Class: Summer Class: Winter
## Sensitivity 0.2927 0.12963 0.16667 0.7913
## Specificity 0.8744 0.88172 0.97143 0.4240
## Pos Pred Value 0.3243 0.24138 0.45455 0.5583
## Neg Pred Value 0.8571 0.77725 0.89083 0.6883
## Prevalence 0.1708 0.22500 0.12500 0.4792
## Detection Rate 0.0500 0.02917 0.02083 0.3792
## Detection Prevalence 0.1542 0.12083 0.04583 0.6792
## Balanced Accuracy 0.5835 0.50568 0.56905 0.6077
Here, the confusion matrix doesn’t look good at all. Off-diagonal entries are larger than diagonal, accuracy is about as good as a null model, and agrees with the cross-validated results. That last point is good, that means our cross-validation routine properly estimates the performance of “new” data. Of course, even this estimate on test data is a bit optimistic, since real new data is usually collected under slightly different circumstances and with different protocols, making it more likely that model performance is reduced.
# shut down the parallel computing cluster we created at the beginning of the analysis
stopCluster(cl)
Overall, this suggests there is just no “signal” in the variables we are trying to use to predict the outcome, and the model fits to “noise” - thus doing ok on training data but failing on the test/cross-validated performance evaluation.
We can conclude that this is an analysis where we walk away “empty handed”. If this were a real analysis and you had data and did a ‘let’s see if something is going on’ analysis, this would be somewhat disappointing. If you had a specific, plausible hypothesis before you started your analysis (e.g., that outbreak size is correlated with season), finding that your biologically reasonable hypothesis doesn’t seem to be right is useful. Unfortunately, null results are still hard to publish.
Note that we didn’t use p-values here, but I bet we would have found several that are statistically significant when evaluated on the training data (as they usually are). But they would be meaningless and essentially “fitting the noise”. I contend that a lot of what’s published out there is exactly that, spurious p-values because models are evaluated on the training data.
Above, I kept search grids relatively small and specified other things (e.g., number of trees) to make sure things run reasonably fast. For a real project, I would make my searches and tuning more exhaustive, which might mean waiting a few hours for the code to run. If you want to, try to increase the search space or try different models to see if you can get a better performing model. It doesn’t seem to me that the data has any meaningful signal that would lead to a great model with much-improved performance. But you are welcome to see if you can find a model that performs better. If you do get a model that performs well (>0.6 accuracy), let me know. I’d be curious to see it.