Overview

This document will guide you through some data analysis tasks with a focus on performing variable selection. For this exercise, we consider a categorical outcome.

While this is in some sense a stand-alone analysis, I assume that you have worked through the Data Analysis exercise and are familiar with the dataset and everything we discovered during the cleaning process. We’ll use the same dataset here but focus on a different outcome. Other than that, work through this exercise as you did the Data Analysis exercise, namely by writing/completing the missing code.

Project setup

We need a variety of packages, which are loaded here. Install them as needed. If you use additional ones, load them here as well.

## Loading required package: ParamHelpers
## 
## Attaching package: 'mlr'
## The following object is masked from 'package:e1071':
## 
##     impute
## The following object is masked from 'package:caret':
## 
##     train

Data loading and cleaning

We will again use the Norovirus dataset.

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Author = col_character(),
##   EpiCurve = col_character(),
##   TDComment = col_character(),
##   AHComment = col_character(),
##   Trans1 = col_character(),
##   Trans2 = col_character(),
##   Trans2_O = col_character(),
##   Trans3 = col_character(),
##   Trans3_O = col_character(),
##   Vehicle_1 = col_character(),
##   Veh1 = col_character(),
##   Veh1_D_1 = col_character(),
##   Veh2 = col_character(),
##   Veh2_D_1 = col_character(),
##   Veh3 = col_character(),
##   Veh3_D_1 = col_character(),
##   PCRSect = col_character(),
##   OBYear = col_character(),
##   Hemisphere = col_character(),
##   season = col_character()
##   # ... with 44 more columns
## )
## See spec(...) for full column specifications.
## Warning: 2 parsing failures.
##  row col expected     actual           file
## 1022 CD  a double GGIIb      'norodata.csv'
## 1022 gge a double Sindlesham 'norodata.csv'

Looking at the outcome

For this analysis, our main outcome of interest is whether an outbreak was caused by the G2.4 strain of norovirus or not, and how other factors might be correlated with that strain.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Overall, it looks ok, with a decent amount in each category (i.e. no unbalanced data). However, we see an odd coding: there are only 2 types of entries, either "Yes" or "", i.e. an empty entry. The codebook tells us that this is how things were coded: if it was G2.4 it got a "Yes", if it wasn't it got an empty entry instead of a "No". That's somewhat strange, and while the analysis should work, it is better to re-code the empty entries as "No".
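One way to do the recoding could look like this (a minimal sketch, assuming the data frame is called `d` and the strain indicator column is `gg2c4` - it may be called GG2C4 in the raw data - so adjust names to match your own objects):

```r
# a sketch: turn empty entries into "No" and make the variable a factor
d$gg2c4[d$gg2c4 != "Yes"] <- "No"                    # empty entries become "No"
d$gg2c4 <- factor(d$gg2c4, levels = c("No", "Yes"))
print(d$gg2c4)
```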

##    [1] Yes No  Yes No  Yes No  No  No  Yes No  No  No  No  No  No  No  No 
##   [18] No  Yes No  Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No 
##   [35] Yes Yes Yes No  No  Yes No  No  No  No  No  No  No  No  No  No  Yes
##   [52] Yes No  No  No  Yes No  No  Yes No  No  No  No  No  Yes No  No  No 
##   [69] Yes No  No  Yes No  No  No  No  No  No  Yes Yes Yes Yes Yes No  No 
##   [86] No  No  Yes No  No  No  No  No  Yes Yes Yes No  Yes No  No  No  No 
##  [103] No  No  Yes Yes No  No  No  No  No  No  No  No  Yes Yes No  No  Yes
##  [120] Yes Yes No  Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No  No  No  No 
##  [137] No  No  No  No  No  No  Yes No  No  No  No  Yes Yes No  No  No  No 
##  [154] No  No  No  No  No  No  No  No  Yes Yes No  No  Yes No  No  No  No 
##  [171] No  Yes Yes Yes Yes No  No  No  No  No  No  No  No  Yes No  No  No 
##  [188] Yes Yes Yes Yes No  Yes Yes No  No  No  Yes Yes Yes No  Yes Yes No 
##  [205] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
##  [222] No  No  No  No  No  No  No  No  No  Yes No  No  Yes No  No  No  No 
##  [239] No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No  No 
##  [256] No  No  No  No  No  No  Yes No  Yes No  No  No  No  No  No  No  No 
##  [273] No  Yes No  No  No  No  No  Yes No  No  No  Yes No  No  No  No  No 
##  [290] Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes
##  [307] No  No  No  No  No  No  Yes No  Yes Yes No  No  No  No  No  No  No 
##  [324] No  No  No  No  No  No  No  No  No  No  No  Yes Yes No  No  No  No 
##  [341] No  Yes Yes No  No  Yes No  No  No  No  No  No  Yes Yes Yes Yes No 
##  [358] No  No  No  No  Yes Yes No  No  No  No  No  No  No  Yes No  Yes No 
##  [375] Yes No  Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No 
##  [392] No  Yes Yes Yes Yes No  No  No  Yes Yes No  No  No  Yes No  No  No 
##  [409] No  No  Yes No  No  No  Yes No  No  No  Yes No  No  No  Yes Yes Yes
##  [426] No  No  No  No  No  Yes No  No  Yes No  No  Yes Yes No  Yes No  No 
##  [443] No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No 
##  [460] Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
##  [477] No  No  No  No  Yes No  No  No  No  Yes Yes Yes Yes Yes Yes No  No 
##  [494] No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  Yes
##  [511] No  No  No  No  No  Yes No  Yes No  No  No  Yes No  No  No  No  No 
##  [528] No  No  No  No  No  No  No  No  No  No  Yes Yes No  Yes Yes Yes Yes
##  [545] Yes Yes Yes Yes No  No  No  No  Yes No  Yes Yes Yes No  No  No  Yes
##  [562] Yes Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
##  [579] No  Yes No  No  No  No  No  No  No  No  No  Yes No  No  No  Yes No 
##  [596] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
##  [613] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
##  [630] No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes Yes No 
##  [647] No  No  No  Yes No  Yes No  No  Yes Yes Yes Yes No  Yes Yes No  No 
##  [664] Yes Yes No  No  No  No  No  Yes No  No  No  No  No  No  No  No  Yes
##  [681] No  No  Yes No  No  Yes No  No  Yes Yes No  No  No  Yes No  No  No 
##  [698] No  No  No  No  Yes No  No  No  No  No  Yes No  Yes No  Yes Yes No 
##  [715] No  No  Yes No  No  No  No  No  No  Yes No  No  No  Yes No  No  No 
##  [732] Yes Yes Yes No  No  Yes No  Yes Yes No  Yes Yes Yes Yes Yes Yes Yes
##  [749] Yes Yes Yes No  Yes Yes No  No  No  No  Yes Yes Yes No  Yes Yes Yes
##  [766] No  Yes No  Yes Yes Yes No  Yes No  No  Yes No  Yes No  No  No  No 
##  [783] No  No  No  No  No  No  No  No  Yes No  No  Yes Yes Yes No  No  No 
##  [800] No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No 
##  [817] No  No  No  No  No  No  No  No  No  No  No  Yes No  No  Yes Yes No 
##  [834] No  No  No  Yes No  Yes No  No  No  Yes Yes Yes No  Yes No  Yes Yes
##  [851] No  Yes Yes No  Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No  No  No 
##  [868] No  No  No  No  No  No  No  Yes No  Yes Yes Yes No  Yes No  Yes Yes
##  [885] Yes Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes
##  [902] Yes Yes Yes No  No  No  No  No  No  No  No  No  Yes Yes Yes Yes Yes
##  [919] No  Yes Yes No  Yes Yes Yes Yes Yes Yes Yes No  Yes No  Yes Yes No 
##  [936] Yes Yes No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No 
##  [953] No  No  No  No  No  No  No  No  No  No  Yes No  No  Yes No  No  No 
##  [970] No  No  No  No  No  No  No  Yes No  Yes Yes Yes Yes Yes Yes Yes Yes
##  [987] Yes Yes No  No  No  No  No  No  No  No  Yes Yes Yes Yes Yes Yes No 
## [1004] Yes Yes Yes Yes Yes No  No  No  No  No  No  No  Yes No  Yes No  Yes
## [1021] Yes Yes
## Levels: No Yes

Selecting predictors

We will pick similar variables as previously, with some adjustments based on what we learned before. Keep the following variables: Action1, CasesAll, Country, Deaths, GG2C4, Hemisphere, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, OBYear, Path1, RiskAll, Season, Setting, Trans1, Vomit.

## [1] 1022   18

Cleaning predictors

We’ll likely have to perform similar cleaning steps as before. One difference is that earlier we had to drop about half of the observations because no outcome was available; here, we have an outcome for every outbreak. Thus, there is more data to clean, which - as we will see - introduces a few issues we didn’t have before, since previously we dropped the troublesome observations early in the process.

Let’s first check for missing values.

##          Action1         CasesAll          Country           Deaths 
##                0                7                0               43 
##            gg2c4       Hemisphere Hospitalizations           MeanD1 
##                0                0               43                0 
##           MeanI1         MedianD1         MedianI1           OBYear 
##                0                0                0                0 
##            Path1          RiskAll           season        Setting_1 
##                0              120               67                0 
##           Trans1            Vomit 
##                0                1

Looks like we again have some missing values in Hospitalizations and Deaths, and some more in RiskAll. Since the amount of missingness is not excessive, and to make our life easier, we’ll drop those observations for now. Note, however, the ‘blocks’ of missing values for RiskAll. Given that these outbreaks should be entered fairly randomly into the spreadsheet, it is strange to see the NAs show up in such blocks. For a real data analysis, it would be worth looking closer and checking why there is clustering like that.

Let’s make sure everything has the right format (numeric/integer/factor). Adjust/recode variables as needed. You will likely find that as you convert OBYear to numeric, something doesn’t quite work. Take a look. Fix by removing the observation with the troublesome entry, then convert to numeric. Finally, remove the observations that have 0 as OBYear - there is more than one now.
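One way to do this (a sketch, assuming the data frame is called `d`; here the conversion to numeric is done first, which turns the troublesome entry into an NA that then gets dropped - this matches the coercion warning in the output below):

```r
str(d)                                            # OBYear is still a factor at this point
d$OBYear <- as.numeric(as.character(d$OBYear))    # the odd entry becomes NA (hence the warning)
d <- d[!is.na(d$OBYear) & d$OBYear != 0, ]        # drop that observation and those with year 0
d$Vomit <- factor(d$Vomit)                        # Vomit is really a categorical variable
str(d)
```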

## 'data.frame':    859 obs. of  18 variables:
##  $ Action1         : Factor w/ 4 levels "No","Unknown",..: 3 3 3 3 3 3 3 3 3 4 ...
##  $ CasesAll        : num  15 65 27 4 15 6 40 10 116 45 ...
##  $ Country         : Factor w/ 22 levels "Australia","Austria",..: 12 22 17 17 17 17 17 17 17 22 ...
##  $ Deaths          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gg2c4           : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 1 1 2 1 ...
##  $ Hemisphere      : Factor w/ 3 levels "Northern","Southern",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hospitalizations: num  0 0 0 0 0 0 0 0 5 10 ...
##  $ MeanD1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MeanI1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MedianD1        : num  0 36 0 0 0 0 0 0 0 48 ...
##  $ MedianI1        : num  0 37 0 0 0 0 0 0 0 31 ...
##  $ OBYear          : Factor w/ 23 levels "0","1983","1990",..: 11 10 19 19 19 19 19 19 17 5 ...
##  $ Path1           : Factor w/ 4 levels "No","Unknown",..: 1 1 3 3 3 3 3 3 1 3 ...
##  $ RiskAll         : num  0 108 130 4 25 ...
##  $ season          : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Setting_1       : Factor w/ 387 levels "0","2 communities near Apalachicola Bay",..: 119 30 37 309 37 351 37 309 199 1 ...
##  $ Trans1          : Factor w/ 6 levels "Environmental",..: 5 2 2 2 2 2 2 2 5 2 ...
##  $ Vomit           : num  1 1 1 1 1 1 1 1 1 1 ...
## Warning: NAs introduced by coercion

## 'data.frame':    853 obs. of  18 variables:
##  $ Action1         : Factor w/ 4 levels "No","Unknown",..: 3 3 3 3 3 3 3 3 3 4 ...
##  $ CasesAll        : num  15 65 27 4 15 6 40 10 116 45 ...
##  $ Country         : Factor w/ 22 levels "Australia","Austria",..: 12 22 17 17 17 17 17 17 17 22 ...
##  $ Deaths          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gg2c4           : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 1 1 1 2 1 ...
##  $ Hemisphere      : Factor w/ 3 levels "Northern","Southern",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hospitalizations: num  0 0 0 0 0 0 0 0 5 10 ...
##  $ MeanD1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MeanI1          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MedianD1        : num  0 36 0 0 0 0 0 0 0 48 ...
##  $ MedianI1        : num  0 37 0 0 0 0 0 0 0 31 ...
##  $ OBYear          : num  1999 1998 2006 2006 2006 ...
##  $ Path1           : Factor w/ 4 levels "No","Unknown",..: 1 1 3 3 3 3 3 3 1 3 ...
##  $ RiskAll         : num  0 108 130 4 25 ...
##  $ season          : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Setting_1       : Factor w/ 387 levels "0","2 communities near Apalachicola Bay",..: 119 30 37 309 37 351 37 309 199 1 ...
##  $ Trans1          : Factor w/ 6 levels "Environmental",..: 5 2 2 2 2 2 2 2 5 2 ...
##  $ Vomit           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Next, we remove the Unspecified entries in Hemisphere and recode Action1 and Path1 as described in the Data Analysis script, i.e. from Unknown to Unspecified. Also apply the same grouping of the Setting_1 variable into just Restaurant and Other. Again, remember that there are both restaurant and Restaurant values, so you need to fix that too.
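One possible way to do this (a sketch, assuming the data frame is called `d`; the forcats helpers are one option, base R works too):

```r
d <- d[d$Hemisphere != "Unspecified", ]                            # drop Unspecified hemisphere
d$Action1 <- forcats::fct_recode(d$Action1, Unspecified = "Unknown")
d$Path1   <- forcats::fct_recode(d$Path1,   Unspecified = "Unknown")
# group settings: anything mentioning "restaurant" (any capitalization) vs. everything else
d$Setting   <- factor(ifelse(grepl("restaurant", d$Setting_1, ignore.case = TRUE),
                             "Restaurant", "Other"))
d$Setting_1 <- NULL
d <- droplevels(d)
```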

## [1] Northern    Southern    Unspecified
## Levels: Northern Southern Unspecified
## [1] Northern Southern
## Levels: Northern Southern
## [1] "No"          "Unknown"     "Unspecified" "Yes"
## [1] "No"          "Unknown"     "Unspecified" "Yes"
## [1] "No"          "Unspecified" "Yes"
## [1] "No"          "Unspecified" "Yes"
##  Factor w/ 2 levels "Other","Restaurant": 1 1 1 2 1 2 1 2 1 1 ...

Data visualization

Next, let’s create a few plots showing the outcome and the predictors. For the continuous predictors, I suggest scatter/box/violinplots with the outcome on the x-axis.
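A possible sketch for one continuous predictor (assuming ggplot2 and a data frame `d`; repeat or loop over the other continuous variables):

```r
library(ggplot2)
# violin plot with jittered points, outcome on the x-axis
ggplot(d, aes(x = gg2c4, y = CasesAll)) +
  geom_violin() +
  geom_jitter(width = 0.1, alpha = 0.3)
```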

Things look ok, apart from the skew in the predictors we discussed previously.

Next, let’s create plots for the categorical variables. You can, for instance, use geom_count, or some other representation. If you prefer lots of tables, that’s ok too.
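A possible sketch for one categorical predictor (again assuming ggplot2 and the data frame `d`):

```r
library(ggplot2)
# counts of outcome/predictor combinations; repeat for the other factor variables
ggplot(d, aes(x = gg2c4, y = Country)) + geom_count()
```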

## Warning: attributes are not identical across measure variables;
## they will be dropped

You should see from plots or tables that some of the categories are small, e.g. for Action1, the "No" category is very small. Factor levels with very few entries create problems during cross-validation (a level can end up in the holdout set without ever appearing in the fitting set, and the fitted model then cannot handle it). So let’s look at those factor variables a bit closer and fix them as needed.

Note that I did not initially get the results described here: I had interpreted the earlier step where we dropped NAs to mean that we drop ALL NAs, not just those in the three variables mentioned above.

## 
## Northern Southern 
##      799       53
## 
##          No Unspecified         Yes 
##         197         615          40
## 
##   Fall Spring Summer Winter 
##    135    187    103    371
## 
##      Other Restaurant 
##        671        181
## 
##    Environmental        Foodborne Person to Person          Unknown 
##               11              339              119               61 
##      Unspecified       Waterborne 
##              261               61
## 
##          No Unspecified         Yes 
##           1         679         172
## 
##       Japan    Multiple       Other Unspecified         USA 
##         316          16         410          12          98
##  Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## [1] "Fall"   "Spring" "Summer" "Winter"

You should see from your explorations above that there is only a single entry for No in Action1, and small numbers of entries for Multiple and Unspecified in Country and for Environmental in Trans1. The single No should be fixed; the other somewhat small groupings might be ok. Whether you should modify/group them depends on the scientific rationale and the method of analysis. We’ll do so here.

Change things as follows: Remove the observation for which Action1 is No, and combine Countries into 3 groups, Japan/USA/Other. Since I don’t even know what, biologically, the difference is between Environmental and Waterborne transmission (they seem like the same route to me based on my norovirus knowledge), we’ll move Waterborne into the Environmental category. I also noticed there is a blank category for season; that likely means it wasn’t stated in the paper, so let’s recode it as Unknown. Finally, re-order the data such that the outcome is the first column and again remove empty factor levels. Then look at your resulting data frame.
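One way these steps could look (a sketch, assuming the data frame is called `d`; the forcats helpers used here are one option among several):

```r
d <- d[d$Action1 != "No", ]                                    # drop the single "No" observation
d$Country <- forcats::fct_other(d$Country, keep = c("Japan", "USA"), other_level = "Other")
d$Trans1  <- forcats::fct_collapse(d$Trans1, Environmental = c("Environmental", "Waterborne"))
d$season  <- forcats::fct_explicit_na(d$season, na_level = "Unknown")   # missing season -> Unknown
d <- droplevels(d)
d <- d[, c("gg2c4", setdiff(names(d), "gg2c4"))]               # outcome into the first column
str(d)
```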

## [1] "Japan"       "Multiple"    "Other"       "Unspecified" "USA"
## Warning: Unknown levels in `f`: Australia, Austria, Brazil, Ca da, Chi,
## Croatia, Denmark, France, Iraq, Israel, Italy, Netherlands, New Zealand,
## Norway, Scotland, Spain, UK
## [1] "Japan" "Other" "USA"
## [1] "Environmental"    "Foodborne"        "Person to Person"
## [4] "Unknown"          "Unspecified"      "Waterborne"
## [1] "Environmental"    "Foodborne"        "Person to Person"
## [4] "Unspecified"
## [1] "Fall"   "Spring" "Summer" "Winter"
## [1] Fall   Spring Summer Winter <NA>  
## Levels: Fall Spring Summer Winter
## [1] Fall    Spring  Summer  Winter  Unknown
## Levels: Fall Spring Summer Winter Unknown
## [1] "Fall"    "Spring"  "Summer"  "Winter"  "Unknown"

At this step, you should have a dataframe containing 850 observations, and 18 variables: 1 outcome, 9 numeric/integer predictors, and 8 factor variables. There should be no missing values. The outcome, gg2c4, should be in the 1st slot.

Data splitting

We could do data splitting again as we did in the previous exercise, to have a final test set. But since you saw how it works there, we skip it here and use the full data to fit our models. We’ll still use cross-validation to get a more honest estimate of model performance. For a real data analysis, the choice of whether to keep some data for a final test depends on your goals: if your focus is predictive performance, you should consider such a split; if your focus is inference or exploratory analysis, you might skip it.

Model fitting

So I had planned to use caret exclusively for this course. Last time I tried feature/subset selection with caret, I found it buggy and not too well documented, and I had hoped this had improved since. Unfortunately, I was again not really able to get things to work. So even though I said we likely wouldn’t use the mlr package, I decided that to practice feature/subset selection properly, we need to do so. It’s not a bad idea to get familiar with that package; at times caret can do things mlr can’t, and the reverse. So knowing how to use both is good. We’ll thus use mlr for our model fitting in this exercise.

Parallelization

mlr allows you to run things in parallel by using multiple cores (caret does too). For instance, if you do 5-fold cross-validation repeated 5 times, you essentially run a very similar piece of code 25 times. Normally, you would run them one at a time. But if you have a machine with 25 cores, all 25 could run at the same time, giving you a speed-up of (close to) 25; there is usually some overhead related to doing things in parallel. This kind of parallel computing is sometimes called embarrassingly parallel, because it is so embarrassingly simple to split the task into parallel streams.

Since doing the subset selection below starts getting slow, and because it’s a good topic to know about, we are going to use parallelization here. mlr uses the package parallelMap for this. All you need to do is specify the number of cores/processors you want to use and start the parallel system, and mlr then automatically does things in parallel if feasible.
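A minimal sketch of the parallel setup (the core count is an assumption; pick what your machine has available):

```r
library(parallelMap)
ncpu <- 4                       # number of cores to use, adjust as needed
parallelStartSocket(cpus = ncpu)
# ... mlr resampling / feature selection code runs in parallel here ...
parallelStop()
```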

Setup

Some setup settings that are used in various code chunks below.
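As a sketch, these are the kinds of settings I mean (the specific values are assumptions, adjust as needed):

```r
rngseed <- 123    # seed for reproducibility
nfolds  <- 5      # number of cross-validation folds
nreps   <- 5      # number of cross-validation repeats
set.seed(rngseed)
```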

A null model

To define a null model, we need to determine what performance measure we want to track. As mentioned in the course materials, there are different performance measures. Accuracy (or, equivalently, misclassification error) is simple: it just counts the fraction of times the model got it right (or wrong). We’ll start with that one, and then try another one later. mlr allows for a lot of different performance measures for both categorical and continuous outcomes; see here and here.

For accuracy, the simplest null model always predicts the most frequent category. We can use that as baseline performance.
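A sketch of the null-model computation (assuming the cleaned data frame `d` with outcome `gg2c4`):

```r
# the null model always predicts the most frequent category ("No"),
# so its accuracy is the proportion of "No" outbreaks in the data
table(d$gg2c4)
max(table(d$gg2c4)) / nrow(d)
```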

## 
##  No Yes 
## 591 259
## [1] 0.6952941

You should find that the null model has an accuracy of around 0.69.

Single predictor models

Now let’s consider single predictor models, i.e. we’ll fit the outcome to each predictor one at a time to get an idea of the importance of individual predictors. To evaluate our model performance, we will use cross-validation. Since our outcome is categorical, we’ll use a logistic model.
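A rough sketch of how such a loop could look with mlr (object names like `d` and the 5x5 repeated cross-validation setup are assumptions; adjust to your own code):

```r
library(mlr)
set.seed(123)
learner <- makeLearner("classif.logreg", predict.type = "prob")
rdesc   <- makeResampleDesc("RepCV", folds = 5, reps = 5, stratify = TRUE)

# fit one cross-validated logistic model per predictor and record accuracy
predictors <- setdiff(names(d), "gg2c4")
unifit <- data.frame(variable = predictors, Accuracy = NA)
for (i in seq_along(predictors)) {
  task1 <- makeClassifTask(data = d[, c("gg2c4", predictors[i])], target = "gg2c4")
  res1  <- resample(learner, task1, rdesc, measures = acc, show.info = FALSE)
  unifit$Accuracy[i] <- res1$aggr
}
unifit
```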

variable Accuracy
Action1 0.6952941
CasesAll 0.6941176
Country 0.6952941
Deaths 0.6976471
Hemisphere 0.6952941
Hospitalizations 0.6943529
MeanD1 0.6952941
MeanI1 0.6952941
MedianD1 0.6952941
MedianI1 0.6952941
OBYear 0.6950588
Path1 0.6952941
RiskAll 0.6957647
season 0.6934118
Trans1 0.6952941
Vomit 0.6952941
Setting 0.6952941

So it looks like none of the single-predictor models has a higher accuracy than the null model. Maybe surprising. We’ll get back to that below.

Full model

Now let’s fit a full logistic model with all predictors.
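A sketch of the full-model fit, reusing the learner and resampling description from the single-predictor sketch above:

```r
# full logistic model with all predictors, evaluated with repeated cross-validation
task_full <- makeClassifTask(data = d, target = "gg2c4")
res_full  <- resample(learner, task_full, rdesc, measures = acc, show.info = FALSE)
res_full$aggr
```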

## acc.test.mean 
##     0.7084706

So a very small improvement over the simple models.

Now let’s do subset selection. The code below does it several ways. It does regular forward and backward selection and floating versions of those. It also uses a genetic algorithm for selection. See the mlr website here for details.

Also note that I included a kind of timer in the code, to see how long things take. That’s a good idea if you run bigger models. You first run a few iterations on maybe a few cores; then you can compute how long it would take if you doubled the iterations, or doubled the cores, etc. This prevents the bad surprise of starting a piece of code you expect to finish quickly and then waiting for hours. You should always use short runs to make sure everything works in principle, and only at the end do the real, long “production” runs. Otherwise you might waste hours/days/weeks waiting for results, only to realize that you made a basic mistake and have to do it all over.
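A sketch of what the subset selection runs could look like, reusing `task_full`, `learner`, and `rdesc` from above (the GA’s `maxit` is kept small here so the chunk finishes in minutes; that value is an assumption):

```r
tstart <- proc.time()

seq_methods <- c("sbs", "sfbs", "sfs", "sffs")
selres <- list()
for (m in seq_methods) {
  print(paste("doing subset selection with method", m))
  ctrl <- makeFeatSelControlSequential(method = m)
  selres[[m]] <- selectFeatures(learner, task_full, resampling = rdesc,
                                measures = acc, control = ctrl, show.info = FALSE)
}

print("doing subset selection with genetic algorithm")
ctrl_ga <- makeFeatSelControlGA(maxit = 10)      # small maxit for a quick test run
selres[["GA"]] <- selectFeatures(learner, task_full, resampling = rdesc,
                                 measures = acc, control = ctrl_ga, show.info = FALSE)

runtime <- (proc.time() - tstart)["elapsed"] / 60
print(paste("subset selection took", runtime, "minutes"))
# each element of selres contains $x (selected features) and $y (performance)
```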

## [1] "doing subset selection with method sbs "
## [1] "doing subset selection with method sfbs "
## [1] "doing subset selection with method sfs "
## [1] "doing subset selection with method sffs "
## [1] "doing subset selection with genetic algorithm"
## [1] "subset selection took 2.358467 minutes"
method Accuracy Model
sbs 0.7261176 Action1, Country, Deaths, Hemisphere, OBYear, RiskAll, season, Trans1, Vomit, Setting
sfbs 0.7155294 Action1, Country, Deaths, OBYear, season, Trans1, Setting
sfs 0.6952941
sffs 0.6952941
GA 0.7263529 Action1, Country, Deaths, Hemisphere, OBYear, Path1, RiskAll, season, Trans1, Vomit, Setting

So we find that some of the sub-models perform somewhat better. Note that the different methods find different sub-models, and the forward selection methods seem to fail (since none of the single-predictor models is better than the null model, they stop there). Also note that the genetic algorithm was only allowed to run for a short time (the number of iterations is small). To get potentially better results, that should be increased to a larger number, e.g. 1000 or more. But this will take longer, unless you have many cores available when you parallelize.

Different performance measures

So far, we found that some of the sub-models provide a minor improvement over the simple null or single-predictor models or the full model. However, the improvement is not large.

We could try tweaking further, e.g. by pre-processing the predictors (which is a good idea to try, but which we won’t do here) or testing a different model (which we’ll do below).

For now, I want to discuss performance measures a bit more. The problem is that logistic regression, as well as many (though not all) algorithms used for classification (predicting categorical outcomes), returns probabilities for an outcome to belong to a certain category. Those are then turned into Yes/No predictions (in the case of 2 outcomes, as we have here) based on some cut-off. By default, a probability value of 0.5 is chosen: if the model predicts a probability of 0.5 or greater for the outcome being “Yes”, it is scored as such in the prediction, otherwise as “No”. However, this 0.5 threshold is not always good; we might get a much better model if we use a different threshold. There is a technique that scans over different thresholds and evaluates model performance for each; this is generally summarized by the receiver operating characteristic (ROC) curve. If this concept is new to you, check out this article on Wikipedia or this tutorial on the mlr website.

Using this approach, a model is evaluated based on the Area under the curve (AUC), with an AUC=0.5 meaning a model is not better than chance, and AUC=1 meaning a perfect model. Let’s re-do the analysis above, but replace accuracy with AUC.

Model performance with AUC
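One way to get the null model’s AUC is with the pROC package (a sketch; since the null model assigns every outbreak the same predicted probability, its ROC curve is the diagonal):

```r
library(pROC)
# the null model predicts the same probability for every outbreak
nullprob <- rep(mean(d$gg2c4 == "Yes"), nrow(d))
roc(d$gg2c4, nullprob)
```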

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
## Area under the curve: 0.5

You should find that the null model has an AUC of 0.5, as expected for a “no information” model.

Now we run the single-predictor models again.

variable AUC
Action1 0.5154141
CasesAll 0.4649312
Country 0.5782066
Deaths 0.5160717
Hemisphere 0.5090307
Hospitalizations 0.5011048
MeanD1 0.4952809
MeanI1 0.4995563
MedianD1 0.5171011
MedianI1 0.5285258
OBYear 0.6015035
Path1 0.5263330
RiskAll 0.5715641
season 0.5879504
Trans1 0.5769839
Vomit 0.5297235
Setting 0.5641744

Looks like there are a few single-predictor models with better performance than the null model.

Now let’s again fit a full model with all predictors.

## auc.test.mean 
##     0.6752942

This is better than any single-predictor models. Let’s see if a sub-model can do even better.

## [1] "doing subset selection with method sbs "
## [1] "doing subset selection with method sfbs "
## [1] "doing subset selection with method sfs "
## [1] "doing subset selection with method sffs "
## [1] "doing subset selection with genetic algorithm"
## [1] "subset selection took 2.532800 minutes"
method AUC Model
sbs 0.6856703 Action1, Country, Deaths, MedianI1, OBYear, RiskAll, season, Setting
sfbs 0.6782245 Action1, Country, Deaths, MedianI1, OBYear, season, Setting
sfs 0.6555854 OBYear, season, Setting
sffs 0.6638263 Country, OBYear, season, Setting
GA 0.6839415 Action1, Country, Deaths, MedianI1, OBYear, RiskAll, season, Setting

Looks like we get some sub-models that are slightly better than the full model, at least as evaluated using the (cross-validated) AUC measure.

More on ROC/AUC

While this is a common measure and useful for evaluating how models perform at different cut-off levels, one needs to keep in mind that it doesn’t really measure the performance of a single model, but instead the combined performance of all models with different cut-offs. In practice, if we want to use a model, we have to pick one cut-off. For that purpose, one can look at the whole ROC curve; one then usually chooses the model in the “top-left” corner of the curve, which gives the best overall performance with a good balance between false positives (FP) and true positives (TP). Let’s plot that ROC curve for the model found by the GA.
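A sketch of how this can be done with mlr’s ROC helpers (here `ga_features` stands for the feature set the GA selected; the learner and `rdesc` are those defined above):

```r
# resample the GA-selected model and extract the ROC data
task_ga <- makeClassifTask(data = d[, c("gg2c4", ga_features)], target = "gg2c4")
res_ga  <- resample(learner, task_ga, rdesc, measures = auc, show.info = FALSE)
df <- generateThreshVsPerfData(res_ga, measures = list(fpr, tpr))
plotROCCurves(df)
```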

Based on this plot, an FP of around 0.5 and a TP of around 0.75 seem to produce a good model. Digging into the df object, you can find the raw data underlying the curve in df$data, where you will find

##          fpr       tpr threshold
## 29 0.4433164 0.7722008 0.2828283
## 30 0.4297800 0.7644788 0.2929293
## 31 0.4145516 0.7335907 0.3030303

This means for this model, if we were to use it we might want to set a threshold of around 0.3, i.e. everything above that probability would be predicted as positive/yes.

Trying a different model

Let’s see if we can find another model that might perform even better. Let’s look at a Linear Discriminant Analysis (LDA) model. See e.g. chapter 4 of ISLR for more on that model. We again consider AUC as our measure. The null-model performance is the same; we only need to re-do the remainder.

Let’s re-run the single-predictor models, followed by the full model and subset selection.
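A sketch of the only change needed: everything else (single-predictor loop, full model, subset selection) is re-run exactly as above with this learner in place of the logistic one.

```r
# LDA learner in mlr; classif.lda wraps MASS::lda, so MASS must be installed
learner_lda <- makeLearner("classif.lda", predict.type = "prob")
```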

variable AUC
Action1 0.5154141
CasesAll 0.4649312
Country 0.5782066
Deaths 0.5160717
Hemisphere 0.5090307
Hospitalizations 0.5011048
MeanD1 0.4952809
MeanI1 0.4995563
MedianD1 0.5171011
MedianI1 0.5285434
OBYear 0.6015035
Path1 0.5263330
RiskAll 0.5715641
season 0.5879504
Trans1 0.5769839
Vomit 0.5297235
Setting 0.5641744
## auc.test.mean 
##     0.6750281
## [1] "doing subset selection with method sbs "
## [1] "doing subset selection with method sfbs "
## [1] "doing subset selection with method sfs "
## [1] "doing subset selection with method sffs "
## [1] "doing subset selection with genetic algorithm"
## [1] "subset selection took 2.189400 minutes"
method AUC Model
sbs 0.6855919 Action1, Country, Deaths, MedianI1, OBYear, RiskAll, season, Setting
sfbs 0.6769966 Action1, Country, MedianI1, OBYear, RiskAll, season, Setting
sfs 0.6566321 OBYear, season, Setting
sffs 0.6654399 Country, OBYear, season, Setting
GA 0.6834578 Action1, Country, Deaths, MedianI1, OBYear, RiskAll, season, Setting

It looks like we again get some sub-models that are slightly better than the full model, at least as evaluated using the (cross-validated) AUC measure. There seems to be a little, but not much, difference in performance compared to the logistic model. We can take the GA model again (because it’s conveniently at the end, even if it’s not the best), look at its ROC, and compare the curves for the LDA and logistic models.

You should see from the plot that the LDA model performs very similarly to the logistic model.

Wrapping up

Based on the above, we could choose a model simply by best performance. However, we likely also still want to look at model prediction uncertainty and do some residual plots and other diagnostics for our chosen model before we finalize our choice. As was the case for caret, mlr also comes with lots of functions that make these - and many other - tasks easier. We’ll leave it at this for now, and might revisit some of those topics in further exercises.

Also, none of the models here are that great. We might want to think more about our initial premise, i.e. looking for associations between virus strain type and other variables, and what the scientific rationale is for expecting variation. Here, we just ran the models and looked at what we might find. That approach, sometimes called data exploration or data mining or - less charitably - a fishing expedition, is ok, but we need to be careful how we interpret and use the results. Going in with a clear hypothesis is usually better.

A few more comments

Things are getting a bit slow now; despite the multiple cores, we need minutes to run things. Also, code chunks are getting larger. At this point, it might be worth considering a structure where most of the code lives in one or several well-documented R scripts, which can be run independently. Those R scripts should save their results, and those results are then loaded here. The advantage is that if one wants to make changes to a small part of the analysis, one can modify and run just that part and update the whole document by re-knitting, without having to run all the code. For any big/serious analysis, I suggest such a setup that splits R code from RMarkdown files, at least for the computationally intensive parts.