Machine Learning with R - an mlr Starter
mlr is a fairly comprehensive machine learning package in R. Despite the many tutorials out there, I didn't find a handy yet comprehensive one that I could refer to for my common machine learning questions, so I set out to write one.
The report was created using R Markdown and published on rpubs.com, a free web publishing service sponsored by RStudio. The full report follows below.
SUMMARY
A brief walk-through of the mlr functions necessary for basic model tuning.
mlr vs caret
mlr can create more types of tasks: classification, regression, survival, clustering, multilabel, cost-sensitive, imbalanced data, functional data, and spatial data. caret only seems to cover classification, regression, and cost-sensitive tasks.
mlr ships with 71 performance measures that are easy to switch between; changing metrics in caret is more involved.
mlr supports more tuning methods, such as iterated F-racing and model-based optimization, while caret supports only random and grid search.
mlr also has better support for parallelization (see the sketch below).
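For instance, parallelizing mlr's resampling, tuning, or benchmarking takes only two extra lines via the parallelMap package; a minimal sketch (adjust cpus to your machine):
library(parallelMap)
parallelStartSocket(cpus = 2) # subsequent mlr resampling/tuning runs in parallel
# ... resample(), tuneParams(), or benchmark() calls go here ...
parallelStop()                # shut down the workers when done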
Preprocessing
Goal:
- Load data
- Quick preprocessing
data(BreastCancer, package = 'mlbench')
df <- BreastCancer
str(df)
## 'data.frame': 699 obs. of 11 variables:
## $ Id : chr "1000025" "1002945" "1015425" "1016277" ...
## $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
summary(df)
## Id Cl.thickness Cell.size Cell.shape
## Length:699 1 :145 1 :384 1 :353
## Class :character 5 :130 10 : 67 2 : 59
## Mode :character 3 :108 3 : 52 10 : 58
## 4 : 80 2 : 45 3 : 56
## 10 : 69 4 : 40 4 : 44
## 2 : 50 5 : 30 5 : 34
## (Other):117 (Other): 81 (Other): 95
## Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli
## 1 :407 2 :386 1 :402 2 :166 1 :443
## 2 : 58 3 : 72 10 :132 3 :165 10 : 61
## 3 : 58 4 : 48 2 : 30 1 :152 3 : 44
## 10 : 55 1 : 47 5 : 30 7 : 73 2 : 36
## 4 : 33 6 : 41 3 : 28 4 : 40 8 : 24
## 8 : 25 5 : 39 (Other): 61 5 : 34 6 : 22
## (Other): 63 (Other): 66 NA's : 16 (Other): 69 (Other): 69
## Mitoses Class
## 1 :579 benign :458
## 2 : 35 malignant:241
## 3 : 33
## 10 : 14
## 4 : 12
## 7 : 9
## (Other): 17
Looks like we have an ID column, which we should remove before modeling. We also have some NA values in the Bare.nuclei column; let's recode the NAs as a separate level.
df$Id = NULL # drop the ID column
# recode NAs in Bare.nuclei as their own ordered level, placed between 5 and 6
new_col <- as.numeric(df$Bare.nuclei) # levels "1".."10" map to codes 1..10
new_col[is.na(new_col)] <- 5.5
new_col <- factor(new_col, levels = c("1", "2", "3", "4", "5", "5.5", "6", "7", "8", "9", "10"), ordered = T)
table(new_col)
## new_col
## 1 2 3 4 5 5.5 6 7 8 9 10
## 402 30 28 19 30 16 4 8 21 9 132
df$Bare.nuclei <- new_col
df$Epith.c.size <- NULL
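As an aside, mlr's built-in impute() can handle this kind of constant imputation in one call; a minimal sketch on the raw data (the replacement level name "missing" is arbitrary, and unlike the recoding above it is not ordered between 5 and 6):
# replace NAs in Bare.nuclei with a new constant level
imp <- impute(BreastCancer, cols = list(Bare.nuclei = imputeConstant("missing")))
table(imp$data$Bare.nuclei)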
Other options:
- createDummyFeatures
- normalizeFeatures (center, scale, standardize, range)
- mergeSmallFactorLevels
- capLargeValues, dropFeatures, removeConstantFeatures, summarizeLevels
- interface to caret preprocessing …
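A quick sketch of two of these helpers on a toy data frame (they are not needed for our data, where every predictor is a factor):
toy <- data.frame(x = rnorm(10), g = factor(rep(c("a", "b"), 5)))
normalizeFeatures(toy, method = "standardize") # center and scale the numeric columns
createDummyFeatures(toy)                       # one-hot encode the factor columns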
Tasks
Goal:
- Set data (data)
- Set target (target)
- Set task ID (id)
bc_task = makeClassifTask(data = df, target = 'Class')
bc_task
## Supervised task: df
## Type: classif
## Target: Class
## Observations: 699
## Features:
## numerics factors ordered functionals
## 0 3 5 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## benign malignant
## 458 241
## Positive class: benign
bc_task = makeClassifTask(id = 'breast cancer classification',
data = df, target = 'Class',
positive = "malignant")
Notes:
- For classification the target column has to be a factor.
- makeClassifTask() by default selects the first factor level of the target variable as the positive class. Use the positive argument to change this when necessary.
Other tasks:
- makeRegrTask() for regression problems
- makeClassifTask() for binary and multi-class classification problems (class-dependent costs can be handled as well)
- makeSurvTask() for survival analysis
- makeClusterTask() for cluster analysis
- makeMultilabelTask() for multilabel classification problems
- makeCostSensTask() for general cost-sensitive classification (with example-specific costs)
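The constructors work analogously to makeClassifTask(); for example, a regression task (a quick sketch using the BostonHousing data from mlbench):
data(BostonHousing, package = "mlbench")
boston_task <- makeRegrTask(id = "boston housing", data = BostonHousing, target = "medv")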
Learners
Goal:
- Set package and method (class of learner, cl argument)
- Set prediction output type (predict.type)
- Set hyperparameters (par.vals)
- Set learner ID (id)
bc_learner = makeLearner(cl = "classif.ranger", # class of learner
predict.type = "prob", # prediction output type
par.vals = list(), # set hyperparameters
predict.threshold = NULL, # threshold for prediction
fix.factors.prediction = TRUE, # add a factor level for missing data in test
id = 'bc ranger') # learner ID
bc_learner
## Learner bc ranger from package ranger
## Type: classif
## Name: Random Forests; Short name: ranger
## Class: classif.ranger
## Properties: twoclass,multiclass,prob,numerics,factors,ordered,featimp,weights,oobpreds
## Predict-Type: prob
## Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=order
To see available learners for the particular task:
learners <- listLearners(bc_task, properties = 'prob', check.packages = FALSE)
head(learners[,c('class', 'package')])
## class package
## 1 classif.cforest party
## 2 classif.ctree party
## 3 classif.evtree evtree
## 4 classif.featureless mlr
## 5 classif.penalized penalized
## 6 classif.randomForest randomForest
Note that somehow this function is not returning all available options for me, so I'd recommend checking the full list of integrated learners in the mlr documentation.
A handy way to examine the available hyperparameters is the getParamSet() function:
getParamSet(bc_learner)
## Type len Def
## num.trees integer - 500
## mtry integer - -
## min.node.size integer - -
## replace logical - TRUE
## sample.fraction numeric - -
## split.select.weights numericvector <NA> -
## always.split.variables untyped - -
## respect.unordered.factors discrete - ignore
## importance discrete - none
## write.forest logical - TRUE
## scale.permutation.importance logical - FALSE
## num.threads integer - -
## save.memory logical - FALSE
## verbose logical - TRUE
## seed integer - -
## splitrule discrete - gini
## num.random.splits integer - 1
## keep.inbag logical - FALSE
## Constr Req Tunable Trafo
## num.trees 1 to Inf - TRUE -
## mtry 1 to Inf - TRUE -
## min.node.size 1 to Inf - TRUE -
## replace - - TRUE -
## sample.fraction 0 to 1 - TRUE -
## split.select.weights 0 to 1 - TRUE -
## always.split.variables - - TRUE -
## respect.unordered.factors ignore,order,partition - TRUE -
## importance none,impurity,permutation - FALSE -
## write.forest - - FALSE -
## scale.permutation.importance - Y FALSE -
## num.threads 1 to Inf - FALSE -
## save.memory - - FALSE -
## verbose - - FALSE -
## seed -Inf to Inf - FALSE -
## splitrule gini,extratrees - TRUE -
## num.random.splits 1 to Inf Y TRUE -
## keep.inbag - - FALSE -
List of properties:
- numerics: The method can cope with numerical predictors
- factors: The method can cope with factor predictors
- ordered: The method can cope with ordered factor predictors
- NAs: The method can deal with missing values in a meaningful way (other than simply removing observations with missing values)
- weights: The method supports observation weights
- prob: The method can predict probabilities
- oneclass, twoclass, multiclass: One-class, two-class (binary) or multi-class classification problems can be handled
- class.weights: Class weights can be handled
- featimp: The method can return feature importance
- oobpreds: The method can make out-of-bag predictions
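To query these properties programmatically (a quick sketch):
getLearnerProperties(bc_learner)                 # all properties of the learner
hasLearnerProperties(bc_learner, props = "prob") # TRUE if it can predict probabilities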
Train
Goal:
- Set learner
- Set task
- Set subset of data
- Set observation weights
bc_model <- train(learner = bc_learner,
task = bc_task)
bc_model
## Model for learner.id=bc ranger; learner.class=classif.ranger
## Trained on: task.id = breast cancer classification; obs = 699; features = 8
## Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=order
If the default setting of the learner is okay, you can skip makeLearner() and just start to train the model.
bc_model_default <- train(learner = 'classif.ranger',
task = bc_task)
bc_model_default
## Model for learner.id=classif.ranger; learner.class=classif.ranger
## Trained on: task.id = breast cancer classification; obs = 699; features = 8
## Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=order
To train only a subset, use the subset argument.
n <- nrow(df)
in_train <- sample.int(n, round(0.8*n)) # random 80% of the rows for training
in_test <- setdiff(1:n, in_train) # the remaining 20% for testing
bc_model <- train(learner = bc_learner,
task = bc_task,
subset = in_train)
To assign weights to observations, if supported by the learner. Use cases:
- reduce the influence of outliers
- increase the influence of certain data, e.g. more recent data
- incorporate misclassification costs
- account for class imbalance, giving both classes equal importance
target <- df$Class[in_train]
tab <- as.numeric(table(target)) # class counts in the training set
w <- 1/tab[target] # inverse-frequency weight per observation
bc_model <- train(learner = bc_learner,
task = bc_task,
subset = in_train,
weights = w)
Predict
Goal:
- make predictions
- adjust threshold
- visualize the prediction
To make a prediction:
bc_pred <- predict(object = bc_model,
task = bc_task,
subset = in_test)
bc_pred
## Prediction: 140 observations
## predict.type: prob
## threshold: benign=0.50,malignant=0.50
## time: 0.04
## id truth prob.benign prob.malignant response
## 4 4 benign 0.1196789 0.8803211018 malignant
## 9 9 benign 0.9138379 0.0861621203 benign
## 13 13 malignant 0.3075421 0.6924579365 malignant
## 21 21 malignant 0.1363444 0.8636555556 malignant
## 24 24 malignant 0.1332519 0.8667480859 malignant
## 28 28 benign 0.9995886 0.0004114003 benign
## ... (#rows: 140, #cols: 5)
To adjust threshold:
bc_pred_2 <- setThreshold(pred = bc_pred, threshold = 0.8) # threshold applies to the positive class (malignant)
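A quick way to see the effect of the stricter cutoff (a sketch):
performance(bc_pred_2, measures = list(fpr, fnr))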
To visualize the prediction:
# plotLearnerPrediction(learner = bc_learner, task = bc_task,
# features = c('Cell.shape', 'Marg.adhesion'))
lrn = makeLearner("classif.rpart", id = "CART")
plotLearnerPrediction(lrn, task = iris.task)
Figure note: The type of symbol shows the true class labels of the data points. Symbols with white border indicate misclassified observations. The posterior probabilities (if the learner under consideration supports this) are represented by the background color where higher saturation means larger probabilities.
Handy functions:
- getPredictionTruth(bc_pred)
- getPredictionResponse(bc_pred)
- getPredictionSE() for standard errors of regression predictions
- getPredictionProbabilities(bc_pred)
- calculateConfusionMatrix()
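For example (a quick sketch on our prediction object):
head(getPredictionProbabilities(bc_pred)) # posterior probabilities of the positive class
calculateConfusionMatrix(bc_pred)         # cross-tabulation of true vs. predicted classes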
Performance
Goal:
- access performance measures
- plot performance
To list the available measures:
## list measures
listMeasures("classif", properties = "classif.multi")
## [1] "featperc" "mmce" "lsr"
## [4] "bac" "qsr" "timeboth"
## [7] "multiclass.aunp" "timetrain" "multiclass.aunu"
## [10] "ber" "timepredict" "multiclass.brier"
## [13] "ssr" "acc" "logloss"
## [16] "wkappa" "multiclass.au1p" "multiclass.au1u"
## [19] "kappa"
Task-specific measures:
listMeasures(bc_task)
## [1] "tnr" "tpr" "featperc"
## [4] "f1" "mmce" "mcc"
## [7] "brier.scaled" "lsr" "bac"
## [10] "fn" "fp" "fnr"
## [13] "qsr" "fpr" "npv"
## [16] "brier" "auc" "timeboth"
## [19] "multiclass.aunp" "timetrain" "multiclass.aunu"
## [22] "ber" "timepredict" "multiclass.brier"
## [25] "ssr" "ppv" "acc"
## [28] "logloss" "wkappa" "tn"
## [31] "tp" "multiclass.au1p" "multiclass.au1u"
## [34] "fdr" "kappa" "gpr"
## [37] "gmean"
To compute performance measures on a prediction:
performance(pred = bc_pred, measures = list(f1, fpr, auc, acc))
## f1 fpr auc acc
## 0.96226415 0.03409091 0.99125874 0.97142857
To plot performance vs threshold:
d <- generateThreshVsPerfData(bc_pred, measures = list(fpr, fnr))
plotThreshVsPerf(d)
Resampling
Goal:
- set resampling strategy, e.g., cross-validation, holdout
- set resampling instance
To set resampling:
# create a resampling strategy; note that an (unstratified) cv5 is also pre-defined in mlr
cv5 <- makeResampleDesc(method = 'CV', iters = 5, stratify = T)
bc_cv5 <- resample(learner = bc_learner,
task = bc_task,
resampling = cv5,
measures = auc,
models = T) # to keep the model
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.9954212
## [Resample] iter 2: 0.9917582
## [Resample] iter 3: 0.9907156
## [Resample] iter 4: 0.9993207
## [Resample] iter 5: 0.9811446
##
## Aggregated Result: auc.test.mean=0.9916721
##
To access the resampling results:
bc_cv5$aggr
## auc.test.mean
## 0.9916721
bc_cv5$measures.test
## iter auc
## 1 1 0.9954212
## 2 2 0.9917582
## 3 3 0.9907156
## 4 4 0.9993207
## 5 5 0.9811446
To set resampling instance:
bc_cv5_in <- makeResampleInstance(desc = cv5, task = bc_task)
Resampling instances make it easy to reproduce results and to compare models: different models run on the same instance see exactly the same splits of the data, which is fairer than comparing across different splits.
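For example, two learners evaluated on exactly the same folds (a quick sketch):
r1 <- resample(learner = bc_learner, task = bc_task,
               resampling = bc_cv5_in, measures = auc)
r2 <- resample(learner = 'classif.rpart', task = bc_task,
               resampling = bc_cv5_in, measures = auc)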
Tuning
Goal:
- tune hyperparameters
Steps:
First, set the search space:
bc_params <- makeParamSet(
makeIntegerParam('num.trees', lower = 100, upper = 1000),
makeIntegerParam('mtry', lower = 1, upper = 9)
)
More options:
- makeNumericParam("u", lower=1)
- makeIntegerParam("v", lower=1, upper=2)
- makeDiscreteParam("w", values=1:2)
- makeLogicalParam("x")
- makeDiscreteVectorParam("y", len=2, values=c("a", "b"))
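For instance, a mixed search space over some of the ranger parameters listed by getParamSet() earlier (a sketch):
bc_params_mixed <- makeParamSet(
  makeNumericParam('sample.fraction', lower = 0.5, upper = 1),
  makeDiscreteParam('splitrule', values = c('gini', 'extratrees')),
  makeLogicalParam('replace')
)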
Second, select the optimization algorithm:
bc_control = makeTuneControlRandom(maxit = 5L)
More options:
- makeTuneControlGrid()
- makeTuneControlIrace()
Third, we select the resampling strategy and the performance measure for the tuning. In this case let's reuse the stratified cv5 we created in the resampling section.
bc_tune <- tuneParams(learner = bc_learner,
task = bc_task,
resampling = cv5,
par.set = bc_params,
control = bc_control,
measures = auc)
## [Tune] Started tuning learner bc ranger for parameter set:
## Type len Def Constr Req Tunable Trafo
## num.trees integer - - 100 to 1e+03 - TRUE -
## mtry integer - - 1 to 9 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: num.trees=256; mtry=4
## [Tune-y] 1: auc.test.mean=0.9919186; time: 0.0 min
## [Tune-x] 2: num.trees=334; mtry=7
## [Tune-y] 2: auc.test.mean=0.9895292; time: 0.0 min
## [Tune-x] 3: num.trees=763; mtry=7
## [Tune-y] 3: auc.test.mean=0.9915533; time: 0.0 min
## [Tune-x] 4: num.trees=350; mtry=8
## [Tune-y] 4: auc.test.mean=0.9896999; time: 0.0 min
## [Tune-x] 5: num.trees=769; mtry=5
## [Tune-y] 5: auc.test.mean=0.9919644; time: 0.0 min
## [Tune] Result: num.trees=769; mtry=5 : auc.test.mean=0.9919644
To access the tuning results:
# best-performing parameters
bc_tune$x
## $num.trees
## [1] 769
##
## $mtry
## [1] 5
# best performance
bc_tune$y
## auc.test.mean
## 0.9919644
To update the learner with the best parameters:
final_bc_learner <- setHyperPars(learner = bc_learner,
par.vals = bc_tune$x)
final_bc_model <- train(learner = final_bc_learner,
task = bc_task,
subset = in_train)
final_bc_predict <- predict(object = final_bc_model,
task = bc_task,
subset = in_test)
To view the tuning effect:
tune_data = generateHyperParsEffectData(tune.result = bc_tune)
plotHyperParsEffect(hyperpars.effect.data = tune_data,
x = "iteration", y = "auc.test.mean",
plot.type = "line")
Benchmark Experiments
Goal:
- compare and rank the algorithms
Steps:
First, build individual learners to compare and put in a list.
# rpart learner, to compare against the ranger learner built earlier
bc_learner_rpart <- makeLearner(cl = 'classif.rpart',
predict.type = 'prob',
id = 'bc rpart')
# list learners together
learners <- list(bc_learner, bc_learner_rpart)
Second, choose the resampling strategy and benchmark. Note that we are comparing the learners on the default hyperparameters before tuning. If you want each model to pull out its A-game for the benchmarking, make sure to tune the parameters beforehand.
# choose the resampling strategy
bc_cv5_in <- makeResampleInstance(desc = cv5, task = bc_task)
bc_benchmark <- benchmark(learners = learners,
task = bc_task,
resamplings = bc_cv5_in,
measures = auc,
keep.pred = F) # drop predictions to conserve memory
## Task: breast cancer classification, Learner: bc ranger
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.9922161
## [Resample] iter 2: 0.9927536
## [Resample] iter 3: 0.9898098
## [Resample] iter 4: 0.9891304
## [Resample] iter 5: 0.9948419
##
## Aggregated Result: auc.test.mean=0.9917504
##
## Task: breast cancer classification, Learner: bc rpart
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.9371566
## [Resample] iter 2: 0.9718071
## [Resample] iter 3: 0.9527853
## [Resample] iter 4: 0.9642210
## [Resample] iter 5: 0.9543620
##
## Aggregated Result: auc.test.mean=0.9560664
##
Third, understand the benchmark results.
# performance per iteration
getBMRPerformances(bc_benchmark)
## $`breast cancer classification`
## $`breast cancer classification`$`bc ranger`
## iter auc
## 1 1 0.9922161
## 2 2 0.9927536
## 3 3 0.9898098
## 4 4 0.9891304
## 5 5 0.9948419
##
## $`breast cancer classification`$`bc rpart`
## iter auc
## 1 1 0.9371566
## 2 2 0.9718071
## 3 3 0.9527853
## 4 4 0.9642210
## 5 5 0.9543620
# aggregate performance
getBMRAggrPerformances(bc_benchmark, as.df = T) # as.df makes the results easier to read
## task.id learner.id auc.test.mean
## 1 breast cancer classification bc ranger 0.9917504
## 2 breast cancer classification bc rpart 0.9560664
Finally, visualize the benchmark results:
plotBMRBoxplots(bmr = bc_benchmark,
measure = auc)
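Other handy visualizations (a quick sketch; both work off the aggregated results):
plotBMRSummary(bc_benchmark)         # dot plot of aggregated performance per learner
plotBMRRanksAsBarChart(bc_benchmark) # learner ranks across tasks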