1 Introduction and Overview
Lars Kotthoff
University of Wyoming
Raphael Sonabend
Imperial College London
Natalie Foss
University of Wyoming
Bernd Bischl
Ludwig-Maximilians-Universität München, and Munich Center for Machine Learning (MCML)
Welcome to the Machine Learning in R universe. In this book, we will guide you through the functionality offered by mlr3 step by step. If you want to contribute to our universe, ask any questions, read documentation, or just chat with the team, head to https://github.com/mlr-org/mlr3, which has several useful links in the README.
The mlr3 (Lang et al. 2019) package and the wider mlr3 ecosystem provide a generic, object-oriented, and extensible framework for regression (Section 2.1), classification (Section 2.5), and other machine learning tasks (Chapter 13) for the R language (R Core Team 2019). On the most basic level, the unified interface provides functionality to train, test, and evaluate many machine learning algorithms. You can also take this a step further with hyperparameter optimization, computational pipelines, model interpretation, and much more. mlr3 has similar overall aims to caret and tidymodels for R, scikit-learn for Python, and MLJ for Julia. In general, mlr3 is designed to provide more flexibility than other ML frameworks while still offering easy ways to use advanced functionality. While tidymodels in particular makes it very easy to perform simple ML tasks, mlr3 is more geared towards advanced ML.
Before we can show you the full power of mlr3, we recommend installing the mlr3verse package, which will install several important packages in the mlr3 ecosystem.
install.packages("mlr3verse")
Chapters that were added after the release of the printed version of this book are marked with a ‘+’.
1.1 Installation Guidelines
There are many packages in the mlr3 ecosystem that you may want to use as you work through this book. All our packages can be installed from GitHub and R-universe1; the majority (but not all) can also be installed from CRAN. We recommend adding the mlr-org R-universe to your R options so you can install all packages with install.packages(), without having to worry which package repository each comes from. To do this, install usethis and run the following:
1 R-universe is an alternative package repository to CRAN. The bit of code below tells R to look at both R-universe and CRAN when trying to install packages. R will always install the latest version of a package.
usethis::edit_r_profile()
In the file that opens, add or change the repos argument in options so it looks something like the code below (you might need to add the full code block below or just edit the existing options function).
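The code block referred to here does not appear in this extracted text; as a sketch of what it might look like, based on the mlr-org R-universe address (an assumption, please verify against the online book), the options call would be along these lines:
options(repos = c(
  mlrorg = "https://mlr-org.r-universe.dev",  # mlr-org R-universe
  CRAN = "https://cloud.r-project.org/"       # fall back to CRAN
))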
Save the file, restart your R session, and you are ready to go!
If you want the latest development version of any of our packages, run
remotes::install_github("mlr-org/{pkg}")
with {pkg} replaced with the name of the package you want to install. You can see an up-to-date list of all our extension packages at https://github.com/mlr-org/mlr3/wiki/Extension-Packages.
1.2 How to Use This Book
You could read this book cover to cover, but you may benefit more from dipping in and out of chapters as suits your needs; we have provided a comprehensive index to help you find relevant pages and sections. We do recommend reading the first part of the book in its entirety, as this will provide you with a complete overview of our basic infrastructure and design, which is used throughout our ecosystem.
We have marked sections that are particularly complex with respect to either technical or methodological detail and could be skipped on a first read with the following information box:
Each chapter includes examples, API references, and explanations of methodologies. At the end of each part of the book we have included exercises for you to test yourself on what you have learned; you can find the solutions to these exercises at https://mlr3book.mlr-org.com/solutions.html. We have marked more challenging (and possibly time-consuming) exercises with an asterisk, ’*’.
If you want more detail about any of the tasks used in this book or links to all the mlr3 dictionaries, please see the appendices in the online version of the book at https://mlr3book.mlr-org.com/.
Reproducibility
At the start of each chapter we run set.seed(123) and use renv to manage package versions; you can find our lockfile at https://github.com/mlr-org/mlr3book/blob/main/book/renv.lock.
1.3 mlr3book Code Style
Throughout this book we will use the following code style:
1. We always use = instead of <- for assignment.
2. Class names are in UpperCamelCase.
3. Function and method names are in lower_snake_case.
4. When referencing functions, we will only include the package prefix (e.g., pkg::function) for functions outside the mlr3 universe or when there may be ambiguity about which package the function lives in. Note you can use environment(function) to see which namespace a function is loaded from.
5. We denote packages, fields, methods, and functions as follows:
   - package (highlighted in the first instance)
   - package::function() or function() (see point 4)
   - $field for fields (data encapsulated in an R6 class)
   - $method() for methods (functions encapsulated in an R6 class)
   - Class (for R6 classes primarily; these can be distinguished from packages by context)
Now let us see this in practice with our first example.
1.4 mlr3 by Example
The mlr3 universe includes a wide range of tools taking you from basic ML to complex experiments. To get started, here is an example of the simplest functionality – training a model and making predictions.
library(mlr3)
task = tsk("penguins")
split = partition(task)
learner = lrn("classif.rpart")
learner$train(task, row_ids = split$train)
learner$model
n= 230
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 230 131 Adelie (0.430435 0.200000 0.369565)
2) flipper_length< 207 142 43 Adelie (0.697183 0.295775 0.007042)
4) bill_length< 42.2 94 1 Adelie (0.989362 0.010638 0.000000) *
5) bill_length>=42.2 48 7 Chinstrap (0.125000 0.854167 0.020833)
10) island=Biscoe,Torgersen 7 1 Adelie (0.857143 0.000000 0.142857) *
11) island=Dream 41 0 Chinstrap (0.000000 1.000000 0.000000) *
3) flipper_length>=207 88 4 Gentoo (0.000000 0.045455 0.954545) *
prediction = learner$predict(task, row_ids = split$test)
prediction
<PredictionClassif> for 114 observations:
row_ids truth response
2 Adelie Adelie
3 Adelie Adelie
12 Adelie Adelie
---
340 Chinstrap Gentoo
341 Chinstrap Chinstrap
344 Chinstrap Chinstrap
prediction$score(msr("classif.acc"))
classif.acc
0.9386
In this example, we trained a decision tree on a subset of the penguins dataset, made predictions on the rest of the data, and then evaluated these with the accuracy measure. In Chapter 2 we will break this down in more detail.
The mlr3 interface also lets you run more complicated experiments in just a few lines of code:
library(mlr3verse)
tasks = tsks(c("breast_cancer", "sonar"))
glrn_rf_tuned = as_learner(ppl("robustify") %>>% auto_tuner(
tnr("grid_search", resolution = 5),
lrn("classif.ranger", num.trees = to_tune(200, 500)),
rsmp("holdout")
))
glrn_rf_tuned$id = "RF"
glrn_stack = as_learner(ppl("robustify") %>>% ppl("stacking",
lrns(c("classif.rpart", "classif.kknn")),
lrn("classif.log_reg")
))
glrn_stack$id = "Stack"
learners = c(glrn_rf_tuned, glrn_stack)
bmr = benchmark(benchmark_grid(tasks, learners, rsmp("cv", folds = 3)))
bmr$aggregate(msr("classif.acc"))
task_id learner_id classif.acc
1: breast_cancer RF 0.9649
2: breast_cancer Stack 0.9342
3: sonar RF 0.7536
4: sonar Stack 0.7246
In this (much more complex!) example we chose two tasks and two learners and used automated tuning to optimize the number of trees in the random forest learner (Chapter 4), and a machine learning pipeline that imputes missing data, collapses factor levels, and stacks models (Chapter 7 and Chapter 8). We also showed basic features like loading learners (Chapter 2) and choosing resampling strategies for benchmarking (Chapter 3). Finally, we compared the performance of the models using the mean accuracy with three-fold cross-validation.
You will learn how to do all this and more in this book.
1.5 The mlr3 Ecosystem
Throughout this book, we often refer to mlr3, which may refer to the single mlr3 base package but usually refers to all packages in our ecosystem; this should be clear from context. The mlr3 package provides the base functionality that the rest of the ecosystem depends on for building more advanced machine learning tools. Figure 1.1 shows the packages in our ecosystem that extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and much more.
A complete and up-to-date list of extension packages can be found at https://mlr-org.com/ecosystem.html.
As well as packages within the mlr3 ecosystem, software in the mlr3verse also depends on the following popular and well-established packages:
- R6: The class system predominantly used in mlr3.
- data.table: High-performance extension of R's data.frame.
- digest: Cryptographic hash functions.
- uuid: Generation of universally unique identifiers.
- lgr: Configurable logging library.
- mlbench and palmerpenguins: Machine learning datasets.
- future / future.apply / parallelly: For parallelization (Section 10.1).
- evaluate: For capturing output, warnings, and exceptions (Section 10.2).
We build on R6 for object orientation and data.table to store and operate on tabular data. As both are core to mlr3, we briefly introduce both packages for beginners; in-depth expertise with these packages is not necessary to work with mlr3.
1.5.1 R6 for Beginners
R6 is one of R's more recent paradigms for object-oriented programming. If you have experience with any (class) object-oriented programming then R6 should feel familiar. We focus on the parts of R6 that you need to know to use mlr3.
Objects are created by constructing an instance of an R6Class variable using the $new() initialization method. For example, say we have implemented a class called Foo; then foo = Foo$new(bar = 1) would create a new object of class Foo and set the bar argument of the constructor to the value 1. In practice, we implement a lot of sugar functionality (Section 1.6) in mlr3 that makes construction and access a bit more convenient.
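As a minimal sketch of this pattern (the Foo class here is hypothetical, purely for illustration, and not part of mlr3):
library(R6)

# A small illustrative class with one field and one method
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    initialize = function(bar) {
      self$bar = bar
    },
    double_bar = function() {
      self$bar = self$bar * 2
      invisible(self) # returning self enables method chaining
    }
  )
)

foo = Foo$new(bar = 1)
foo$bar # access the field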
Some R6 objects may have mutable states that are encapsulated in their fields, which can be accessed through the dollar, $, operator. Continuing the previous example, we can access the bar value in the foo object by using foo$bar, or we could give it a new value, e.g. foo$bar = 2. These fields can also be ‘active bindings’, which perform additional computations when referenced or modified.
In addition to fields, methods allow users to inspect the object’s state, retrieve information, or perform an action that changes the internal state of the object. For example, in mlr3, the $train() method of a learner changes the internal state of the learner by building and storing a model. Methods that modify the internal state of an object often return the object itself. Other methods may return a new R6 object. In both cases, it is possible to ‘chain’ methods by calling one immediately after the other using the $ operator; this is similar to the %>% operator used in tidyverse packages. For example, Foo$bar()$hello_world() would run the $bar() method of the object Foo and then the $hello_world() method of the object returned by $bar() (which may be Foo itself).
Fields and methods can be public or private. The public fields and methods define the API to interact with the object. In mlr3, you can safely ignore private methods unless you are looking to extend our universe by adding a new class (Chapter 10).
Finally, R6 objects are environments, and as such have reference semantics. This means that, for example, foo2 = foo does not create a new variable called foo2 that is a copy of foo. Instead, it creates a variable called foo2 that references foo, and so setting foo$bar = 3 will also change foo2$bar to 3, and vice versa. To copy an object, use the $clone(deep = TRUE) method; so to copy foo: foo2 = foo$clone(deep = TRUE).
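A short sketch of reference semantics and cloning, again using a hypothetical Foo class for illustration:
library(R6)

Foo = R6Class("Foo", public = list(bar = 1))
foo = Foo$new()

foo2 = foo                    # a reference, not a copy
foo$bar = 3
foo2$bar                      # 3: both names point to the same object

foo3 = foo$clone(deep = TRUE) # an independent copy
foo$bar = 5
foo3$bar                      # still 3: the clone is unaffected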
For a longer introduction, we recommend the R6 vignettes found at https://r6.r-lib.org/; more detail can be found at https://adv-r.hadley.nz/r6.html.
1.5.2 data.table for Beginners
The package data.table implements data.table(), which is a popular alternative to R’s data.frame(). We use data.table because it is blazingly fast and scales well to bigger data.
As with data.frame, data.tables can be constructed with data.table() or as.data.table():
library(data.table)
# converting a matrix with as.data.table
as.data.table(matrix(runif(4), 2, 2))
V1 V2
1: 0.2989 0.5856
2: 0.1594 0.1488
# using data.table
dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
dt
x y
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
6: 6 c
data.tables can be used much like data.frames, but they provide additional functionality that makes complex operations easier. For example, data can be summarized by groups with a by argument in the [ operator, and they can be modified in-place with the := operator.
# mean of x column in groups given by y
dt[, mean(x), by = "y"]
y V1
1: a 1.5
2: b 3.5
3: c 5.5
# adding a new column with :=
dt[, z := x * 3]
dt
x y z
1: 1 a 3
2: 2 a 6
3: 3 b 9
4: 4 b 12
5: 5 c 15
6: 6 c 18
Finally, data.table also uses reference semantics, so you will need to use copy() to clone a data.table. For an in-depth introduction, we recommend the vignette “Introduction to Data.table” (2023).
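A short sketch of this behavior:
library(data.table)

dt = data.table(x = 1:3)
dt2 = dt            # a reference: both names point to the same table
dt2[, y := x * 2]   # modifying dt2 in-place also adds y to dt
"y" %in% names(dt)  # TRUE

dt3 = copy(dt)      # an independent copy
dt3[, z := 0]
"z" %in% names(dt)  # FALSE: dt is unaffected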
1.6 Essential mlr3 Utilities
mlr3 includes a few important utilities that are essential to simplifying code in our ecosystem.
Sugar Functions
Most objects in mlr3 can be created through convenience functions called helper functions or sugar functions. They provide shortcuts for common code idioms, reducing the amount of code a user has to write. For example, lrn("regr.rpart") returns the learner without having to explicitly create a new R6 object. We heavily use sugar functions throughout this book and provide the equivalent “full form” for complete detail at the end of each chapter. The sugar functions are designed to cover the majority of use cases for most users; knowledge about the full R6 backend is only required if you want to build custom objects or extensions.
Many object names in mlr3 are standardized according to the convention mlr_<type>_<key>, where <type> will be tasks, learners, measures, or other classes that will be covered in the book, and <key> refers to the ID of the object. To simplify the process of constructing objects, you only need to know the object key and the sugar function for constructing the type. For example: mlr_tasks_mtcars becomes tsk("mtcars"); mlr_learners_regr.rpart becomes lrn("regr.rpart"); and mlr_measures_regr.mse becomes msr("regr.mse"). Throughout this book, we will refer to all objects using this abbreviated form.
Dictionaries
mlr3 uses dictionaries to store R6 classes, which associate keys (unique identifiers) with objects (R6 objects). Values in dictionaries are often accessed through sugar functions that retrieve objects from the relevant dictionary; for example, lrn("regr.rpart") is a wrapper around mlr_learners$get("regr.rpart") and is thus a simpler way to load a decision tree learner from mlr_learners. We use dictionaries to group large collections of relevant objects so they can be listed and retrieved easily. For example, you can see an overview of available learners (that are in loaded packages) and their properties with as.data.table(mlr_learners) or by calling the sugar function without any arguments, e.g. lrn().
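As a short sketch of this equivalence (assuming mlr3 is installed), the sugar call and the explicit dictionary lookup construct the same class of learner:
library(mlr3)

# Sugar function vs. explicit dictionary retrieval
l1 = lrn("regr.rpart")
l2 = mlr_learners$get("regr.rpart")
class(l1)[1] == class(l2)[1] # the same learner class either way

# List available learners and their properties as a table
head(as.data.table(mlr_learners))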
mlr3viz
mlr3viz includes all plotting functionality in mlr3 and uses ggplot2 under the hood. We use theme_minimal() in all our plots to unify our aesthetic, but as with all ggplot outputs, users can fully customize this. mlr3viz extends fortify and autoplot for use with common mlr3 outputs including Prediction, Learner, and BenchmarkResult objects (which we will introduce and cover in the next chapters). We will cover major plot types throughout the book. The best way to learn about mlr3viz is through experimentation; load the package and see what happens when you run autoplot on an mlr3 object. Plot types are documented in the respective manual page that can be accessed through ?autoplot.<class>; for example, you can find different types of plots for regression tasks by running ?autoplot.TaskRegr.
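A minimal sketch of this experimentation workflow (assuming mlr3verse is installed):
library(mlr3verse)

# autoplot dispatches on the object's class; for a classification
# task it visualizes the target distribution
autoplot(tsk("penguins"))

# it also works on predictions, learners, resampling results, etc.
task = tsk("penguins")
split = partition(task)
learner = lrn("classif.rpart")
learner$train(task, split$train)
autoplot(learner$predict(task, split$test))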
1.7 Design Principles
Learning from over a decade of design and adaptation from mlr to mlr3, we now follow these design principles in the mlr3 ecosystem:
- Object-oriented programming. We embrace R6 for a clean, object-oriented design, object state changes, and reference semantics. This means that the state of common objects (e.g. tasks (Section 2.1) and learners (Section 2.2)) is encapsulated within the object, for example, to keep track of whether a model has been trained, without the user having to worry about this. We also use inheritance to specialize objects, e.g. all learners are derived from a common base class that provides basic functionality.
- Tabular data. Embrace data.table for its top-notch computational performance as well as tabular data as a structure that can be easily processed further.
- Unified tabular input and output data formats. This considerably simplifies the API and allows easy selection and “split-apply-combine” (aggregation) operations. We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
- Defensive programming and type safety. All user input is checked with checkmate (Lang 2017). We use data.table, which has behavior that is more consistent than several base R methods (e.g., indexing data.frames simplifies the result when the drop argument is omitted). And we have extensive unit tests!
- Light on dependencies. One of the main maintenance burdens for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3, which makes installation and maintenance easier. We still provide the same functionality, but it is split into more packages that have fewer dependencies individually.
- Separation of computation and presentation. Most packages of the mlr3 ecosystem focus on processing and transforming data, applying ML algorithms, and computing results. Our core packages do not provide visualizations because their dependencies would make installation unnecessarily complex, especially on headless servers (i.e., computers without a monitor where graphical libraries are not installed). Hence, visualizations of data and results are provided in mlr3viz.
1.8 Citation
Please cite this chapter as:
Kotthoff L, Sonabend R, Foss N, Bischl B. (2024). Introduction and Overview. In Bischl B, Sonabend R, Kotthoff L, Lang M, (Eds.), Applied Machine Learning Using mlr3 in R. CRC Press. https://mlr3book.mlr-org.com/introduction_and_overview.html.
@incollection{citekey,
author = "Lars Kotthoff and Raphael Sonabend and Natalie Foss and Bernd Bischl",
title = "Introduction and Overview",
booktitle = "Applied Machine Learning Using {m}lr3 in {R}",
publisher = "CRC Press", year = "2024",
editor = "Bernd Bischl and Raphael Sonabend and Lars Kotthoff and Michel Lang",
url = "https://mlr3book.mlr-org.com/introduction_and_overview.html"
}