The Sisyphean workflow:
targets is an R package developed by Will Landau that allows you to organize a project in the form of a processing pipeline, made up of different stages, with the dependencies between them managed automatically.
This organization has several advantages.
To prepare your code for targets, transform blocks of code into functions by wrapping them in my_function <- function() { my_code } and setting appropriate parameters.
In targets, a data analysis pipeline is a collection of target objects that express the individual steps of the workflow, from the raw data files to the final outputs (e.g. a Quarto report).
Each target is declared with tar_target(), providing the name of the target as the first argument and the command that creates it as the second argument; optional arguments such as format specify the 'type' of the target.
The pipeline is written in an R file: by default, targets looks for a _targets.R file located at the root of your project.
But this can be changed and the pipeline can be defined in an R file of your choice; you'll have to set it up using the following (e.g. in your make.R file):
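A minimal sketch using tar_config_set() (the script and store paths below are hypothetical examples):

```r
library(targets)

# Point {targets} to a custom pipeline script and a custom store location.
tar_config_set(
  script = "analyses/pipeline.R",     # pipeline definition, instead of _targets.R
  store  = "analyses/pipeline_store"  # where target results will be saved
)
```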
Note that we also specified a custom location to store the targets.
library(targets)
list(
# Make the workflow depend on the raw data file
tar_target(name = raw_data_file, command = here::here("data", "airquality.csv"),
format = "file"),
# Read the data and return a data.frame
tar_target(name = raw_data, command = read.csv(raw_data_file)),
# Transform the data
tar_target(data, raw_data |> dplyr::filter(!is.na(Ozone))),
# Explore the data (custom function)
tar_target(hist, hist(data$Ozone)),
# Model the data
tar_target(fit, lm(Ozone ~ Wind + Temp, data))
)
All target script files have these requirements:
Load the packages needed to define the pipeline, including the targets package itself.
Load your custom functions, e.g. with targets::tar_source().
End with a list of tar_target() objects. Each target is an intermediate step of the workflow. At minimum, a target must have a name and an R expression, but it's better if it uses a function that you defined in R/.
Once the pipeline is ready and inspected via targets::tar_visnetwork(), you can run it with:
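That is, with the standard targets run function:

```r
library(targets)

# Build every target that is outdated or has never been built.
tar_make()
```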
And inspect it again with targets::tar_visnetwork().
Target objects are stored in the targets store, but you don't need to think about it when retrieving your results. Just load or read targets by their names: tar_load() loads a target into the current workspace, while tar_read() reads a target's value so you can assign it to a new object.
Let's modify the target "hist" to use ggplot2 instead of base R graphics.
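For example (assuming the pipeline above has already been run with tar_make()):

```r
library(targets)

tar_load(fit)            # creates an object named `fit` in the workspace
my_fit <- tar_read(fit)  # reads the value and assigns it to a new name
```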
We will create a function that generates a ggplot2 histogram object and write its definition in R/functions.R:
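A possible definition (the number of bins and other aesthetics are assumptions; any ggplot2 histogram of Ozone works):

```r
# R/functions.R
# Return (not print) a ggplot2 histogram of the Ozone variable,
# so the plot object itself becomes the value of the target.
make_hist <- function(data) {
  ggplot2::ggplot(data, ggplot2::aes(x = Ozone)) +
    ggplot2::geom_histogram(bins = 12)
}
```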
And modify the pipeline to use it:
library(targets)
tar_source() #load functions in R/
list(
# Make the workflow depend on the raw data file
tar_target(name = raw_data_file, command = here::here("data", "airquality.csv"),
format = "file"),
# Read the data and return a data.frame
tar_target(name = raw_data, command = read.csv(raw_data_file)),
# Transform the data
tar_target(data, raw_data |> dplyr::filter(!is.na(Ozone))),
# Explore the data (custom function)
tar_target(hist, make_hist(data)),
# Model the data
tar_target(fit, lm(Ozone ~ Wind + Temp, data))
)
And inspect it again with tar_visnetwork(), then run it again with tar_make(): only the targets affected by the change are rebuilt.
This is the targets development cycle:
Write an R function
Add a target to the pipeline
Visualize the pipeline
Make the pipeline
Check the results
Write the next function, and repeat.
To reproducibly track an external input file, you need to define a new target that has:
a command that returns the file path as a character vector, and
format = "file" in tar_target().
When the target runs in the pipeline, the returned character vector gets recorded, and targets watches the data file, invalidating the target when that file changes.
To track multiple files this way, simply define a multi-element character vector where each element is a path.
Each element can also be a directory, but this directory must not be empty at the time the target runs.
A target with format = "file" treats the entire set of files as an irreducible bundle.
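For instance (both file names are hypothetical), this single target bundles two files:

```r
library(targets)

# Both paths are watched together: a change to either file
# invalidates this one target as a whole.
tar_target(
  raw_files,
  c(here::here("data", "airquality.csv"),
    here::here("data", "stations.csv")),
  format = "file"
)
```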
That means in order to “branch” over files, each file should be associated with its own target.
This is not optimal at all. The pattern argument solves this by providing a way to dynamically create a target for each input file.
Here is a pipeline that begins with data files and loads each into a different dynamic branch.
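A sketch of such a pipeline (file paths hypothetical); note the first two targets, which are the pair that tar_files() abbreviates:

```r
library(targets)

list(
  # Target 1: the plain vector of paths (not yet file-tracked)
  tar_target(paths, c("data/a.csv", "data/b.csv")),
  # Target 2: one file-tracked branch per path
  tar_target(files, paths, format = "file", pattern = map(paths)),
  # One dynamic branch per file: each file is read independently
  tar_target(data, read.csv(files), pattern = map(files))
)
```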
The tar_files() function from the tarchetypes package is shorthand for the first two targets above.
Output files have the same mechanics as input files. The target uses format = "file", and the return value is a vector of paths to generated files.
The only difference here is that the target’s R command writes to storage before it returns a value.
For example, here is an output file target that saves a visualization.
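A sketch of both pieces (the helper's body and the figure path are assumptions; only the "save, then return the path" contract matters):

```r
# In R/functions.R: save a plot to disk and return the file path,
# as format = "file" targets require.
save_plot_and_return_path <- function(plot) {
  path <- here::here("figures", "hist.png")
  ggplot2::ggsave(path, plot)
  path  # {targets} records and watches this returned path
}

# In the pipeline:
tar_target(plot_file, save_plot_and_return_path(hist), format = "file")
```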
Here, our custom save_plot_and_return_path() function does exactly what the name describes.
If you render a Quarto report as part of a target, the report should be lightweight: mostly prose, minimal code, fast execution, and no output other than the rendered HTML/PDF document.
Important
In other words, Quarto reports are just targets that document prior results.
The bulk of the computation should have already happened upstream, and most of the code chunks in the report itself should be terse calls to tar_read() and tar_load().
The report depends on the targets fit and hist.
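A sketch of what the report's R chunks might contain (chunk options omitted; the presentation code is illustrative):

```r
# Inside code chunks of report.qmd:
library(targets)

tar_load(fit)    # load the model fitted upstream
summary(fit)     # terse presentation code only

tar_read(hist)   # read and print the histogram target
```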
The use of tar_read() and tar_load() allows us to run the report outside the pipeline. As long as the targets store folder has data on the required targets from a previous tar_make(), you can open the RStudio IDE, edit the report, and click the Render button like you would for any other Quarto report.
The target definition function to render Quarto documents is part of the tarchetypes R package and looks like this.
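Per the tarchetypes API:

```r
library(targets)
library(tarchetypes)

# Declare a target that renders report.qmd and tracks its dependencies.
tar_quarto(report, path = "report.qmd")
```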
Because the symbols fit and hist appear in the report via tar_load() and tar_read(), targets knows that report depends on fit and hist.
When we put the report target in the pipeline, these dependency relationships show up in the graph.
library(targets)
tar_source() #load functions in R/
list(
# Make the workflow depend on the raw data file
tar_target(name = raw_data_file, command = here::here("data", "airquality.csv"),
format = "file"),
# Read the data and return a data.frame
tar_target(name = raw_data, command = read.csv(raw_data_file)),
# Transform the data
tar_target(data, raw_data |> dplyr::filter(!is.na(Ozone))),
# Explore the data (custom function)
tar_target(hist, make_hist(data)),
# Model the data
tar_target(fit, lm(Ozone ~ Wind + Temp, data)),
tarchetypes::tar_quarto(report, "report.qmd")
)
As always, tar_make()
will run the pipeline and compile the report.
Optimize your workflow
Reproducible for others and your future self
100% sure to be reproducible
You can count on targets' brain and work in a clean environment
The package is well maintained and documented, with a great manual