3 Faster Package Installs

You may have noticed that when installing packages in the notebook it can take a while. It could be minutes, hours in extreme cases, to install the suite of packages your project requires. This is especially tedious if you need to do this every time a job runs, or each morning when your cluster is started.

Clusters are ephemeral and by default have no persistent storage, therefore installed packages will not be available on restart.

By default Databricks installs packages from CRAN. CRAN does not provide pre-compiled binaries for Linux (Databricks clusters’ underlying virtual machines are Linux, Ubuntu specifically).

Posit to save the day! Posit provides a public package manager that has all packages from CRAN (and Bioconductor!). There is a helpful wizard to get started.

With our new found knowledge we can make installing R packages within Databricks significantly faster. There are multiple ways to solve this, each differing slightly, but fundamentally the same.

3.1 Setting Repo within Notebook

The quickest method is to follow the wizard and adjust the repos option:

# set the user agent string otherwise pre-compiled binarys aren't used
# e.g. selecting Ubuntu 22.04 in wizard 
options(
1  HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])),
  repos = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"
)

1: HTTPUserAgent is required when using R 3.6 or later

This works well but not all versions of the Databricks Runtime use the same version of Ubuntu.

It’s easier to detect the Ubuntu release code name dynamically:

1release <- system("lsb_release -c --short", intern = T)

# set the user agent string otherwise pre-compiled binarys aren't used
options(
  HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])),
  repos = paste0("https://packagemanager.posit.co/cran/__linux__/", release, "/latest")
)

1: system is used to run the command to retrieve the release code name

The downside of this method is that it requires every notebook to adjust the repos and HTTPUserAgent options.

3.2 Cluster Environment Variable & Init Script

Databricks clusters allow specification of environment variables, there is a specific variable (DATABRICKS_DEFAULT_R_REPOS) that can be set to adjust the default repository for the entire cluster.

You can again refer to the wizard, the environment variables section of cluster should be:

DATABRICKS_DEFAULT_R_REPOS=<posit-package-manager-url-goes-here>

Unfortunately this isn’t as dynamic as the first option and you still need to set the HTTPUserAgent in Rprofile.site via an init script.

The init script will be:

#!/bin/bash
# Append changes to Rprofile.site
cat <<EOF >> "/etc/R/Rprofile.site"
options(
  HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"]))
)
EOF

Important

Due to how Databricks starts up the R shell for notebook sessions it’s not straightforward to adjust the repos option in an init script alone.

DATABRICKS_DEFAULT_R_REPOS is referenced as part of the startup process after Rprofile.site is executed and will override any earlier attempt to adjust repos.

Therefore you’ll need to use both the init script and the environment variable configuration.

3.3 Setting Repo for Cluster Library

Note

Similar to setting DATABRICKS_DEFAULT_R_REPOS this requires the HTTPUserAgent also to be set and it’s unlikely to be helpful other than for it’s purpose of installing a package to make it available for all cluster users.

Cluster libraries can install R packages and support specification of the repository.