You may have noticed that installing packages in a notebook can take a while. It can take minutes, or hours in extreme cases, to install the suite of packages your project requires. This is especially tedious if you need to do it every time a job runs, or each morning when your cluster starts.
Clusters are ephemeral and by default have no persistent storage, so installed packages are not available after a restart.
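By default Databricks installs packages from CRAN. CRAN does not provide pre-compiled binaries for Linux (Databricks clusters’ underlying virtual machines are Linux, Ubuntu specifically), so every package has to be compiled from source at install time, which is what makes installation slow.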
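To see how a session is currently configured, a quick check in a notebook cell can help. This is a minimal sketch using only base R; the values printed depend on your cluster and runtime.

```r
# Where install.packages() currently looks for packages
print(getOption("repos"))

# The package type R prefers on this platform; Linux builds of R report "source"
print(.Platform$pkgType)

# The user agent string R sends when downloading packages
print(getOption("HTTPUserAgent"))
```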
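With this newfound knowledge we can make installing R packages within Databricks significantly faster. There are multiple ways to do this. They differ slightly but are fundamentally the same: point R at a repository that serves pre-compiled Linux binaries and set the user agent so that those binaries are actually used.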
3.1 Setting Repo within Notebook
The quickest method is to follow the Posit Package Manager setup wizard and adjust the repos option:
    # set the user agent string, otherwise pre-compiled binaries aren't used
    # e.g. selecting Ubuntu 22.04 in the wizard
    options(
      HTTPUserAgent = sprintf(
        "R/%s R (%s)",
        getRversion(),
        paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
      ),
      repos = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"
    )
To avoid hard-coding the release, look up the code name at runtime:

    # run lsb_release to retrieve the Ubuntu release code name (e.g. "jammy")
    release <- system("lsb_release -c --short", intern = TRUE)

    # set the user agent string, otherwise pre-compiled binaries aren't used
    options(
      HTTPUserAgent = sprintf(
        "R/%s R (%s)",
        getRversion(),
        paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
      ),
      repos = paste0("https://packagemanager.posit.co/cran/__linux__/", release, "/latest")
    )

Here system() is used to run the command that retrieves the release code name.
The downside of this method is that it requires every notebook to adjust the repos and HTTPUserAgent options.
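One way to cut down on the repetition is to wrap both settings in a small function kept in a shared file. The sketch below is illustrative only: the function name `use_binary_repo` and its arguments are hypothetical, and it assumes the same Posit Package Manager URL pattern used above.

```r
# Hypothetical helper: the name and arguments are illustrative, not part of any package.
use_binary_repo <- function(release = system("lsb_release -c --short", intern = TRUE),
                            base_url = "https://packagemanager.posit.co/cran/__linux__") {
  # Same user agent string as in the snippets above
  user_agent <- sprintf(
    "R/%s R (%s)",
    getRversion(),
    paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
  )

  # Build the repository URL for the given release code name
  repo <- paste0(base_url, "/", release, "/latest")

  options(HTTPUserAgent = user_agent, repos = repo)
  invisible(repo)
}
```

Each notebook would still have to load the file and call `use_binary_repo()`, so this shortens the boilerplate rather than removing it.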
3.2 Cluster Environment Variable & Init Script
Databricks clusters allow you to specify environment variables. One of them, DATABRICKS_DEFAULT_R_REPOS, sets the default repository for the entire cluster.
You can again refer to the wizard for the repository URL. In the environment variables section of the cluster configuration, set DATABRICKS_DEFAULT_R_REPOS to that URL.
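As an illustration, assuming Ubuntu 22.04 (jammy) as in the notebook example in section 3.1, the entry could look like the line below. The URL is an assumption carried over from that example; use whatever the wizard gives you for your release.

```text
DATABRICKS_DEFAULT_R_REPOS=https://packagemanager.posit.co/cran/__linux__/jammy/latest
```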
Unfortunately this isn’t as dynamic as the first option, because the release is fixed in the URL, and you still need to set the HTTPUserAgent in Rprofile.site via an init script.
The init script will be:
    #!/bin/bash
    # Append changes to Rprofile.site
    cat <<EOF >> "/etc/R/Rprofile.site"
    options(
      HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"]))
    )
    EOF
Important
Due to how Databricks starts up the R shell for notebook sessions it’s not straightforward to adjust the repos option in an init script alone.
DATABRICKS_DEFAULT_R_REPOS is referenced as part of the startup process after Rprofile.site is executed and will override any earlier attempt to adjust repos.
Therefore you’ll need to use both the init script and the environment variable configuration.
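After the cluster restarts, it can be worth confirming from a notebook that both pieces took effect. The following is a minimal sketch in base R; the helper name `check_r_repo_config` is hypothetical.

```r
# Hypothetical helper for inspecting the cluster-level setup; the name is illustrative.
check_r_repo_config <- function() {
  list(
    # NA here means the environment variable is not set on the cluster
    env_var    = Sys.getenv("DATABRICKS_DEFAULT_R_REPOS", unset = NA),
    # Repository that install.packages() will use in this session
    repos      = getOption("repos"),
    # User agent string that the init script is meant to set
    user_agent = getOption("HTTPUserAgent")
  )
}

str(check_r_repo_config())
```

If everything is configured as described, `repos` should show the URL from the environment variable and `user_agent` should start with `R/`.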
3.3 Setting Repo for Cluster Library
Note
As with setting DATABRICKS_DEFAULT_R_REPOS, this requires the HTTPUserAgent to be set as well, and it’s unlikely to be helpful beyond its purpose of installing a package so that it is available to all cluster users.
Cluster libraries can install R packages and support specification of the repository.