GRAN: Groundhog R Archive Neighbor
updated: 2023-04-19
Groundhog uses a database of all packages ever posted to CRAN to determine which version of each package to load, and to install it if it is not already installed (that database: https://groundhogr.com/cran.toc.rds). When a package needs to be installed, groundhog looks first for a binary version, which installs more quickly (CRAN offers binaries for Windows and Mac, but not Linux). CRAN, however, deletes binaries once new versions of packages are released, so groundhog needs to find binaries for older versions elsewhere. Until 2023, Microsoft provided old binaries through MRAN, its archive of daily snapshots of CRAN, and groundhog would download the binaries it needed from there. But Microsoft abandoned MRAN in 2023.
Starting with groundhog v3.0.0, old binaries are obtained from a custom-built repository called GRAN: Groundhog R Archive Neighbor. This page gives some background on GRAN.
URL for files
The old binaries are stored in an S3 bucket, http://gran.groundhogr.com (hosted by wasabi.com, not Amazon).
S3 buckets are difficult to navigate, so it is worth knowing the URL structure:
http://gran.groundhogr.com/<operating system>/<r version>/<download date>/<file name>.
For example, this is the URL for a rio package binary:
http://gran.groundhogr.com/windows/4.1/2021-06-02/rio_0.5.26.zip
The date in the URL is the date on which that binary was obtained from CRAN.
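As an illustration of that pattern, a small helper along these lines could assemble a download URL (the helper itself is hypothetical; only the URL pattern comes from this page):

# Hypothetical helper: build a GRAN download URL from its components,
# following the pattern <operating system>/<r version>/<download date>/<file name>.
gran_url <- function(os, r_version, date, file) {
  paste("http://gran.groundhogr.com", os, r_version, date, file, sep = "/")
}

gran_url("windows", "4.1", "2021-06-02", "rio_0.5.26.zip")
# "http://gran.groundhogr.com/windows/4.1/2021-06-02/rio_0.5.26.zip"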
Index of all binaries
GRAN has over 700,000 binaries. Separate data frames index the binaries available for each combination of operating system and R version; all of them are publicly available at https://groundhogr.com/gran.toc
That index page also links to an R script that combines all these files into a single master index.
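For example, assuming each index file is an .rds holding a data frame, combining a few of them might look roughly like this (the file names below are placeholders; the real list, and the official combining script, are on the index page):

# Sketch: stack several per-platform index files into one master index.
# File names are placeholders; see https://groundhogr.com/gran.toc for the real ones.
toc_files <- c("windows_4.1.rds", "macos_4.1.rds")
master_index <- do.call(rbind, lapply(toc_files, readRDS))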
Under the hood: which binaries are saved?
The approach to archiving taken by GRAN is quite different from traditional snapshots like the ones MRAN used to take, or the ones Posit (RStudio) takes. Instead of copying everything on CRAN by ‘brute force’ on a daily basis, GRAN uses judgment about which files groundhog will actually need to ensure reproducibility, and saves only those. Many files on CRAN do not need to be archived in a separate backup archive like GRAN. For example, consider these three big categories of files that are on CRAN, and are thus copied by MRAN and Posit, but are not actually needed:
1. Source files for all packages ever posted to CRAN
(CRAN itself stores old source package versions, so there is no need to copy those, and certainly not daily!)
2. Binary files for non-current versions of R
(at any given time CRAN hosts binaries built for previous and development versions of R; we do not need those either, because for reproducibility we should use the version of R that matches the desired date anyway)
3. Already saved binaries for a given package version
Oversimplifying at first: we need just one file for a given OS, R version, and package version, so once we have the binary for rio 0.5.16, for R 4.1 on Windows, we do not need to download it again. MRAN would download that same file every day; even if it had not changed in three years, it kept being re-downloaded and re-archived. The same is true of Posit. GRAN keeps, in concept, only one copy of each binary (see the sketch after this list), but read the ‘multiple copies’ subsection below for deviations from this general rule.
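A minimal sketch of that deduplication rule, assuming an index data frame with columns along the lines of os, r_version, package, and version (the column names are assumptions for illustration):

# Toy index with a duplicated (os, R version, package, version) combination.
index <- data.frame(
  os        = c("windows", "windows", "macos"),
  r_version = c("4.1", "4.1", "4.1"),
  package   = c("rio", "rio", "rio"),
  version   = c("0.5.26", "0.5.26", "0.5.26"),
  date      = c("2021-06-02", "2021-07-01", "2021-06-02")
)

# Keep only the first binary seen for each combination.
key <- with(index, paste(os, r_version, package, version, sep = "|"))
index[!duplicated(key), ]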
Thanks to (1)-(3), the total footprint of GRAN is small. When it started in 2023, including Mac and Windows binaries for all packages posted to CRAN since 2014, it was about 800 GB in total. MRAN, in contrast, needed less than a week, not ten years, to fill that much storage.
GRAN is kept up to date by an R script that runs daily and copies any new CRAN binaries to GRAN.
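That script is not reproduced here, but a rough sketch of the idea for one platform, using only base R and assuming a GRAN master index with a ‘file’ column (everything about GRAN’s internals below is an assumption), would be:

# Sketch of a daily pass for Windows binaries matching the running R version.
repo <- "https://cran.r-project.org"
cran_bin <- available.packages(contriburl = contrib.url(repo, type = "win.binary"))

# File names CRAN currently offers as Windows binaries.
cran_files <- paste0(cran_bin[, "Package"], "_", cran_bin[, "Version"], ".zip")

# Assumed: a master index of files already archived in GRAN, with a 'file' column.
gran_index <- readRDS("gran_index_windows.rds")

# Download only the binaries GRAN does not have yet.
for (f in setdiff(cran_files, gran_index$file)) {
  download.file(
    url      = paste(contrib.url(repo, type = "win.binary"), f, sep = "/"),
    destfile = file.path("staging", f),
    mode     = "wb"
  )
}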
Well, there are multiple copies of some binaries
While point 3 above implies that just a single copy for a given OS, R version, and package version would be saved, GRAN in fact contains many packages with multiple files for the exact same combination. In other words, GRAN does still have some redundancy. The reason is that, in theory, a binary for a given package version can stop working when a dependency is updated, and in that case the binary needs to be rebuilt; using the old binary instead would create a problem. There is little information about how likely this situation is or how often it actually happens, and documented cases are few. But to err on the side of caution, GRAN has some protection against this rare situation. The approach differs for Windows and Mac binaries, because Windows binaries are rebuilt almost daily, without any explicit justification, simply as a general policy, while Mac binaries are rebuilt only when the need to rebuild them actually arises.
For Mac, then, GRAN saves every binary ever posted to CRAN, on the assumption that if a new binary was created, there was a reason for it.
For Windows we cannot rely on the judgment behind rebuilding, because there is no judgment: it is just done. Saving every binary would therefore move us towards the excessively redundant model followed by Posit and MRAN. For Windows, then, the compromise is to download, on the 1st of each month, all binaries for ‘critical’ packages, even if GRAN already has that binary. “Critical” packages are defined as those in the top 4% by number of reverse dependencies (i.e., many packages depend on that package working, so if it breaks, many other packages break), or as those taking more than 4 minutes to install from source. In addition, if the Mac binary for a package was rebuilt, GRAN saves a new copy of that binary for Windows as well, on the premise that if something went wrong and was detected by the Mac maintainers, it may also have affected the Windows binaries. Finally, keep in mind that if a binary fails to install properly, users can always install from source. The probability of failure seems very small, and the cost is small as well, so limited insurance against it seems appropriate.
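As a rough illustration of the reverse-dependency half of that rule (the 4-minute install-time criterion is not shown, and the exact dependency types GRAN counts are an assumption), reverse dependencies can be counted with base tools and the top 4% flagged:

# Count reverse dependencies for every current CRAN package and flag the top 4%.
ap <- available.packages(repos = "https://cran.r-project.org")

rev_deps <- tools::package_dependencies(
  packages = rownames(ap),
  db       = ap,
  which    = c("Depends", "Imports", "LinkingTo"),  # assumed dependency types
  reverse  = TRUE
)

n_rev    <- lengths(rev_deps)                 # reverse dependencies per package
cutoff   <- quantile(n_rev, probs = 0.96)     # top-4% threshold
critical <- names(n_rev)[n_rev >= cutoff]

head(sort(n_rev, decreasing = TRUE))          # the most depended-upon packages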
This approach to redundancy avoidance achieves a very substantial reduction in storage needs (over 90%) while leaving minimal room for an incompatibility to arise.