Using groundhog with GitHub & GitLab Packages

updated: 2022 03 11

The current version of groundhog on CRAN, v1.5.0, can install and load CRAN packages.
The development version of groundhog, v1.9.9.9999 (to become v2.0.0 upon release on CRAN later in 2022), can additionally install and load packages from git repositories on GitHub and GitLab.

You may install it with:

remotes::install_github('CredibilityLab/groundhog')

You can then use groundhog to load packages from git repositories by adding “user/” before the package name (and, for GitLab, “gitlab::” before that), like this:

library(groundhog)
groundhog.library('crsh/papaja','2022-03-01')
groundhog.library('gitlab::jimhester/covr','2022-03-01')

As with CRAN packages, groundhog can store multiple versions of the same git package side by side locally. You may use the development version of a package (as of a fixed date) in one script and the CRAN version (as of another fixed date) in another script, in a modular and independent fashion, just by changing what goes inside the groundhog.library() parentheses.

Note: Version control for git packages is less reliable than for CRAN packages.

This lower reliability has two causes, both ultimately arising from the fact that developers have more control over packages hosted on git repositories than over packages on CRAN, and so more ways to break them. Specifically, on git repositories developers can (1) delete packages, and (2) submit incorrect dates for when a package version or change was made, resulting in imperfect version control. See details below.

Details on why code based on GitHub and GitLab packages is less reliably reproducible

1. Deletion.
Packages on CRAN are basically never deleted (there are a handful of examples out of tens of thousands of packages). But packages on GitHub and GitLab can be deleted at any time. Since groundhog does not (currently) make a backup of git repositories, if a developer were to delete a package, there would be no simple way to reproduce code that depends on that package.

2. Incorrect timestamps: commit vs push times
Another way in which code based on git packages is less reliably reproducible than code based on CRAN packages is that the record of when each version of a package became available on a git repository is less reliable than the publication date on CRAN. Specifically, git repositories store the timestamp of when a change was made on the developer’s copy of the repository (when they did the ‘commit’), not of when that change was submitted to the public repository (‘pushed’). Developers can push changes long after they committed them, and when they do, the git repository stores as the timestamp the first date, when the change was committed, not the date it was pushed. This poses a challenge to the reproducibility of code based on git packages.

For example, imagine a developer changes a package on March 1st and saves the change locally that day (‘committed’), but does not save it to GitHub (‘pushed’) until March 15th. The public version of the package officially available on, say, March 10th would then change retroactively once the change is pushed. Loading the package with the date March 10th would, on March 13th, load the version saved prior to March 1st, while on March 16th it would load the version saved on March 1st.
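
The scenario above can be mimicked with a few lines of base R. This is only an illustrative sketch, not how groundhog resolves versions internally: the commit log, the version labels "A" and "B", the year 2022, the February 10th date of the earlier version, and the helper resolve_version() are all hypothetical and introduced just for this example.

```r
# Hypothetical commit log for one git package (illustration only).
# Version "B" is committed on March 1st but not pushed until March 15th.
commits <- data.frame(
  version   = c("A", "B"),
  committed = as.Date(c("2022-02-10", "2022-03-01")),
  pushed    = as.Date(c("2022-02-10", "2022-03-15")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: which version does a request for `requested`
# resolve to when the script is run on `run_date`?
# Only commits already pushed by `run_date` are visible, and among those
# the most recent one committed on or before `requested` is selected.
resolve_version <- function(requested, run_date, log = commits) {
  requested <- as.Date(requested)
  run_date  <- as.Date(run_date)
  visible <- log[log$pushed <= run_date & log$committed <= requested, ]
  if (nrow(visible) == 0) return(NA_character_)
  visible$version[which.max(as.numeric(visible$committed))]
}

# Same requested date, different run dates, different versions:
resolve_version("2022-03-10", run_date = "2022-03-13")  # "A"
resolve_version("2022-03-10", run_date = "2022-03-16")  # "B"
```

The requested date is identical in both calls; only the day on which the script is run differs, which is exactly the retroactive change described above.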

Groundhog does not currently address this issue, based on a cost-benefit calculation. On the one hand, delays between commit and push times are not likely to be substantial or frequent; for such a delay to coincide with the date chosen by a user, and for the change to break a script, seems sufficiently improbable to ignore. On the other hand, bypassing this problem would require creating a database that periodically records which changes have been pushed to the tens of thousands of packages available on git repositories. This decision may be revised if evidence arises that reproducibility is in fact affected by time differences between committed and pushed changes.