Intro
As we come upon the start of group projects and student-created visualizations, there is the inevitable issue of a group committing a data set that is too large into their repository. Generally, this means a push to GitHub is treated as problematic when files are:
- Over 50 MB: the push succeeds but triggers a warning.
- Over 100 MB: the push is rejected with an error.
Similarly, a repository is flagged as problematic if its total size exceeds 1 GB.
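Before pushing, it can help to check whether any blob in the repository's history is near these limits. A minimal sketch using only stock git plumbing (run inside any working copy; the count of five is arbitrary):

```shell
# List the five largest blobs anywhere in the repository's history.
# Sizes are in bytes; compare against GitHub's 50 MB / 100 MB limits.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $4 }' |
  sort -rn |
  head -n 5
```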
Workflow
There are two workflows given by GitHub that seek to address the “large file” or “sensitive data” problem:
- Removing a file using git rm and git commit --amend
- Removing sensitive data from a repository using bfg or git filter-branch
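The first workflow only helps when the oversized file was added in the most recent, not-yet-pushed commit. A sketch of that case (big-data.csv is a placeholder name):

```shell
# Stop tracking the oversized file, but keep it on disk.
git rm --cached big-data.csv
# Ignore it so it cannot be re-added by accident.
echo "big-data.csv" >> .gitignore
git add .gitignore
# Rewrite the last commit without the file, keeping its message.
git commit --amend --no-edit
```

Once the file appears in older commits, amending the tip is no longer enough, which is where the BFG Repo Cleaner comes in.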
Neither approach, as documented, allowed for a quick clean-up of a repository. However, the BFG Repo Cleaner highlighted by the second workflow was insightful: with a few additional flags, it quickly rewrote the offending commits and returned the repos to a functional state.
For future use, the workflow is given as:
# Download the BFG Repo Cleaner to working directory
wget http://repo1.maven.org/maven2/com/madgag/bfg/1.13.0/bfg-1.13.0.jar
# Clone the repository to a new directory
git clone git@github.com:org-name/sample-repo.git sample-repo
# Remove all files greater than 100M in sample-repo,
# with blob file protection disabled
java -jar bfg-1.13.0.jar --no-blob-protection --strip-blobs-bigger-than 100M sample-repo
# Change into the repo folder after purging
cd sample-repo
# Strip out the unwanted dirty data
git reflog expire --expire=now --all && git gc --prune=now --aggressive
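After the history rewrite, the remote still holds the old commits, so the cleaned branches have to be force-pushed (this assumes you have push access, and that teammates re-clone or hard-reset their copies afterwards):

```shell
# Overwrite every branch and tag on the remote with the cleaned history.
git push origin --force --all
git push origin --force --tags
```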
Common Issues
Did the BFG Repo Cleaner warn that your repo may need to be re-packed, with a message about “large blobs”?
Using repo : /cloud/project/.git
Scanning packfile for large blobs completed in 16 ms.
Warning : no large blobs matching criteria found in packfiles - does the repo need to be packed?
The BFG only scans packfiles, so blobs still stored as loose objects are missed. To repack the repository, run:
git gc
then re-run the BFG command above.
Appendix
Sample output of a successful clean run will be:
Using repo : /cloud/project/.git
Scanning packfile for large blobs: 30
Scanning packfile for large blobs completed in 56 ms.
Found 1 blob ids for large blobs - biggest=120226125 smallest=120226125
Total size (unpacked)=120226125
Found 0 objects to protect
Found 4 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/origin/HEAD, refs/remotes/origin/master
Protected commits
-----------------
You're not protecting any commits, which means the BFG will modify the
contents of even *current* commits.
This isn't recommended - ideally, if your current commits are dirty, you
should fix up your working copy and commit that, check that your build still
works, and only then run the BFG to clean up your history.
Cleaning
--------
Found 5 commits
Cleaning commits: 100% (5/5)
Cleaning commits completed in 375 ms.
Updating 1 Ref
--------------
Ref Before After
---------------------------------------
refs/heads/master | c2da9bdd | a732bc9c
Updating references: 100% (1/1)
...Ref update completed in 18 ms.
Commit Tree-Dirt History
------------------------
Earliest Latest
| |
. . D D D
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
Before After
-------------------------------------------
First modified commit | b56cbdde | 6056ad01
Last dirty commit | c2da9bdd | a732bc9c
Deleted files
-------------
Filename Git id
----------------------------------
psam_p17 | d8be4dfb (114.7 MB)
psam_p17.csv | d8be4dfb (114.7 MB)
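After the clean, a quick pass/fail check that no blob over GitHub's 100 MB hard limit survives anywhere in history (a sketch using stock git plumbing; it exits non-zero if an offending blob remains):

```shell
# Succeeds (exit 0) only if no blob in history exceeds 100 MB.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" && $2 > 100 * 1024 * 1024 { found = 1 } END { exit found }'
```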