File too large checked in

From genomewiki
Revision as of 01:55, 25 April 2021 by Galt

FILE TOO LARGE CHECKED IN and HOW TO FIX IT

When I do git push I see this error:

Exceeds file size limit 2200000. 

WHY BIG FILES ARE NOT ALLOWED

The kent repo has a limit (currently 2.2 MB) on the size of files being checked in. The restriction is implemented as a hook in the central shared repo that developers push to. We never wanted large files checked in, and during the transition from CVS to git many huge test files were removed. GitHub also has size restrictions that have to be honored. Without this limit, people would find the kent repo excessively bloated and hard to use. This is a repository of source code text, which is small.

WHY PEOPLE CHECK IN BIG FILES

Because developers are encouraged to make a standard tests subdirectory for their kent utilities, there are testing files which get checked in, and unless care is exercised, it is very easy for programmers who deal with giant genomics files to accidentally check them in. Also, sometimes people want to check in PDF documents and some reasonably sized JPG or PNG images. Please use JPG for camera images, since it compresses them better and gives a smaller file. PNG uses lossless compression, which produces bigger files, and is good for diagrams and other non-photographic things with a small number of colors. And sometimes people just make a mistake, or forget about the limit.

WHY DO I FIND OUT ABOUT IT SO LATE?

When you clone a repo, hooks are not cloned, so there is no easy way to give them to all users. There are some incomplete and limited ways to add a hook that would detect the problem during git commit. We are looking into ways to improve this so you can get an earlier warning about a file being too large.
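As an illustration of one of those limited ways, here is a minimal sketch of a local pre-commit hook. It is not the hook used on the central repo. The function name check_staged_sizes is made up for this example, and the 2200000 byte limit is simply copied from the error message above. Each developer would have to install it by hand in their own clone.

```shell
#!/bin/sh
# Hypothetical local hook: save as .git/hooks/pre-commit and chmod +x it.
# check_staged_sizes LIMIT
# Returns 1 and lists the offenders if any staged file is over LIMIT bytes.
check_staged_sizes() {
    limit=$1
    too_big=$(
        # Paths added, copied, or modified in the index.
        git diff --cached --name-only --diff-filter=ACM |
        while IFS= read -r path; do
            # ":path" names the staged version of the file.
            size=$(git cat-file -s ":$path")
            if [ "$size" -gt "$limit" ]; then
                echo "$path ($size bytes)"
            fi
        done
    )
    if [ -n "$too_big" ]; then
        echo "Staged files over the $limit byte limit:" >&2
        echo "$too_big" >&2
        return 1
    fi
    return 0
}

# Only run the check when this file is actually invoked as the hook.
case "$0" in
    *pre-commit) check_staged_sizes 2200000 || exit 1 ;;
esac
```

This only catches the problem at git commit time on machines where the hook was installed. The server side check still has the final say.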

WHY IS IT SO HARD TO FIX

Since git is a powerful source code control system, you might hope that it would easily handle this situation. However, because git builds immutable trees, which are a good thing for so many purposes, removing something or changing it requires changing the git history of the branch. We must avoid pushing large files to the shared repo main branch. Once it goes there, hundreds of users all over the world will pick it up automatically, and there is no way to go around fixing up all of those copies to remove large files from their history.

However, git can indeed fix the history of a branch in your local git tree which has not been pushed. And that is what we are going to do here.

FIXING YOUR LOCAL BRANCH WITH LARGE FILE CHECKED IN

In order to fix your branch, you are going to have to use some form of git rebase on it, otherwise, it could never be fixed.

A common case is where a user realizes the mistaken large file, and uses git rm to remove it, or uses git add to replace it with a smaller version of the file, such as a test file or jpg image or pdf, and git commit. So the large file no longer exists on the tip of their branch. However, it does exist in the history.
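To see for yourself that the large file is still in the history even though it is gone from the tip, something like the following sketch can help. The function name list_large_blobs is made up for this example, and the range and limit shown in the usage note are assumptions based on the rest of this page.

```shell
# list_large_blobs RANGE LIMIT
# Prints "size path" for every blob reachable from RANGE that is
# bigger than LIMIT bytes.
list_large_blobs() {
    range=$1
    limit=$2
    # rev-list --objects prints "sha path" for each tree and blob.
    git rev-list --objects "$range" |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v limit="$limit" '
        $1 == "blob" && $3 > limit {
            size = $3
            sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")   # keep only the path
            print size, $0
        }'
}
```

For example, list_large_blobs origin/master..HEAD 2200000 would list oversized files that exist anywhere in your unpushed commits. If it prints nothing, the push should not trip the size limit.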

As usual with all of this stuff, if you have uncommitted work, commit it or use git stash to clean up your repo before you start.

git add -u     # stage changes to tracked files; committing is often a good choice.
git commit

or

git stash  # only if needed

SQUASH?

If you were going to squash your development branch anyway, you can just merge --squash it into the master branch. The squashed result contains only the final state of your branch, so a large file that no longer exists at the tip is left out.

 git checkout master

As usual, you may have to resolve conflicts during the merge.

 git merge --squash myDevBranch
 git commit   # merge --squash stages the result but does not commit it

If your changes were on master, and not on a dev branch, turn your master branch into a dev branch, and then create a new master branch, and squash that onto it. Only do this if it makes sense.

 git fetch  # update origin/master
 git branch -m master tempMaster
 git branch master origin/master

Look at .git/config to fix master branch tracking if needed.

 git checkout master
 git merge --squash tempMaster
 git commit   # merge --squash stages the result but does not commit it
 git push

After a few days, you can delete tempMaster if you no longer need it. This should also allow git garbage collection to eventually clean the large file out of your own local repo.

 git branch -D tempMaster
 

The benefit of SQUASH is that it is simple and you are done.

The disadvantage is that you lose your commit history, and all those changes just became one big commit on master branch. This is just right for many users.


GIT CHERRY-PICK?

NOT RECOMMENDED. If you only have a handful of commits, and you know which ones they are, you can try to use this method. It is tedious. You would have to use git log to find which specific commits need to be saved. You might have to turn the master branch into a dev or temp branch as above, create a new master, and then pick specific commits from the temp branch onto master. You may still need to do a git rebase -i if you cannot make the large file go away simply by skipping a no longer needed commit or two.

GIT REBASE

Use the squash method (see above) if that works for you.

But otherwise, use git rebase.

Git rebase is our friend for crises like this. But it has to be used properly.

HAVE YOU MERGED FROM MASTER?

In particular, you may have merged from master before you noticed the large file error message during a later push. In that case your branch could easily contain dozens of your own commits plus hundreds of commits made by other people, pulled in from the master branch, which has commits from the entire team. It might even be months since you last pushed successfully, and you may have pulled several times since then.

So if you have done even one merge from master before you discovered the problem, which is pretty common, then you should proceed with GIT REBASE TO TIP.

If you are ABSOLUTELY certain that you have NOT git pulled even once on your problem branch, then skip this step and go ahead to the GIT REBASE INTERACTIVE section.
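If you are not certain, one way to check is to count the merge commits among your unpushed commits. This is only a sketch. The function name count_merge_commits is made up for this example, and the usage note assumes that origin/master is the branch you push to.

```shell
# count_merge_commits RANGE
# Prints the number of merge commits in RANGE.
count_merge_commits() {
    git rev-list --merges --count "$1"
}
```

After a git fetch, count_merge_commits origin/master..HEAD printing 0 suggests that your unpushed history is linear. Any other number means that you merged at some point, so GIT REBASE TO TIP is the safer path.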

GIT REBASE TO TIP

Rebasing your entire branch onto the tip of the master branch is super useful here because it automatically gathers all of your commits together and puts them at the tip of the branch. This gets rid of the merge commits from master, and simplifies the history. Note that this is just the first step, and does not fix the large file issue itself.

The rebase-to-tip avoids a big problem that you would otherwise have with git rebase interactive, since there could be hundreds of commits made by others from those pulls from master you did earlier. Sadly, git rebase makes you handle merge conflicts, but at least if all of your commits are gathered together at the end, you are looking at 7 of your own commits rather than 806 commits made by dozens of people working on code that you did not touch, know nothing about, and are in no position to resolve merge conflicts in. So putting just your own commits all together at the master tip totally avoids having to rebase and resolve conflicts through everybody else's work.

Because master is used so commonly, that is what appears here in our example, but it should be easy for developers to adapt this if needed to another branch.

Check out master if you are not already on it, or check out your dev branch if that is the one in need of repair.

git checkout master   # or your dev branch
git fetch  # update origin/master
git rebase origin/master   # this is the magic.

If you get conflicts, you must resolve them. Yes, it is a minor pain, and you think, hey, I already resolved some of these earlier, why do I have to do it again? But rebase is not smart enough to do that for you. We are only doing this because we had no other way to fix the large file issue. Just be glad you do not have to re-do conflicts for other users too. Sometimes you get lucky and the merges are simple.

vi conflicted-file    # resolve conflicts by editing carefully
git add conflicted-file
git rebase --continue

You can use this if something goes horribly wrong:

git rebase --abort

Sometimes it may get stuck on an empty commit where nothing happened, or where the change was optimized out. Just run this to skip it and proceed.

git rebase --skip

Now all of your commits are together at the tip, and they have not been pushed to master yet of course.
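A quick way to confirm this, and to see which of your commits touched the large file, is sketched below. The function names show_tip_commits and commits_touching are made up for this example, and origin/master as the default upstream is an assumption.

```shell
# show_tip_commits [UPSTREAM]
# Lists the commits on your branch that are not yet in UPSTREAM.
show_tip_commits() {
    upstream=${1:-origin/master}
    git log --oneline "$upstream..HEAD"
}

# commits_touching UPSTREAM PATH
# Lists only the unpushed commits that added, changed, or removed PATH.
commits_touching() {
    upstream=$1
    path=$2
    git log --oneline "$upstream..HEAD" -- "$path"
}
```

After the rebase, show_tip_commits should list only your own commits. The commits printed by commits_touching for the large file's path are the ones that will need attention in the next section.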


GIT REBASE INTERACTIVE

git rebase -i shaHashId    # this must be very carefully chosen.
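This page does not spell out the interactive steps, so what follows is only a sketch of one common approach, not necessarily the procedure intended here. It assumes that you already rebased to the tip, so that origin/master can serve as shaHashId. The helper name strip_file_from_stopped_commit is made up for this example.

```shell
# Assumed manual steps:
#   1. git rebase -i origin/master
#   2. In the editor, change "pick" to "edit" on the commit that added
#      the large file. If that commit added nothing else, change it to
#      "drop" instead and skip step 3.
#   3. When the rebase stops at that commit, remove the file from it
#      (the helper below does this), then run: git rebase --continue

# strip_file_from_stopped_commit PATH
# Removes PATH from the commit at HEAD and rewrites that commit in place.
strip_file_from_stopped_commit() {
    path=$1
    git rm --quiet -- "$path" &&
    git commit --quiet --amend --no-edit
}
```

A later commit that removed or replaced the same file may then come up empty or conflict. Handle that with git rebase --skip or by resolving the conflict, as described in the section above.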


FOLLOWUP

Finally, with no large file left in the branch history, we can push to the shared repo. This is the whole reason we did all that work. (If you repaired a dev branch, you will probably do something else here.)

git push   # of course if others pushed since your last update, you may have to git pull.

If you used git stash earlier to put something aside, you can now restore the uncommitted work:

git stash pop   # ONLY if you saved it aside with git stash earlier, and it makes sense.