File too large checked in

From genomewiki

FILE TOO LARGE CHECKED IN and HOW TO FIX IT

When I do git push I see this error:

Exceeds file size limit 2200000. 

WHY BIG FILES ARE NOT ALLOWED

The kent repo limits the size of any file being checked in (currently 2.2 MB, that is, 2200000 bytes). The restriction is implemented as a hook in the central shared repo that developers push to. We do not want large files checked in, and many huge test files were removed during the transition from CVS to git. GitHub also has size restrictions that have to be honored. Without this limit, people would find the kent repo excessively bloated and hard to use. This is a repository of source code, which is text and therefore small.

WHY PEOPLE CHECK IN BIG FILES

Developers are encouraged to create a standard tests subdirectory for their kent utilities, so test files get checked in, and unless care is exercised, it is very easy for programmers who deal with giant genomics files to check one in accidentally. Sometimes people also want to check in PDF documents or reasonably sized JPG or PNG images. Please use JPG for camera images, since it compresses better and gives smaller files. PNG uses lossless compression, which produces bigger files; it is good for diagrams and other non-photographic images with a small number of colors. And sometimes people just make a mistake or forget about the file size limit. Avoiding large images is useful to keep in mind for genome-announce emails as well.
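One way to catch oversized files before they are ever added is to scan the working directory for anything above the limit. The following is a minimal sketch and not part of the kent tooling: list_oversized is a hypothetical helper name, and it assumes the hook rejects files strictly larger than the 2200000 bytes shown in the error message.

```shell
# List regular files under a directory that are larger than a byte limit.
# Hypothetical helper; the default of 2200000 is taken from the push error message.
list_oversized() {
    dir=${1:-.}
    limit=${2:-2200000}
    # -size +Nc matches files strictly larger than N bytes.
    # The -prune clause skips git's own storage under .git.
    find "$dir" -path '*/.git' -prune -o -type f -size +"${limit}c" -print
}
```

Running list_oversized . from the top of your working tree before git add prints any files that would be rejected; no output means nothing is over the limit.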

WHY DO I FIND OUT ABOUT IT SO LATE?

When you clone a repo, hooks are not cloned, so there is no easy way to give them to all users. There are some incomplete and limited ways to add a hook that would detect the problem during git commit. We are looking into ways to improve this so that you get an earlier warning about a file being too large.
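As an illustration of what such a local check could look like, here is a sketch of a function suitable for a pre-commit hook. It is not the hook used on the shared repo: check_staged_sizes is a hypothetical name, the limit is assumed to be the 2200000 bytes from the error message, and file names with unusual characters are not handled.

```shell
# Fail if any file staged for commit is larger than a byte limit.
# Hypothetical helper intended to be called from .git/hooks/pre-commit.
check_staged_sizes() {
    limit=${1:-2200000}
    offenders=$(
        # Names of files added, copied, or modified in the index.
        git diff --cached --name-only --diff-filter=ACM |
        while IFS= read -r file; do
            # ":file" names the staged version of the file.
            size=$(git cat-file -s ":$file")
            if [ "$size" -gt "$limit" ]; then
                printf '%s (%s bytes)\n' "$file" "$size"
            fi
        done
    )
    if [ -n "$offenders" ]; then
        printf 'Staged files exceed %s bytes:\n%s\n' "$limit" "$offenders" >&2
        return 1
    fi
    return 0
}
```

To try it, one could save this in an executable .git/hooks/pre-commit file that begins with #!/bin/sh and ends with a call to check_staged_sizes. A non-zero exit status from a pre-commit hook makes git abort the commit.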

WHY IS IT SO HARD TO FIX?

Since git is a powerful source code control system, you might hope that it would easily handle this situation. However, because git builds immutable trees, which are a good thing for so many purposes, removing or changing something requires changing the git history of the branch. We must avoid pushing large files to the main branch of the shared repo. Once a large file gets there, hundreds of users all over the world will pick it up automatically, and there is no way to go around fixing up all of those copies to remove it from their history.

However, git can indeed fix the history of a branch in your local git tree which has not been pushed. And that is what we are going to do here.

FIXING YOUR LOCAL BRANCH WITH LARGE FILE CHECKED IN

To fix your branch, you have to rewrite its history, either by squashing or with some form of git rebase. Otherwise the large file stays in the history and the push keeps failing.

A common case is that a user notices the mistakenly committed large file and either uses git rm to remove it or uses git add to replace it with a smaller version (such as a smaller test file, JPG image, or PDF), followed by git commit. The large file then no longer exists on the tip of their branch. However, it does still exist in the history.

If you have uncommitted work, commit it or use git stash so that your working tree is clean before you start.
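To see that a file removed from the branch tip is still present in the history, you can list the large blobs reachable from your unpushed commits. This is a sketch under assumptions: list_large_blobs is a hypothetical helper name, origin/master..HEAD is assumed to be the range of unpushed commits, and the limit is the 2200000 bytes from the error message.

```shell
# List blobs larger than a byte limit that are reachable from a commit range.
# Hypothetical helper; prints "<size> <path>" for each offending blob.
list_large_blobs() {
    range=${1:-origin/master..HEAD}
    limit=${2:-2200000}
    # rev-list --objects prints "<sha> <path>" for every object in the range;
    # cat-file --batch-check adds each object's type and size.
    git rev-list --objects "$range" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$limit" '$1 == "blob" && $2 + 0 > limit + 0 {
        size = $2
        sub(/^blob [0-9]+ /, "")   # leave only the path in $0
        print size, $0
    }'
}
```

If list_large_blobs prints a file that you have already deleted from your working tree, that file still lives in one of your unpushed commits and the push will still be rejected.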

git add someFile   # this is often a good choice.
git commit

or

git stash  # only if needed


REMOVE THE BIG FILES or REPLACE WITH SMALLER ONES

If you have not removed or replaced the large file already, you can do this:

 git rm someFileLarge

or edit the large file to reduce its size and re-add it

 git add someFileNowSmaller

follow up with the usual

 git commit

So now there is no large file on your branch tip.


SQUASH?

Git is smart enough to leave out large files that no longer exist at the branch tip when it does the squash, because a squash records only the final state of the branch.

People often squash a development branch anyway, since having just one big commit makes code review easier. If, on the other hand, you know you do not want to squash and want to keep all the separate commits, skip ahead to the GIT REBASE section.

IF YOUR LARGE FILE IS ON A DEV BRANCH

 git checkout master

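As usual, you may have to resolve conflicts during the merge. Note that git merge --squash only stages the combined changes, so follow it with git commit.

 git merge --squash myDevBranch
 git commit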

Rename the squashed branch so you know it was done

 git branch -m myDevBranch myDevBranchSquashed

Eventually, you will need to delete myDevBranchSquashed to recover its space if you care.

IF YOUR LARGE FILE IS ON MASTER BRANCH

Turn your master branch into a dev branch, then create a new master branch from origin/master and squash the old one onto it. Only do this if it makes sense.

git fetch  # update origin/master
git branch -m master tempMaster
git branch master origin/master

Look at .git/config to fix master branch tracking if needed.

git checkout master
git merge --squash tempMaster
git commit   # merge --squash only stages the changes
git push

After a few days, you can delete tempMaster if you do not need it. This should also allow git garbage collection to eventually clean that large file out of your own local repo.

git branch -D tempMaster
 

The benefit of SQUASH is that it is simple and you are done.

The disadvantage is that you lose your commit history, and all those changes become one big commit on the master branch. For many users this is just fine.
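The effect of the squash can be tried out safely in a throwaway repository. The sketch below is only a demonstration: squash_demo, big.bin, code.c, and the commit messages are made-up names, and the whole thing runs in a temporary directory that is deleted at the end.

```shell
# Demonstrate that a squash merge leaves a deleted large file out of history.
# Runs in a subshell inside a temporary repository; nothing here touches your real repo.
squash_demo() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    git config user.email dev@example.com
    git config user.name "Dev Example"
    printf 'hello\n' > code.c
    git add code.c
    git commit -q -m "base"
    main=$(git symbolic-ref --short HEAD)   # master or main, depending on git config

    git checkout -q -b myDevBranch
    head -c 3000000 /dev/zero > big.bin     # over the 2200000-byte limit
    git add big.bin
    git commit -q -m "Oops large file"
    printf 'more\n' >> code.c
    git rm -q big.bin
    git add code.c
    git commit -q -m "Remove large file, keep real change"

    git checkout -q "$main"
    git merge --squash myDevBranch >/dev/null
    git commit -q -m "Squashed myDevBranch"

    # Count the commits in each history that add, change, or delete big.bin.
    printf 'commits touching big.bin on myDevBranch: %s\n' \
        "$(git rev-list --count myDevBranch -- big.bin)"
    printf 'commits touching big.bin on %s after squash: %s\n' \
        "$main" "$(git rev-list --count HEAD -- big.bin)"
    cd /
    rm -rf "$repo"
)
```

The dev branch reports two commits touching big.bin (the one that added it and the one that removed it), while the squashed branch reports none, which is why the squashed result can be pushed.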


GIT REBASE

Use the squash method (see above) if that works for you.

But otherwise, use git rebase. It will preserve your individual commits and their messages and history if that is important to you.

Git rebase is our friend for crises like this. But it has to be used properly.

HAVE YOU MERGED FROM MASTER?

In particular, you may have merged from master before you noticed the large file error message while pushing. In that case your branch could easily contain dozens of your own commits interleaved with hundreds of commits made by other people, pulled in from the master branch that carries commits from the entire team. It might even be months since you last pushed successfully, with several pulls in between.

So if you have done even one merge from master before you discovered the problem, which is pretty common, then you should proceed with GIT REBASE TO TIP.

If you are ABSOLUTELY certain that you have NOT run git pull even once on your problem branch, then skip this step and go ahead to the GIT REBASE INTERACTIVE section.


GIT REBASE TO TIP

Rebasing your entire branch onto the tip of the master branch is super useful here because it automatically gathers all of your commits together and puts them at the tip of the branch. This gets rid of the merge commits from master and simplifies the history. Note that this is just the first step and does not fix the large file issue itself.

The rebase-to-tip avoids a big problem that you would otherwise have with git rebase interactive, since there could be hundreds of commits made by others from those earlier pulls from master. Sadly, git rebase makes you handle merge conflicts again. But if all of your commits are gathered together at the end, you are looking at, say, 7 of your own commits rather than 806 commits made by dozens of people working on code that you did not touch, know nothing about, and are in no position to resolve conflicts in. So putting just your own commits all together at the master tip avoids having to rebase and resolve conflicts through everybody else's work.

Because master is used so commonly, that is what appears here in our example, but it should be easy for developers to adapt this if needed to another branch.

Check out master if you are not already on it, or check out your dev branch if that is the one in need of repair.

git checkout master   # or your dev branch
git fetch  # update origin/master
git rebase origin/master   # this is the magic.

If you get conflicts, you must resolve them. Yes, it is a minor pain, and you think, hey, I already resolved some of these earlier, why do I have to do it again? But rebase is not smart enough to do that for you. We are only doing this because we had no other way to fix the large file issue. Just be glad you do not have to re-do conflicts for other users too. Sometimes you get lucky and the merges are simple.

vi conflicted-file    # resolve conflicts by editing carefully
git add conflicted-file
git rebase --continue

You can use this if something goes horribly wrong:

git rebase --abort

Sometimes the rebase may stop on an empty commit where nothing happened or whose changes were optimized out. Just run this to skip it and proceed.

git rebase --skip

Now all of your commits are together at the tip, and they have not been pushed to master yet of course.


GIT REBASE INTERACTIVE

Look at the history; all your commits should be at the tip. Stop if they are not. You should not see even one merge.

Find how far back your unpushed commits go. Then use the sha hash id of the parent of your oldest unpushed commit, which is often the same as the value of origin/master. The goal here is to identify which commits contain the bad large files that you do not want. We need to remove those files from those commits, as if they had never been added.
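The "no merges" condition can also be checked mechanically rather than by eye. This is a small sketch: count_unpushed_merges is a hypothetical helper name, and origin/master is assumed to be the ref your branch was rebased onto.

```shell
# Count merge commits among the commits on HEAD that are not on the upstream ref.
# Hypothetical helper; a result of 0 means the unpushed history is linear.
count_unpushed_merges() {
    upstream=${1:-origin/master}
    git rev-list --count --merges "$upstream"..HEAD
}
```

If count_unpushed_merges prints anything other than 0, stop and redo the rebase to tip before going on. Running git log --oneline --graph origin/master..HEAD shows the same commits for a visual check.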


Find the common ancestor.

git merge-base master origin/master  # can use some dev branch instead of master.

This sha hash id of the common ancestor commit serves as our shaHashIdOnto value.

Save the log of your commits for use below. Confirm that these are the correct commits and that they show the bad large files.

git log --stat shaHashIdOnto..HEAD > myCommits.txt  # --stat lists the files in each commit, including the too-large ones

Look at myCommits.txt; you can refer to it later as needed. If you see commits that are NOT your work, stop, something is wrong. It should not have any merges in it. If you see that not all of your commits are there, stop, something is wrong.

git rebase -i shaHashIdOnto    # Run this only after confirming it is the correct value.

The rebase command is going to pop up an editor with a list of commits, each with the default action "pick". It should contain the full list of all your unpushed commits on this branch. It should not show commits made by other people; they should all be your work. It should not have any merges in it. If it has the wrong stuff, abort the rebase (see below).

pick f26dd66 Oops large file
pick ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep

If you have a line for a commit that is no longer needed, for example, the only thing in that commit was the large file that you are trying to get rid of, then simply delete that line. Then the commit will simply be removed and disappear from the rebase result.

If the commit contains the large file but also other stuff you want to keep, change it from "pick" to "edit". The rebase will stop at that commit and let you edit it.

In this example, we delete the first entry and change the second to edit, then save and quit the editor. The list now looks like this:

edit ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep

When the rebase stops at "Oops large file and other stuff to keep.", remove the offending file from the index.

git rm --cached someLargeFile

Amend the commit; -C HEAD instructs git to reuse the old commit message.

git commit --amend -C HEAD

Finally, git rebase --continue goes ahead with the rest of the rebase operation.

git rebase --continue

If all else fails, you can abort the rebase:

git rebase --abort
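Once the rebase has finished, it can be worth confirming that the large file really is gone from your unpushed commits before you try to push. The following is a sketch: unpushed_commits_touching is a hypothetical helper name, and origin/master is assumed to be the branch you push to.

```shell
# Print the unpushed commits that still add, change, or delete a given file.
# Hypothetical helper; empty output means the file no longer appears
# anywhere in the unpushed history.
unpushed_commits_touching() {
    file=$1
    upstream=${2:-origin/master}
    git log --oneline "$upstream"..HEAD -- "$file"
}
```

For example, unpushed_commits_touching someLargeFile should print nothing after a successful cleanup. Any commit it does print still needs to be edited or dropped in another interactive rebase.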


FOLLOWUP

Finally, with no large file left in the branch history, we can push to the shared repo. This is the whole reason we did all that work. (If you repaired a dev branch, you will probably do something else here.)

git push   # if others pushed since your last update, you may have to git pull first.

If you earlier used git stash to put something aside, you can now restore that uncommitted work:

git stash pop   # ONLY if you saved it aside with git stash earlier, and it makes sense.