File too large checked in: Difference between revisions
Revision as of 15:56, 26 April 2021
FILE TOO LARGE CHECKED IN and HOW TO FIX IT
When I do git push I see this error:
Exceeds file size limit 2200000.
WHY BIG FILES ARE NOT ALLOWED
The kent repo has a limit (currently 2.2 MB) on the size of files being checked in. The restriction is implemented as a hook in the central shared repo that developers push to. We did not want large files to be checked in, and during the transition from CVS to git, many huge test files were removed. GitHub also has size restrictions that have to be honored. Without this size restriction, people would find the kent repo excessively bloated and hard to use. This is a repository of source code text, which is small.
WHY PEOPLE CHECK IN BIG FILES
Because developers are encouraged to make a standard tests subdirectory for their kent utilities, testing files get checked in, and unless care is exercised, it is very easy for programmers who deal with giant genomics files to check them in accidentally. Also, sometimes people want to check in PDF documents and some reasonably sized JPG or PNG images. Please use JPG for camera images, since it compresses better and gives a smaller file. PNG uses lossless compression, which produces bigger files, and is good for diagrams and other non-photographic images with a small number of colors. And sometimes, people just make a mistake, or forget about the file size limit. Avoiding large images is also useful to keep in mind for genome-announce emails (see the New track checklist on this wiki).
WHY DO I FIND OUT ABOUT IT SO LATE?
When you clone a repo, hooks are not cloned, so there is no easy way to give them to all users. There are some incomplete and limited ways to add a hook that would detect the problem during git commit. We are looking into ways to improve this so you could get an earlier warning about a file being too large.
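As a rough illustration of what such a local check might look like, here is a minimal sketch. It is not the hook used by the central repo. The function name find_oversized_staged is hypothetical, and the limit is assumed to be the 2200000 bytes shown in the push error. Paths with unusual characters are not handled.

```shell
#!/bin/sh
# Hypothetical local check: list staged files whose size exceeds a limit.
# LIMIT mirrors the 2200000-byte figure from the push error (an assumption).
LIMIT=${LIMIT:-2200000}

find_oversized_staged() {
    # --diff-filter=AM: only files added or modified in the index
    git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
        # ":path" names the staged (index) version of the file
        size=$(git cat-file -s ":$path")
        if [ "$size" -gt "$LIMIT" ]; then
            echo "$path $size"
        fi
    done
}
```

To try it as a hook, you could call this function from .git/hooks/pre-commit in your own clone and exit with a non-zero status whenever it prints anything.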
WHY IS IT SO HARD TO FIX
Since git is a powerful source code control system, you might hope that it would easily handle this situation. However, because git builds immutable trees, which are a good thing for so many purposes, removing something or changing it requires changing the git history of the branch. We must avoid pushing large files to the shared repo main branch. Once it goes there, hundreds of users all over the world will pick it up automatically, and there is no way to go around fixing up all of those copies to remove large files from their history.
However, git can indeed fix the history of a branch in your local git tree which has not been pushed. And that is what we are going to do here.
FIXING YOUR LOCAL BRANCH WITH LARGE FILE CHECKED IN
To fix your branch, you are going to have to rewrite its history, either with a squash or with some form of git rebase. Otherwise, it can never be fixed.
A common case is where a user notices the mistakenly committed large file and uses git rm to remove it, or uses git add to replace it with a smaller version of the file (such as a test file, JPG image, or PDF), followed by git commit. The large file then no longer exists on the tip of their branch. However, it does still exist in the history.
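To see which large files are still present in the history, something like the following sketch may help. The function name list_large_blobs and the default range origin/master..HEAD are illustrative assumptions, not part of the kent tooling.

```shell
#!/bin/sh
# Sketch: list blobs over a size limit that are reachable from a commit range.
# LIMIT mirrors the 2200000-byte figure from the push error (an assumption).
LIMIT=${LIMIT:-2200000}

list_large_blobs() {
    range=${1:-origin/master..HEAD}
    # rev-list --objects prints "<sha> <path>" for each object in the range;
    # cat-file --batch-check reports type and size, and %(rest) carries the path.
    git rev-list --objects "$range" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$LIMIT" '$1 == "blob" && $2 + 0 > limit + 0 {
        size = $2
        sub(/^blob [0-9]+ /, "")   # leave only the path, which may contain spaces
        print size, $0
    }'
}
```

Running list_large_blobs with no argument checks your unpushed commits on the current branch. A file shows up here even if you have already removed it from the branch tip.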
If you have uncommitted changes, commit them or use stash to clean up your repo for action.
git add someFile # this is often a good choice.
git commit
or
git stash # only if needed
REMOVE THE BIG FILES or REPLACE WITH SMALLER ONES
If you have not removed or replaced the large file already, you can do this:
git rm someFileLarge
or edit the large file to reduce its size and re-add it
git add someFileNowSmaller
follow up with the usual
git commit
So now there is no large file on your branch tip.
SQUASH?
The system is smart enough to skip large files that no longer exist when it does the squash.
People often squash their development branch anyway, which makes code review easier since it is just one big commit. If, on the other hand, you know you do not want to squash and want to keep all the separate commits, skip ahead to the GIT REBASE section.
IF YOUR LARGE FILE IS ON A DEV BRANCH
git checkout master
As usual, you may have to handle git conflicts during any merge.
git merge --squash myDevBranch
The squash merge only stages the combined changes, so commit them:
git commit
Rename the squashed branch so you know it was done
git branch -m myDevBranch myDevBranchSquashed
Eventually, you will need to delete myDevBranchSquashed to recover its space if you care.
IF YOUR LARGE FILE IS ON MASTER BRANCH
Turn your master branch into a dev branch, then create a new master branch and squash the old one onto it. Only do this if it makes sense.
git fetch # update origin/master
git branch -m master tempMaster
git branch master origin/master
Look at .git/config to fix master branch tracking if needed.
git checkout master
git merge --squash tempMaster
git commit # merge --squash only stages the changes, so commit before pushing
git push
After a few days, you can delete tempMaster if you do not need it. This should also allow git garbage collection to clean that large file from your own local repo.
git branch -D tempMaster
The benefit of SQUASH is that it is simple and you are done.
The disadvantage is that you lose your commit history, and all those changes just became one big commit on master branch. This is just right for many users.
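The following throwaway demo sketches why the squash gets rid of the large file. It builds a temporary repo, so nothing in your real work tree is touched. All file names, branch contents, and commit messages are made up for illustration, and the 2200000 figure is taken from the push error.

```shell
#!/bin/sh
# Throwaway demo in a temporary repo; file names and messages are made up.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.email dev@example.com
git config user.name "Demo Dev"

echo "code" > util.c
git add util.c
git commit -q -m "initial"

# Dev branch: add a big file, then remove it again in a later commit.
git checkout -q -b myDevBranch
head -c 3000000 /dev/zero > big.dat
git add big.dat
git commit -q -m "Oops large file"
git rm -q big.dat
echo "more code" >> util.c
git add util.c
git commit -q -m "real work"

# Squash onto master. merge --squash only stages the result, so commit it.
git checkout -q master
git merge -q --squash myDevBranch
git commit -q -m "squashed myDevBranch"

# Count blobs over the limit that are reachable from master.
squash_large_count=$(git rev-list --objects master |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" && $2 > 2200000' | wc -l | tr -d ' ')
echo "large blobs reachable from master: $squash_large_count"
```

The squashed commit on master only records the final state of the dev branch, so the intermediate commit that held big.dat never becomes part of master's history. The big file remains reachable from myDevBranch until that branch is deleted.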
GIT REBASE
Use the squash method (see above) if that works for you.
But otherwise, use git rebase. It will preserve your individual commits and their messages and history if that is important to you.
Git rebase is our friend for crises like this. But it has to be used properly.
HAVE YOU MERGED FROM MASTER?
In particular, if you merged from master before you noticed the large file error during a push, your branch could easily contain dozens of your own commits plus hundreds of commits made by other people, pulled in from the master branch that the entire team commits to. It might even be months since you last pushed successfully, with several pulls in between.
So if you have done even one merge from master before you discovered the problem, which is pretty common to happen, then you should proceed with GIT REBASE TO TIP.
If you are ABSOLUTELY certain that you have NOT done a git pull even once on your problem branch, then skip this step and go ahead to the GIT REBASE INTERACTIVE section.
GIT REBASE TO TIP
Rebasing your entire branch onto the tip of the master branch tree is super useful here because it automatically gathers all of your commits together and puts them at the tip of the branch. This gets rid of the merge commits from master and simplifies the history. Note that this is just the first step, and does not fix the large file issue itself.
The rebase-to-tip avoids a big problem that you would otherwise have with git rebase interactive, since there could be hundreds of commits made by others from those pulls from master you did earlier. Sadly, git rebase makes you handle merge conflicts, but at least if all of your commits are gathered together at the end, you are looking at 7 of your own commits rather than 806 commits made by dozens of people working on code that you did not touch, know nothing about, and are in no position to resolve merge conflicts in. So putting just your own commits all together at the master tip totally avoids having to rebase and resolve conflicts through everybody else's work.
Because master is used so commonly, that is what appears here in our example, but it should be easy for developers to adapt this if needed to another branch.
Do this if you are not already on master or use a dev branch if that is in need of repair.
git checkout master # or your dev branch
git fetch # update origin/master
git rebase origin/master # this is the magic.
If you get conflicts, you must resolve them. Yes, it is a minor pain, and you think, hey, I already resolved some of these earlier, why do I have to do it again? But rebase is not smart enough to do that for you. We are only doing this because we had no other way to fix the large file issue. Just be glad you do not have to re-do conflicts for other users too. Sometimes you get lucky and the merges are simple.
vi conflicted-file # resolve conflicts by editing carefully
git add conflicted-file
git rebase --continue
You can use this if something goes horribly wrong:
git rebase --abort
Sometimes the rebase may get stuck on an empty commit, where nothing happened or the change was optimized out. Run this to skip it and proceed.
git rebase --skip
Now all of your commits are together at the tip, and they have not been pushed to master yet of course.
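The effect of the rebase-to-tip can be sketched in a throwaway repo. In this demo a local branch named upstream stands in for origin/master, which is an assumption made so the example needs no remote. All file names and commit messages are made up.

```shell
#!/bin/sh
# Throwaway demo: the local branch "upstream" stands in for origin/master.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.email dev@example.com
git config user.name "Demo Dev"

echo base > base.txt;   git add base.txt;  git commit -q -m "base"
git branch upstream

echo mine1 > mine1.txt; git add mine1.txt; git commit -q -m "my commit 1"

git checkout -q upstream
echo team1 > team1.txt; git add team1.txt; git commit -q -m "team commit 1"

# Merging the team's work into master is what a git pull would have done.
git checkout -q master
git merge -q -m "Merge upstream into master" upstream
echo mine2 > mine2.txt; git add mine2.txt; git commit -q -m "my commit 2"

git checkout -q upstream
echo team2 > team2.txt; git add team2.txt; git commit -q -m "team commit 2"
git checkout -q master

before=$(git rev-list --merges --count upstream..master)
git rebase -q upstream
after=$(git rev-list --merges --count upstream..master)

echo "merge commits before: $before, after: $after"
git log --oneline upstream..master        # only the two "my commit" entries remain
```

After the rebase, the range upstream..master holds just your own two commits in a straight line, which is the shape the interactive rebase in the next section expects.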
GIT REBASE INTERACTIVE
Look at the history, all your commits should be at the tip. Stop if not. You should not even see as much as one merge.
Find how far back your unpushed commits go. Then use the SHA hash id of the parent of your commits, which is often the same as the value in origin/master. The goal here is to focus on the commits that contain the bad large files you do not want. We need to remove those files from those commits, as if they had never been added.
Find common ancestor.
git merge-base master origin/master # can use some dev branch instead of master.
The SHA hash id of this common ancestral commit serves as our shaHashIdOnto value.
Save the output for use below and confirm these are your correct commits, and show the bad large files.
git log --stat shaHashIdOnto..HEAD > myCommits.txt # which will have the too large files
Look at myCommits.txt; you can refer to it later as needed. If you see commits that are NOT your work, stop, something is wrong. It should not have any merges in it. If you see that not all of your commits are there, stop, something is wrong.
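The checks above can also be partly scripted. This is a minimal sketch, and check_range is a hypothetical helper name. Pass it the shaHashIdOnto value. It reports how many merge commits are in the range and which authors appear, and it returns a non-zero status if any merges remain.

```shell
#!/bin/sh
# Sketch: sanity-check a commit range before an interactive rebase.
# check_range is a hypothetical helper; $1 is the shaHashIdOnto value.
check_range() {
    onto=$1
    merges=$(git rev-list --merges --count "$onto"..HEAD)
    echo "merge commits in range: $merges"
    echo "authors in range:"
    git log --format='%an' "$onto"..HEAD | sort -u
    # Non-zero status signals that the range still contains merges.
    [ "$merges" -eq 0 ]
}
```

If the author list shows anyone other than you, or the merge count is not zero, stop and go back to the GIT REBASE TO TIP step.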
git rebase -i shaHashIdOnto # Run this only after confirming it is the correct value.
The rebase command is going to pop up a list of commits with the default action "pick". It should contain the full list of all your unpushed commits on this branch. It should not show commits made by other people, they should all be your work. It should not have any merges in it. If it has the wrong stuff, abort the rebase (see below).
pick f26dd66 Oops large file
pick ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep
If you have a line for a commit that is no longer needed, for example, the only thing in that commit was the large file that you are trying to get rid of, then simply delete that line. Then the commit will simply be removed and disappear from the rebase result.
If the commit contains the large file but other stuff you want to keep, change it from "pick" to "edit". The system will stop at that commit, and let you edit it.
In this example, we delete the first entry and change the second to "edit", leaving the list below. Save and quit the editor.
edit ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep
When rebase stops at "Oops large file and other stuff to keep.", remove the offending file from the index.
git rm --cached someLargeFile
Amend the commit; -C HEAD tells git to reuse the old commit message.
git commit --amend -C HEAD
Finally, git rebase --continue goes ahead with the rest of the rebase operation.
git rebase --continue
If all else fails, you can abort the rebase:
git rebase --abort
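For practice, the edit step can be rehearsed in a throwaway repo before touching your real branch. The sketch below scripts the editor step through the GIT_SEQUENCE_EDITOR environment variable, using sed to change the first "pick" to "edit". On your real branch you would normally make that change by hand in the editor. File names and commit messages are made up.

```shell
#!/bin/sh
# Throwaway demo in a temporary repo; file names and messages are made up.
# GIT_SEQUENCE_EDITOR lets a command stand in for the editor "rebase -i" opens.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.email dev@example.com
git config user.name "Demo Dev"

echo "code" > util.c
git add util.c
git commit -q -m "base"
onto=$(git rev-parse HEAD)                # plays the role of shaHashIdOnto

head -c 3000000 /dev/zero > big.dat
echo "keep me" > notes.txt
git add big.dat notes.txt
git commit -q -m "Oops large file and other stuff to keep."
echo "more" >> util.c
git add util.c
git commit -q -m "Other good stuff to keep"

# Change the first "pick" to "edit", as you would by hand in the editor.
GIT_SEQUENCE_EDITOR="sed -i.bak '1s/^pick/edit/'" git rebase -i "$onto"

# Rebase has stopped at the first commit: drop the big file from it.
git rm -q --cached big.dat
git commit -q --amend -C HEAD
rm -f big.dat                             # leftover untracked copy in the work tree
git rebase --continue

git log --oneline --stat "$onto"..HEAD    # big.dat should no longer appear
```

Both commits survive with their original messages, and notes.txt is kept. Only big.dat is gone from the rewritten history.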
FOLLOWUP
Finally, with no large file left in the branch history, we can push to the shared repo. This is the whole reason we did all that work. (If you repaired a dev branch, you will probably do something else here.)
git push # of course if others pushed since your last update, you may have to git pull.
If you used git stash earlier to put something aside, you can now restore the uncommitted work:
git stash pop # ONLY if you saved it aside with git stash earlier, and it makes sense.