File too large checked in
Latest revision as of 19:03, 26 April 2021
FILE TOO LARGE CHECKED IN and HOW TO FIX IT
When I do git push I see this error:
Exceeds file size limit 2200000.
WHY BIG FILES ARE NOT ALLOWED
The kent repo has a limit (currently 2.2 MB) on the size of files being checked in. The restriction is implemented as a hook in the central shared repo that developers push to. We did not want large files to be checked in, and during the transition from CVS to git, many huge test files were removed. Also, GitHub has size restrictions which have to be honored. Without this size restriction, people would find the kent repo excessively bloated and hard to use. This is a repository of source code text, which is small.
WHY PEOPLE CHECK IN BIG FILES
Because developers are encouraged to make a standard tests subdirectory for their kent utilities, there are testing files which get checked in, and unless care is exercised, it is very easy for programmers who deal with giant genomics files to accidentally check them in. Also, sometimes people want to check in PDF documents and some reasonably sized JPG or PNG images. Please use JPG when it is a camera image, for better compression and smaller size. PNG uses lossless compression, which produces bigger files, and is good for diagrams and other non-photographic things with a small number of colors. And sometimes, people just make a mistake, or forget about the file size limit. Avoiding large images is useful to keep in mind for genome-announce emails as well.
WHY DO I FIND OUT ABOUT IT SO LATE?
When you clone a repo, hooks are not cloned, so there is no easy way to give them to all users. There are some incomplete and limited ways to add a hook that would detect it during git commit. We are looking into ways to improve this so you could get an earlier warning about a file being too large.
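One of those limited ways can be sketched as a local pre-commit hook. This is only an illustration, not the hook that runs on the shared repo: the function name check_staged_sizes and the MAX_BYTES variable are made up here, and the limit is simply copied from the 2200000 in the push error. Each developer would have to install it by hand in their own clone, and git commit --no-verify bypasses it.

```shell
#!/usr/bin/env bash
# Sketch of a client-side size check. To try it, save as .git/hooks/pre-commit
# in your clone and make it executable. Names below are hypothetical.
MAX_BYTES=${MAX_BYTES:-2200000}   # assumed to match the server-side limit

check_staged_sizes() {
    local status=0 path size
    # Walk files added or modified in the index; NUL-separated to survive odd names.
    while IFS= read -r -d '' path; do
        # Size of the staged blob (":path" names the index entry), not the working file.
        size=$(git cat-file -s ":$path") || continue
        if [ "$size" -gt "$MAX_BYTES" ]; then
            echo "too large: $path ($size bytes, limit $MAX_BYTES)" >&2
            status=1
        fi
    done < <(git diff --cached --name-only --diff-filter=AM -z)
    return "$status"
}

# Run the check only when this file is executing as the hook itself.
if [ "${0##*/}" = "pre-commit" ]; then
    check_staged_sizes || exit 1
fi
```

With this in place, git commit stops before the oversized file ever enters your local history, which avoids the rebase work described below.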
WHY IS IT SO HARD TO FIX
Since git is a powerful source code control system, you might hope that it would easily handle this situation. However, because git builds immutable trees, which are a good thing for so many purposes, removing something or changing it requires changing the git history of the branch. We must avoid pushing large files to the shared repo main branch. Once it goes there, hundreds of users all over the world will pick it up automatically, and there is no way to go around fixing up all of those copies to remove large files from their history.
However, git can indeed fix the history of a branch in your local git tree which has not been pushed. And that is what we are going to do here.
FIXING YOUR LOCAL BRANCH WITH LARGE FILE CHECKED IN
In order to fix your branch, you are going to have to use some form of git rebase. Otherwise, it can never be fixed.
A common case is where a user realizes the mistaken large file, and uses git rm to remove it, or uses git add to replace it with a smaller version of the file, such as a test file or jpg image or pdf, and git commit. So the large file no longer exists on the tip of their branch. However, it does exist in the history.
If you have uncommitted changes, check them in or use stash to clean up your repo for action.
git add someFile # this is often a good choice.
git commit
or
git stash # only if needed
REMOVE THE BIG FILES or REPLACE WITH SMALLER ONES
If you have not removed or replaced the large file already, you can do this:
git rm someFileLarge
or edit the large file to reduce its size and re-add it
git add someFileNowSmaller
follow up with the usual
git commit
So now there is no large file on your branch tip.
SQUASH?
The system is smart enough to skip large files that no longer exist when it does the squash.
People often squash their development branch anyway, which makes code review easier since it is just one big commit. If, on the other hand, you know you do not want to squash, but want to keep all the separate commits, skip ahead to the GIT REBASE section.
IF YOUR LARGE FILE IS ON A DEV BRANCH
git checkout master
As usual, you may have to handle git conflicts during any merge.
git merge --squash myDevBranch
git commit # merge --squash only stages the combined changes, so commit them yourself
Rename the squashed branch so you know it was done
git branch -m myDevBranch myDevBranchSquashed
Eventually, you will need to delete myDevBranchSquashed to recover its space if you care.
IF YOUR LARGE FILE IS ON MASTER BRANCH
Turn your master branch into a dev branch, then create a new master branch, and squash the old one onto it. Only do this if it makes sense.
git fetch # update origin/master
git branch -m master tempMaster
git branch master origin/master
Look at .git/config to fix master branch tracking if needed.
git checkout master
git merge --squash tempMaster
git commit # merge --squash only stages the combined changes, so commit them yourself
git push
After a few days, you can delete tempMaster if you do not need it. This should also allow git garbage collection to clean that large file from your own local repo.
git branch -D tempMaster
The benefit of SQUASH is that it is simple and you are done.
The disadvantage is that you lose your commit history, and all those changes just became one big commit on master branch. This is just right for many users.
GIT REBASE
Use the squash method (see above) if that works for you.
But otherwise, use git rebase. It will preserve your individual commits and their messages and history if that is important to you.
Git rebase is our friend for crises like this. But it has to be used properly.
HAVE YOU MERGED FROM MASTER?
In particular, suppose you merged from master before you noticed the large file error message during a later push. You could easily have dozens of your own commits mixed with hundreds of commits made by other people, pulled in from the master branch, which has commits from the entire team. It might even be months since you last pushed successfully, and you may have pulled several times since then.
So if you have done even one merge from master before you discovered the problem, which is pretty common, then you should proceed with GIT REBASE TO TIP.
If you are ABSOLUTELY certain that you have NOT git pulled even once on your problem branch, then skip this step and go ahead to the GIT REBASE INTERACTIVE section.
GIT REBASE TO TIP
Rebasing your entire branch onto the tip of the master branch is super useful here because it automatically gathers all of your commits together and puts them at the tip of the branch. This gets rid of the merge commits from master, and simplifies the history. Note that this is just the first step, and does not fix the large file issue itself.
The rebase-to-tip avoids a big problem that you would otherwise have with git rebase interactive, since there could be hundreds of commits made by others from those pulls from master you did earlier. Sadly, git rebase makes you handle merge conflicts, but at least if all of yours are gathered together at the end, you are looking at 7 of your own commits rather than 806 commits made by dozens of people working on code that you did not touch, know nothing about, and are in no position to resolve merge conflicts in. So putting just your own commits all together at the master tip totally avoids having to rebase and resolve conflicts through everybody else's work.
Because master is used so commonly, that is what appears here in our example, but it should be easy for developers to adapt this if needed to another branch.
Do this if you are not already on master or use a dev branch if that is in need of repair.
git checkout master # or your dev branch
git fetch # update origin/master
git rebase origin/master # this is the magic.
If you get conflicts, you must resolve them. Yes, it is a minor pain, and you think, hey, I already resolved some of these earlier, why do I have to do it again? But rebase is not smart enough to do that for you. We are only doing this because we had no other way to fix the large file issue. Just be glad you do not have to re-do conflicts for other users too. Sometimes you get lucky and the merges are simple.
vi conflicted-file # resolve conflicts by editing carefully
git add conflicted-file
git rebase --continue
You can use this if something goes horribly wrong:
git rebase --abort
Sometimes it may get stuck on an empty commit where nothing happened, or where the change was optimized out. Just run this to skip it and proceed.
git rebase --skip
Now all of your commits are together at the tip, and they have not been pushed to master yet of course.
GIT REBASE INTERACTIVE
Look at the history, all your commits should be at the tip. Stop if not. You should not even see as much as one merge.
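The "not even one merge" check can be done mechanically. A minimal sketch, assuming origin/master is the upstream you rebased onto; the function name count_unpushed_merges is made up for this example:

```shell
#!/usr/bin/env bash
# Count merge commits among the commits that are on HEAD but not on the upstream.
# Hypothetical helper; pass another ref or sha if your upstream is not origin/master.
count_unpushed_merges() {
    local upstream=${1:-origin/master}
    git rev-list --merges --count "$upstream..HEAD"
}
```

If count_unpushed_merges prints anything other than 0, go back to GIT REBASE TO TIP before continuing.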
Find how far back your unpushed commits go. Then use the sha hash id of the parent of your commits, which is often the same as the value in origin/master. The goal here is to focus on which commits contain the bad large files that you do not want. We need to remove those files from those commits so that, as far as the history is concerned, it never happened.
Find common ancestor.
git merge-base master origin/master # can use some dev branch instead of master.
The sha hash id of this common ancestral commit serves as our shaHashIdOnto value.
Save the output for use below and confirm these are your correct commits, and show the bad large files.
git log --stat shaHashIdOnto..HEAD > myCommits.txt # which will have the too large files
Look at myCommits.txt; you can refer to it later as needed. If you see commits that are NOT your work, stop, something is wrong. It should not have any merges in it. If you see that not all of your commits are there, stop, something is wrong.
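Because git log --stat does not print byte sizes, it can help to list the oversized blobs in that same range directly. The following is a sketch under assumptions: the helper name list_large_blobs is invented here, and the default limit is taken from the 2200000 in the push error.

```shell
#!/usr/bin/env bash
# list_large_blobs BASE [LIMIT]
# Print "size path" for each blob that git lists for the commits in BASE..HEAD
# and that is larger than LIMIT bytes. Hypothetical helper, not part of git.
list_large_blobs() {
    local base=$1 limit=${2:-2200000}
    # rev-list --objects emits "sha path" lines; cat-file reports type and size,
    # and %(rest) carries the path through unchanged.
    git rev-list --objects "$base..HEAD" |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk -v limit="$limit" '
            $1 == "blob" && $2 + 0 > limit + 0 {
                size = $2
                sub(/^[^ ]+ [^ ]+ /, "")   # keep the full path, spaces included
                print size, $0
            }'
}
```

Running list_large_blobs shaHashIdOnto before the interactive rebase shows which files to look for, even if they were already removed from the branch tip. Running it again after the rebase should print nothing.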
git rebase -i shaHashIdOnto # Run this only after confirming it is the correct value.
The rebase command is going to pop up a list of commits with the default action "pick". It should contain the full list of all your unpushed commits on this branch. It should not show commits made by other people, they should all be your work. It should not have any merges in it. If it has the wrong stuff, abort the rebase (see below).
pick f26dd66 Oops large file
pick ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep
If you have a line for a commit that is no longer needed, for example, the only thing in that commit was the large file that you are trying to get rid of, then simply delete that line. Then the commit will simply be removed and disappear from the rebase result.
If the commit contains the large file but other stuff you want to keep, change it from "pick" to "edit". The system will stop at that commit, and let you edit it.
In this example, we delete the first entry and change the second to edit, so the list looks like this. Save and quit the editor.
edit ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep
When rebase stops at "Oops large file and other stuff to keep.", remove the offending file from the index.
git rm --cached someLargeFile
Amend the commit. The -C HEAD option instructs git to reuse the old commit message.
git commit --amend -C HEAD
Finally, git rebase --continue goes ahead with the rest of the rebase operation.
git rebase --continue
If all else fails, you can abort the rebase:
git rebase --abort
FOLLOWUP
Finally, without a large file in the branch history, we can push to the shared repo. This is the whole reason we did all that work. (If you repaired a dev branch, you will probably do something else here.)
git push # if others pushed since your last update, you may have to git pull first.
If you earlier used git stash to put something aside, you can use it to restore the uncommitted work:
git stash pop # ONLY if you saved it aside with git stash earlier, and it makes sense.