File too large checked in: Difference between revisions


Revision as of 00:28, 25 April 2021

FILE TOO LARGE CHECKED IN and HOW TO FIX IT

When I do git push I see this error:

Exceeds file size limit 2200000. 

WHY BIG FILES ARE NOT ALLOWED

The kent repo has a limit (currently 2.2 MB) on the size of files being checked in. The restriction is implemented as a hook in the central shared repo that developers push to. We already did not want large files to be checked in, and during the transition from CVS to git, many huge test files were removed. Also, GitHub has size restrictions which have to be honored. Without this size restriction, people would find the kent repo excessively bloated and hard to use. This is a repository of source code text, which is small.

WHY PEOPLE CHECK IN BIG FILES

Because developers are encouraged to make a standard tests subdirectory for their kent utilities, there are testing files which get checked in, and unless care is exercised, it is very easy for programmers who deal with giant genomics files to accidentally check them in. Also, sometimes people want to check in PDF documents and some reasonably sized JPG or PNG images. Please use JPG when it is a camera image, for better compression and smaller size. PNG uses lossless compression, which produces bigger files, and is good for diagrams and other non-photographic things with a small number of colors. And sometimes, people just make a mistake, or forget about the limit.

WHY DO I FIND OUT ABOUT IT SO LATE?

When you clone a repo, hooks are not cloned, so there is no easy way to give them to all users. There are some incomplete and limited ways to add a hook that would detect the problem during git commit. We are looking into ways to improve this so you could get an earlier warning about a file being too large.
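As an illustration of what such an early warning could look like, here is a minimal sketch of a local pre-commit check. It is not the hook used on the central repo. The function names are made up for this example, and the limit is simply copied from the push error message above. It also assumes ordinary file names, since git quotes unusual paths in this output.

```shell
#!/bin/sh
# Hypothetical local check, NOT the hook installed on the central repo.
LIMIT=2200000   # bytes; copied from the push error message

# Print "size path" for every staged (added or modified) file above LIMIT.
large_staged_files() {
    git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
        # ":path" names the version of the file that is in the index.
        size=$(git cat-file -s ":$path") || continue
        if [ "$size" -gt "$LIMIT" ]; then
            echo "$size $path"
        fi
    done
}

# Return non-zero if any staged file is too large.
check_staged_files() {
    offenders=$(large_staged_files)
    if [ -n "$offenders" ]; then
        echo "Exceeds file size limit $LIMIT:" >&2
        echo "$offenders" >&2
        return 1
    fi
    return 0
}
```

To try it, you could save this as .git/hooks/pre-commit in your own clone, make it executable, and end the script with a line that runs check_staged_files || exit 1. Since hooks are not cloned, every user would have to do this by hand.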

WHY IS IT SO HARD TO FIX

Since git is a powerful source code control system, you might hope that it would easily handle this situation. However, because git builds immutable trees, which are a good thing for so many purposes, removing something or changing it requires changing the git history of the branch. We must avoid pushing large files to the shared repo main branch. Once it goes there, hundreds of users all over the world will pick it up automatically, and there is no way to go around fixing up all of those copies to remove large files from their history.

However, git can indeed fix the history of a branch in your local git tree which has not been pushed. And that is what we are going to do here.

FIXING YOUR LOCAL BRANCH WITH LARGE FILE CHECKED IN

In order to fix your branch, you are going to have to use some form of git rebase on it. Otherwise, it can never be fixed.

A common case is where a user notices the mistakenly committed large file, and uses git rm to remove it, or uses git add to replace it with a smaller version of the file, such as a test file or jpg image or pdf, and then runs git commit. So the large file no longer exists on the tip of their branch. However, it does exist in the history.
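To see for yourself that the large file is still in the history, you can list the big blobs reachable from your unpushed commits. This is a sketch, not an official kent tool. The function name large_blobs_in is made up, and the limit is copied from the push error message.

```shell
LIMIT=2200000   # bytes; copied from the push error message

# List "size path" for every blob above LIMIT reachable from the given
# revisions. Arguments are passed straight to git rev-list.
large_blobs_in() {
    git rev-list --objects "$@" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$LIMIT" '
        $1 == "blob" && $2 + 0 > limit + 0 {
            size = $2
            sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the path
            print size, $0
        }'
}
```

For example, large_blobs_in origin/master..HEAD checks only the commits you have not pushed yet. Any line it prints is a file that the central repo would reject, even if the file is gone from the tip of your branch.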

As usual with all of this stuff, if you have uncommitted changes, check them in or use stash to clean up your repo for action.

 git add    # this is often a good choice.
 git commit

or

 git stash  # only if needed

SQUASH?

If you were going to squash your development branch anyway, then you can just merge --squash into the master branch, and git is smart enough to skip the large file that no longer exists when it does so.

 git checkout master

As usual, you may have to handle conflicts during any merge.

 git merge --squash myDevBranch

If your changes were on master, and not on a dev branch, turn your master branch into a dev branch, and then create a new master branch, and squash that onto it. Only do this if it makes sense.

 git fetch  # update origin/master
 git branch -m master tempMaster
 git branch master origin/master

Look at .git/config to fix master branch tracking if needed.

 git checkout master
 git merge --squash tempMaster
 git push
 # after
 

The benefit is that it is simple and you are done.

The disadvantage is that you lose your commit history, and all those changes just became one commit on master branch. This is just right for many users.
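If you want to convince yourself that the squash really leaves the large file behind, you can try it in a throwaway repo first. The sketch below is only a demonstration. The function name demo_squash and the file names big.bin, code.txt and readme.txt are all made up. It builds a tiny repo, commits a large file on myDevBranch and removes it again, squashes onto master, and then counts the blobs above the limit that are reachable from master.

```shell
# Scratch-repo demonstration only; do not run this inside your kent clone.
demo_squash() {
    repo=$(mktemp -d) && cd "$repo" || return 1
    git init -q .
    git symbolic-ref HEAD refs/heads/master   # make sure the branch is named master
    git config user.name "Demo User"
    git config user.email "demo@example.com"
    echo base > readme.txt
    git add readme.txt && git commit -q -m "base"

    git checkout -q -b myDevBranch
    head -c 3000000 /dev/zero > big.bin       # 3 MB, over the 2200000 limit
    git add big.bin && git commit -q -m "add large file by mistake"
    git rm -q big.bin && git commit -q -m "remove large file"
    echo change > code.txt
    git add code.txt && git commit -q -m "real work"

    git checkout -q master
    git merge --squash myDevBranch >/dev/null
    git commit -q -m "squashed myDevBranch"

    # Count blobs above the limit that are reachable from master.
    git rev-list --objects master |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" && $2 + 0 > 2200000' | wc -l
}
```

Running demo_squash prints the number of oversized blobs in the history of master after the squash. The large blob still exists on myDevBranch, but that branch is never pushed.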


GIT CHERRY-PICK?

NOT RECOMMENDED

If you only have a handful of commits, and you know which ones they are, you can try to use this method. It is tedious. You would have to use git log to find which specific commits need to be saved. You might have to turn the master branch into a dev or temp branch as above, create a new master, and then pick specific commits from the temp branch onto master. You may still need to do a git rebase -i if you cannot make the large file go away simply by skipping a no longer needed commit or two.
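For what it is worth, the steps described here could be sketched roughly as follows. The function name repick_onto_fresh_master is made up, and the commit ids are whatever you found with git log. Leave out the commits that added or removed the large file. This assumes your remote is called origin and that your work is on master.

```shell
# Rebuild master from origin/master, keeping only the commits named
# on the command line (oldest first).
# Usage: repick_onto_fresh_master <commit>...
repick_onto_fresh_master() {
    git fetch -q origin &&                   # update origin/master
    git branch -m master tempMaster &&       # keep the old work under a temp name
    git branch master origin/master &&       # fresh master from the shared repo
    git checkout -q master &&
    git cherry-pick "$@"                     # replay only the chosen commits
}
```

Before calling it, something like git log --oneline origin/master..master lists the candidate commits. If a cherry-pick stops on a conflict, resolve it and continue with git cherry-pick --continue.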

GIT REBASE