Source Control

From genomewiki

spoiler: source is on git now ... see Working with Git

Introduction

Recently we discussed alternative source code control systems, including Subversion, a centralized system that is the most similar to CVS. We also discussed Git, Mercurial, and Bazaar, which are examples of a distributed or peer-to-peer architecture.

CVS characteristics

Timing test on checkout speed:

CVS root tree on local filesystem, checkout on local filesystem, no network involved:

Full kent source tree checkout, 125 MB == 2m30s

Current worst case: CVSROOT points to a symlink on hgwdev that is hosted on the silo NFS server and points to the kkusr01 NFS server, so three different networked servers are in the path:

Full kent source tree checkout, 135 MB == 2m33s

Note that the networked connections add essentially no time to a full source checkout (2m33s versus 2m30s). Any perceived slowdown in CVS operations is due either to lock interference or to networking problems between the various NFS servers.

Tagging the local copy of the source tree: 3m30s
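
A timing like the ones above could be reproduced with a small wrapper. This is only a sketch: the module name `kent` and the helper names are assumptions made for illustration, not the procedure actually used for these measurements.

```python
import subprocess
import time


def format_elapsed(seconds):
    """Format a duration in the 2m30s style used on this page."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m{secs:02d}s"


def checkout_command(cvsroot, module="kent"):
    """Build a full-checkout command line.

    The module name 'kent' is an assumption; substitute the real one.
    """
    return ["cvs", "-d", cvsroot, "checkout", module]


def time_command(argv, cwd=None):
    """Run argv to completion and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


# Hypothetical usage, run from an empty scratch directory:
#   print(format_elapsed(time_command(checkout_command("/path/to/cvsroot"))))
```

Running the same wrapper once against a local CVSROOT and once against the NFS-hosted one would give the two numbers compared above.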

New tool fanatics

It wasn't explicitly stated yesterday why it might be a good idea to move to a new source control system. The reasons were hinted at but never clearly stated. Such goals should be kept in mind; otherwise we are merely using our time to be geeks and play with new tools, which is always fun.

  1. Easier to do longer term development on major features
  2. Easier to rearrange the source tree, which may be useful in the future for a burgeoning list of advanced features to be built.
  3. A tracking system is necessary to aid QA procedures and user ticket tracking. It should allow users to enter feature requests and bug reports.

I'm momentarily blanking on other reasons why this may be useful. Should we keep this going in an endless email thread, or can we summarize and maintain it on a wiki page?

Responses to New tool fanatics

Various email responses are summarized below.

  • Slow to check out or update or anything (this isn't because of CVS, this is due to server infrastructure)
  • Unreliable check-ins
  • Crappy handling of binary files
  • Unable to commit code changes when SOE servers break or you're offline
  • no more cvs locks
    • no need to avoid piping the output of cvs commands into a pager such as more, which can hang while a lock is held
  • no missed code changes in our cvs-reports/code review
  • Branching in CVS is so annoying that everyone avoids it.
    • Difficult/impossible to commit changes before things are "done"
    • See below for more on branching

Branching

All but trivial branching is painful in CVS. Branching is really the key piece in being able to do parallel development, both for a single developer and for multiple developers working together outside the mainline. From what I have read, Bazaar and Mercurial handle some of the more difficult merge cases that are beyond git and svn. See Intelligent Merging after Moves or Renames.

Generally, I found this comparison educational because it covers a lot of systems, including commercial ones.

That said, many of the issues some of these systems deal with are not that common in our environment, which is not nearly as complex as it can get. For example, I have read about companies with actively maintained branches that were 10 years old!

I think we will be really happy with git. Especially if everyone embraces branches as their friend...

Beware, unless branches are completed, they become dead experiments. This is fine as long as the system can make it easy to distinguish between dead experiments and real branches.

Requirements for whatever we choose

  • procedure to do code reviews, including
    • only changes between 2 points (eg. this week & last week)
    • ability to check all reviews were done
    • ability to check all responses were implemented
  • automated procedure to create a release, including
    • incrementing #define CGI_VERSION in versionInfo.h
    • 'tagging' source files
    • building binaries
  • automated way to patch & rebuild binaries for a release
  • automated way for the public (esp. mirrors) to
    • retrieve the source either from the repository or as plain source
    • from an external download site, presumably hgdownload
  • some way to communicate with QA
    • possibly associating commits with issues in a tracking system
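
As an illustration of the "incrementing #define CGI_VERSION" step in the release procedure above, here is a minimal sketch. It assumes versionInfo.h holds the version as a plain integer, optionally in double quotes, on a single #define line; that line shape and the function name are assumptions, not a description of the existing release scripts.

```python
import re

# Assumed line shape: #define CGI_VERSION "215" (the quotes are optional here).
_VERSION_RE = re.compile(
    r'^(\s*#define\s+CGI_VERSION\s+"?)(\d+)("?\s*)$', re.MULTILINE
)


def bump_cgi_version(header_text):
    """Return (new_text, new_version) with CGI_VERSION incremented by one."""
    match = _VERSION_RE.search(header_text)
    if match is None:
        raise ValueError("no integer CGI_VERSION define found")
    new_version = int(match.group(2)) + 1
    # Replace only the digits so the rest of the header is left untouched.
    new_text = (
        header_text[: match.start(2)]
        + str(new_version)
        + header_text[match.end(2):]
    )
    return new_text, new_version
```

A release script could write the returned text back to versionInfo.h, commit it, and then tag the tree with the new version number.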

Git: Proposed Solution 1

Reorganizing the source

We can create a single repository of all the source. I made the initial git repo with the 'git cvsimport' command (it took a few hours to run and recreate the entire history of changesets, but it only needs to be done once).

Size of repository 100 MB
Time to check for updates vs local repo ?s
Time to check for updates vs remote repo ?s

There are alternative strategies for organizing the source into smaller repos, such as one per major subdirectory or module, e.g. http://git.kernel.org/. We could (maybe) start with something simple like

kent/src/inc & kent/src/lib        --> kent-src-lib.git
kent/src/hg/inc & kent/src/hg/lib  --> kent-src-hg-lib.git
kent/src/hg                        --> kent-src-hg.git
kent/src/util                      --> kent-src-util.git
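
One subtlety of the split above is that kent/src/hg/inc and kent/src/hg/lib sit inside kent/src/hg, so any tool that routes a file to its repo has to prefer the longest matching directory. A minimal sketch of that lookup follows; the table comes from the list above, while the function name is hypothetical.

```python
# Directory prefix -> repository name, taken from the proposed split above.
REPO_BY_PREFIX = {
    "kent/src/inc": "kent-src-lib.git",
    "kent/src/lib": "kent-src-lib.git",
    "kent/src/hg/inc": "kent-src-hg-lib.git",
    "kent/src/hg/lib": "kent-src-hg-lib.git",
    "kent/src/hg": "kent-src-hg.git",
    "kent/src/util": "kent-src-util.git",
}


def repo_for_path(path):
    """Return the repo that would own path, or None if no prefix matches."""
    parts = path.strip("/").split("/")
    # Try the longest directory prefix first so that kent/src/hg/lib
    # wins over the more general kent/src/hg.
    for end in range(len(parts), 0, -1):
        repo = REPO_BY_PREFIX.get("/".join(parts[:end]))
        if repo is not None:
            return repo
    return None
```

With this table, a file under kent/src/hg/lib resolves to kent-src-hg-lib.git, while anything else under kent/src/hg falls through to kent-src-hg.git.
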

I am not sure how revision histories are tracked when code is pushed from one repo to another, but I think the histories are maintained.