What does Genbank contain?

From genomewiki

Genbank is a manual collection of original sequences submitted by individual labs. It is almost as old as molecular biology and has therefore evolved together with it. It is split into divisions (high-throughput, survey, primates, etc.), but these are not strictly followed. Each entry contains a sequence, some additional annotations (such as EST clone numbers, PubMed IDs, the address of the lab, etc.), the molecule type (RNA or DNA) and the organism. The organism with the largest share of Genbank is human.
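As a rough illustration of where the molecule type and the division sit in a record, the sketch below splits the LOCUS line of a Genbank flat-file entry into fields. It assumes a nucleotide record whose LOCUS line is whitespace-separated in the order name, length, "bp", molecule type, optional topology, division, date. The function name `parse_locus_line` is made up for this example, and the second test line is a hypothetical record, not a real accession.

```python
def parse_locus_line(line):
    """Split a nucleotide LOCUS line into its fields (hedged sketch)."""
    tokens = line.split()
    # Expected layout: LOCUS name length bp moltype [topology] division date
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a recognisable nucleotide LOCUS line")
    rest = tokens[4:]
    mol_type = rest[0]            # e.g. DNA, mRNA
    division = rest[-2]           # e.g. PRI (primates), HTG, GSS
    date = rest[-1]
    # Older records may omit the topology column (linear/circular).
    topology = rest[1] if len(rest) == 4 else None
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "mol_type": mol_type,
        "is_rna": "RNA" in mol_type,
        "topology": topology,
        "division": division,
        "date": date,
    }


if __name__ == "__main__":
    old_style = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
    # Hypothetical record, for illustration only.
    new_style = "LOCUS       EXAMPLE01    1200 bp    mRNA    linear   PRI 01-JAN-2000"
    print(parse_locus_line(old_style))
    print(parse_locus_line(new_style))
```

Since the divisions are not strictly followed, the division code is best treated as a hint rather than a reliable classification.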

Human Genbank RNA is easy: apart from a few errors, it consists mostly of cDNAs and is used to build gene models. cDNAs are either partial (<1kb) or full-length. They were submitted by individual authors (of older articles, from before the big-data biology era) and, later, by sequencing centers that used robots to clone cDNAs into plasmids and sequence them.

DNA in Genbank is more diverse. In my experience, Human Genbank DNA consists of short fragments, ends, full-length and partial full-length sequences.

  • short fragments: these are cloned PCR products, mostly of exons. They are rarely longer than 1kb. They make up the majority of Genbank submissions, but not of its sequence size. A prototypical example is the sequence of an exon involved in some genetic disease, including the mutation. These submissions are made by individual authors, are typically accompanied by a reference to a single article, and usually comprise fewer than ~30 sequences per author. The sequences were submitted because journals require it and to preserve the sequence information in a standard format. They are only mapped on the NCBI genome browser (is this still true?).
  • ends: some projects clone a fragment, but then sequence only its ends. The cloning can be done with BACs, fosmids, cosmids or plasmids (BAC <150kb, fosmid <40kb, cosmid <20kb, plasmid <10kb). A submission typically includes hundreds to thousands of sequences, with the two ends of a clone as two separate records. They are submitted by projects and core facilities, not by individual authors. The ends are used to find a BAC covering a given region, so that the frozen clone can then be ordered from a supplier. Most of them are mapped on the UCSC genome browser.
  • full-length: these can be full-length sequences of cosmids, BACs, fosmids or cDNAs (=genes). These sequences were cloned, then shredded into smaller pieces, sequenced and assembled. Some of them contain gaps that the assembler could not bridge; each gap is indicated by 100 N characters. The biggest contributor to the full-length type was the human genome project, which sequenced full-length BACs and then put them together to make the genome. Other main contributors were projects like MGC, which built gene models, and projects interested in a single given clone. These records contain the bulk of the sequence information in Genbank, concentrated on only a few hundred submitters. They are mapped on the NCBI genome browser.
  • partial full-length: this is a BAC whose sequencing was started but never finished. Reasons can include: the sequences indicated a mixed or hybrid clone, or they pointed towards a clone that had already been sequenced. The record is a long sequence of random, unordered 1kb pieces, separated by stretches of 100 N characters. These were mostly produced by the human genome project and are mostly junk by today's standards, so they are not mapped by any genome browser.
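The length and gap conventions in the list above can be turned into a rough sorting heuristic. The sketch below is only an illustration: the function names and the thresholds (1kb for short records, an average piece length of 2kb to flag unfinished clones) are assumptions chosen for this example, not rules used by Genbank. It also cannot tell short fragments from clone ends, because that distinction depends on how the record was submitted, not on the sequence.

```python
import re


def split_on_gaps(seq, gap_len=100):
    """Return the pieces of seq separated by runs of at least gap_len N's."""
    pattern = "N{%d,}" % gap_len
    return [piece for piece in re.split(pattern, seq.upper()) if piece]


def classify_dna(seq, short_max=1000, piece_max=2000):
    """Guess the record type from length and gap structure (thresholds assumed)."""
    if len(seq) < short_max:
        return "short fragment or end"
    pieces = split_on_gaps(seq)
    if len(pieces) > 1:
        mean_piece = sum(len(p) for p in pieces) / len(pieces)
        # Many small unordered pieces suggest an unfinished clone.
        if mean_piece <= piece_max:
            return "partial full-length"
    return "full-length"


if __name__ == "__main__":
    exon = "ACGT" * 100                                   # 400 bp
    finished = "ACGT" * 5000 + "N" * 100 + "ACGT" * 5000  # one assembly gap
    unfinished = ("ACGT" * 250 + "N" * 100) * 20 + "ACGT" * 250
    for name, seq in [("exon", exon), ("finished", finished), ("unfinished", unfinished)]:
        print(name, len(seq), classify_dna(seq))
```

A finished clone with a few assembler gaps still counts as full-length here because its pieces are long; only records made of many ~1kb pieces are flagged as partial.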

Some misclassifications occurred: some cDNAs were artificially created by a program, so they should have been called partial but are declared full-length. Some fragments were never actually observed but were derived from genome sequences, so they should never have been submitted at all. But these cases are extremely rare overall.

The mouse and zebrafish parts of Genbank have a similar structure to human. More exotic species consist mostly of their genome project output plus many cDNAs contributed by gene sequencing projects or individual labs. Even more exotic species have only a few cDNAs in Genbank.