Gene Set Summary Statistics

From genomewiki
Revision as of 23:12, 14 September 2007 by Hiram

gene sets measured

  • hg17 - knownGenes version 2
  • hg18 - knownGenes version 3
  • mm8 - knownGenes version 2
  • mm9 - knownGenes version 3

The min, max and mean measurements are per gene

summary of gene and exon counts

db    gene count  total exon count  min exon count  max exon count  mean exon count
hg17  39368       405720            1               149             10
hg18  56722       519308            1               2899            9
mm8   31863       314628            1               313             10
mm9   49220       417114            1               610             8

summary of exon size statistics

db    sum exon sizes  min exon size  max exon size  mean exon size
hg17  106839720       1              18172          263
hg18  146371091       1              36861          282
mm8   83159087        4              17497          264
mm9   117671086       1              29698          282
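
As a quick consistency check on the two tables above, the mean exon size should equal the sum of exon sizes divided by the total exon count, rounded the same way as in the Methods (add 0.5, then truncate). A minimal sketch, with the numbers copied from the tables; the check_mean helper name is only illustrative and is not part of the original procedure:

```shell
# Hypothetical helper: reads "db sumExonSizes totalExonCount" lines on stdin
# and prints the db name with the rounded mean exon size.
check_mean() {
    awk '{ printf "%s %d\n", $1, $2 / $3 + 0.5 }'
}

# Values copied from the gene/exon count table and the exon size table
check_mean <<EOF
hg17 106839720 405720
hg18 146371091 519308
mm8 83159087 314628
mm9 117671086 417114
EOF
```

The printed means (263, 282, 264, 282) agree with the mean exon size column.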

summary of intron size statistics

db    sum intron sizes  min intron size  max intron size  mean intron size
hg17  2223224397        6                1096450          6069
hg18  2784923600        1                1047320          6023
mm8   1476081990        9                1347550          5220
mm9   2055504784        1                1253430          5589

Top five exon count genes

db   gene name (exon count)
hg17 NM_004543 (149) AF535142 (146) AF535142 (146) NM_033071 (146) AF495910 (146)
hg18 uc001yrq.1 (2899) uc002zvw.1 (322) uc002umr.1 (313) uc002stk.1 (217) uc002umt.1 (194)
mm8 NM_011652 (313) NM_028004 (192) NM_007738 (118) NM_134448 (99) DQ067088 (99)
mm9 uc007pgj.1 (610) uc008kfn.1 (313) uc008kfo.1 (192) uc008jqv.1 (157) uc009rrh.1 (118)


Top five largest CDS extent genes

db   gene name (CDS extent size: thickEnd-thickStart)
hg17 NM_014141 (2298740) NM_000109 (2217347) CR749820 (2138880) NM_004006 (2089394) X14298 (2089394)
hg18 uc003weu.1 (2298740) uc004ddb.1 (2217347) uc001pak.1 (2138880) uc004dda.1 (2089394) uc003wqd.1 (2055833)
mm8 NM_007868 (2253366) NM_001004357 (2238304) NM_053011 (2055883) AK134694 (1988713) NM_053171 (1639258)
mm9 uc009tri.1 (2253366) uc009bst.1 (2238325) uc007zfr.1 (2189582) uc008jon.1 (2055883) uc008mpv.1 (1988713)

Top five smallest transcript genes

db   gene name (transcript size: txEnd-txStart)
hg17 AF241539 (168) AF277175 (176) AY459291 (240) AY605064 (243) AF503918 (258)
hg18 uc004buj.1 (20) uc001dcm.1 (22) uc001seo.1 (22) uc001sqn.1 (22) uc002wpa.1 (22)
mm8 AJ319753 (217) BC107019 (231) BC016221 (286) NM_130876 (303) NM_130873 (304)
mm9 uc007bma.1 (22) uc007gmr.1 (22) uc007khz.1 (22) uc007pay.1 (22) uc007qpn.1 (22)


Methods

  • From the table browser, request three different bed files for the knownGenes track:
  1. whole gene
  2. exons only
  3. introns only
  • The statistics are then extracted from those bed files:
  1. gene count from: 'wc -l wholeGene.bed'
  2. exon count stats from:
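 # fields of the ave -tableOut line used below: 1 = min, 5 = max, 6 = mean, 8 = sum
 # (the sum of column 10 is the total exon count)
 STATS=`ave -col=10 wholeGene.bed -tableOut | grep -v "^#"`
 MIN=`echo $STATS | cut -d' ' -f1`
 MAX=`echo $STATS | cut -d' ' -f5`
 MEAN=`echo $STATS | cut -d' ' -f6 | awk '{printf "%d", $1+0.5}'`
 COUNT=`echo $STATS | cut -d' ' -f8 | awk '{printf "%d", $1}'`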
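
For readers without the ave utility, the same exon count numbers can be approximated with awk alone. This is only a sketch under stated assumptions: the input is BED12 with the exon (block) count in column 10, the exon_count_stats name is made up for illustration, and the three sample genes are invented rather than taken from any real track:

```shell
# Hypothetical helper, awk only. Assumes BED12 input (column 10 = exon count).
# Prints: geneCount totalExonCount minExonCount maxExonCount roundedMean
exon_count_stats() {
    awk '
        NR == 1 { min = max = $10 + 0 }
        {
            n = $10 + 0
            if (n < min) min = n
            if (n > max) max = n
            sum += n
        }
        END { if (NR > 0) printf "%d %d %d %d %d\n", NR, sum, min, max, sum / NR + 0.5 }
    ' "$@"
}

# Made-up three-gene sample standing in for wholeGene.bed
SAMPLE=$(mktemp)
printf 'chr1\t100\t900\tgeneA\t0\t+\t150\t850\t0\t3\t100,100,100,\t0,300,700,\n' > "$SAMPLE"
printf 'chr1\t2000\t2500\tgeneB\t0\t-\t2000\t2000\t0\t1\t500,\t0,\n' >> "$SAMPLE"
printf 'chr2\t50\t5050\tgeneC\t0\t+\t60\t5000\t0\t4\t200,200,200,200,\t0,1000,3000,4800,\n' >> "$SAMPLE"

exon_count_stats "$SAMPLE"
rm -f "$SAMPLE"
```

On the sample this prints "3 8 1 4 3": three genes, eight exons in total, a minimum of 1, a maximum of 4, and a mean of 8/3 rounded to 3.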
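  • for exon or intron size stats (run once per file; {introns,exons}.bed below stands for either introns.bed or exons.bed, since a literal shell brace expansion would pass both files at once):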
 STATS=`awk '{print $3-$2}' {introns,exons}.bed \
      | ave -col=1 stdin -tableOut | grep -v "^#"`
 MIN=`echo $STATS | cut -d' ' -f1`
 MAX=`echo $STATS | cut -d' ' -f5 | awk '{printf "%d", $1}'`
 MEAN=`echo $STATS | cut -d' ' -f6 | awk '{printf "%d", $1+0.5}'`
 SUM_SIZE=`awk '{sum += $3-$2} END{printf "%d", sum}' {introns,exons}.bed`
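
The size statistics can likewise be sketched with awk alone, processing the exon file and the intron file in separate passes. The size_stats name and the sample coordinates are invented for illustration; the only assumption about the input is one BED line per exon or intron, with the size taken as end minus start as in the commands above:

```shell
# Hypothetical helper, awk only. Assumes one BED line per exon or intron.
# Prints: sumOfSizes minSize maxSize roundedMeanSize
size_stats() {
    awk '
        { size = $3 - $2 }
        NR == 1 { min = max = size }
        {
            if (size < min) min = size
            if (size > max) max = size
            sum += size
        }
        END { if (NR > 0) printf "%d %d %d %d\n", sum, min, max, sum / NR + 0.5 }
    ' "$@"
}

# Made-up samples standing in for exons.bed and introns.bed
DIR=$(mktemp -d)
printf 'chr1\t100\t200\tgeneA_exon_0\t0\t+\nchr1\t400\t500\tgeneA_exon_1\t0\t+\nchr1\t800\t900\tgeneA_exon_2\t0\t+\n' > "$DIR/exons.bed"
printf 'chr1\t200\t400\tgeneA_intron_0\t0\t+\nchr1\t500\t800\tgeneA_intron_1\t0\t+\n' > "$DIR/introns.bed"

# One pass per file, so exon and intron sizes are never mixed
for f in exons introns; do
    printf '%s: ' "$f"
    size_stats "$DIR/$f.bed"
done
rm -rf "$DIR"
```

On the sample this prints "exons: 300 100 100 100" and "introns: 500 200 300 250".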
  • top five exon count genes
 sort -k10nr wholeGene.bed | head -5
  • top five CDS size genes
 awk '{cdsSize=$8-$7
 if (cdsSize > 0) {printf "%s\t%s\t%s\t%s\t%d\n", $1,$2,$3,$4,cdsSize}
 }' wholeGene.bed | sort -k5nr | head -5
  • top five smallest transcript genes
 awk '{size=$3-$2
 if (size > 0) {printf "%s\t%s\t%s\t%s\t%d\n", $1,$2,$3,$4,size}
 }' wholeGene.bed | sort -k5n | head -5
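
The source does not say how the sorted output was turned into the "name (value)" rows of the top-five tables. One possible way, shown here only as a sketch, is to pipe the sorted lines through awk. The top_exon_row helper and the sample genes are hypothetical; the input is assumed to be BED12 with the gene name in column 4 and the exon count in column 10:

```shell
# Hypothetical helper: top_exon_row DB FILE
# Prints the db name followed by "name (exonCount)" for the five genes
# with the most exons. Assumes BED12 input.
top_exon_row() {
    sort -k10,10nr "$2" | head -5 |
        awk -v db="$1" '
            BEGIN { printf "%s", db }
            { printf " %s (%d)", $4, $10 }
            END { print "" }'
}

# Made-up three-gene sample standing in for wholeGene.bed
SAMPLE=$(mktemp)
printf 'chr1\t100\t900\tgeneA\t0\t+\t150\t850\t0\t3\t100,100,100,\t0,300,700,\n' > "$SAMPLE"
printf 'chr1\t2000\t2500\tgeneB\t0\t-\t2000\t2000\t0\t1\t500,\t0,\n' >> "$SAMPLE"
printf 'chr2\t50\t5050\tgeneC\t0\t+\t60\t5000\t0\t4\t200,200,200,200,\t0,1000,3000,4800,\n' >> "$SAMPLE"

top_exon_row sample "$SAMPLE"
rm -f "$SAMPLE"
```

On the sample this prints "sample geneC (4) geneA (3) geneB (1)". The same pattern would apply to the CDS extent and transcript size lists by printing column 5 of their awk output instead of column 10.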