VGP Assembly gap analysis

From genomewiki
==VGP assembly gap annotation vs. AGP file==
I'm curious about the lack of gap annotation in the VGP genome assemblies.  In the AGP files supplied
with the assemblies, for example:
  [https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/007/474/595/GCA_007474595.1_mLynCan4_v1.p/GCA_007474595.1_mLynCan4_v1.p_assembly_structure/Primary_Assembly/assembled_chromosomes/AGP/  GCA_007474595.1_mLynCan4_v1.p]
the chr*.comp.agp.gz files do not mention the gaps in the assembly at all.  Of the 24 assemblies I have examined to date, only one has
any gap annotation in its AGP files:
  [https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/246/225/GCF_900246225.1_fAstCal1.2/GCF_900246225.1_fAstCal1.2_assembly_structure/Primary_Assembly/assembled_chromosomes/AGP/ GCF_900246225.1_fAstCal1.2]
with the following characteristics of the annotated gaps:
{| border='1' style='border-collapse:collapse'
|-
! style='text-align:left;'| Assembly ID<br>UCSC browser link
! style='text-align:right;'| assembly<br>size
! style='text-align:right;'| sequence<br>count
! style='text-align:right;'| number<br>of gaps
! total<br>gap size
! smallest<br>gap
! largest<br>gap
! median<br>size
! mean<br>size
! common name<br>VGP link
|-
| [https://genome.ucsc.edu/cgi-bin/hgTracks?hubUrl=https://hgdownload.soe.ucsc.edu/hubs/VGP/hub.txt&genome=GCF_900246225.1_fAstCal1.2&position=lastDbPos GCF_900246225.1_fAstCal1.2]
| style="text-align:right;"| 880,445,564
| style="text-align:right;"| 249
| style="text-align:right;"| 115
| style="text-align:right;"| 11,500
| style="text-align:right;"| 100
| style="text-align:right;"| 100
| style="text-align:right;"| 100
| style="text-align:right;"| 100.0
| [https://vgp.github.io/genomeark/Astatotilapia_calliptera/ eastern happy]
|}
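As a rough illustration of how the annotated-gap figures above could be tallied, here is a minimal Python sketch. It assumes the tab-separated AGP 2.x layout, in which column 5 holds the component type (<code>N</code> or <code>U</code> for a gap) and, on gap lines, columns 6 to 8 hold the gap length, gap type and linkage. The function names are hypothetical, and treating linkage <code>no</code> as ''non-bridged'' is my assumption, not something stated by the assemblies.

```python
import gzip
import statistics


def open_agp(path):
    """Open a plain or gzip-compressed AGP file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def agp_gaps(lines):
    """Yield (object_name, gap_length, gap_type, linkage) for each gap line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # too few columns to be an AGP record
        # Column 5 is the component type; N and U mark gap lines.
        if fields[4] in ("N", "U"):
            yield fields[0], int(fields[5]), fields[6], fields[7]


def summarize_gaps(gaps):
    """Summarize gap tuples into the statistics used in the table."""
    gaps = list(gaps)
    lengths = [gap[1] for gap in gaps]
    summary = {
        "count": len(lengths),
        "total": sum(lengths),
        # Assumption: linkage "no" is taken to mean a non-bridged gap.
        "non_bridged": sum(1 for gap in gaps if gap[3] == "no"),
    }
    if lengths:
        summary.update(
            smallest=min(lengths),
            largest=max(lengths),
            median=statistics.median(lengths),
            mean=statistics.mean(lengths),
        )
    return summary
```

With a downloaded file (the name here is only a placeholder), something like <code>with open_agp("chr1.comp.agp.gz") as fh: print(summarize_gaps(agp_gaps(fh)))</code> would report the counts for one chromosome. A file with no gap lines gives a count of 0.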
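This assembly actually contains far more gaps than its AGP files annotate: counting every run of unknown nucleotides gives 490 gaps totalling 1,186,660 bases (the fAstCal1.2 row under Gap statistics), against the 115 gaps and 11,500 bases annotated above.
The table below counts the actual gaps in each assembly, where a gap is any run of unknown nucleotides.  Some of these assemblies clearly have substantial gaps (the largest single gap, in bStrHab1, is 9,491,740 bases), and I suspect
some of them are ''non-bridged'' gaps.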
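Counting the actual gaps can be sketched along the following lines. This is not necessarily how the table was produced. It assumes unknown nucleotides are written as <code>N</code> or <code>n</code> in the FASTA sequence and that the assembly size counts every base, gaps included. The function names are hypothetical.

```python
import re
import statistics

# Assumption: a gap is a maximal run of N/n characters.
N_RUN = re.compile(r"[Nn]+")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_lengths(sequence):
    """Return the length of every run of unknown nucleotides."""
    return [m.end() - m.start() for m in N_RUN.finditer(sequence)]


def gap_stats(lines):
    """Compute the per-assembly columns of the gap statistics table."""
    lengths, sequence_count, assembly_size = [], 0, 0
    for _name, sequence in read_fasta(lines):
        sequence_count += 1
        assembly_size += len(sequence)
        lengths.extend(gap_lengths(sequence))
    stats = {
        "assembly_size": assembly_size,
        "sequence_count": sequence_count,
        "gap_count": len(lengths),
        "total_gap_size": sum(lengths),
    }
    if lengths:
        stats.update(
            smallest=min(lengths),
            largest=max(lengths),
            median=statistics.median(lengths),
            mean=statistics.mean(lengths),
        )
    return stats
```

Each sequence is joined into one string before scanning so that a gap spanning a FASTA line break is counted once. That costs memory on large chromosomes, and a streaming scan would be the obvious refinement.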

Latest revision as of 21:37, 23 August 2019

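==Gap statistics==
{| border='1' style='border-collapse:collapse' class='wikitable sortable'
|-
! style='text-align:left;'| Assembly ID<br>UCSC browser link
! style='text-align:right;'| assembly<br>size
! style='text-align:right;'| sequence<br>count
! style='text-align:right;'| number<br>of gaps
! total<br>gap size
! smallest<br>gap
! largest<br>gap
! median<br>size
! mean<br>size
! common name<br>VGP link
|-
| GCA_003957555.2_bCalAnn1_v1.p || style="text-align:right;"| 1,059,706,240 || style="text-align:right;"| 160 || style="text-align:right;"| 429 || style="text-align:right;"| 16,096,800 || style="text-align:right;"| 1 || style="text-align:right;"| 664,828 || style="text-align:right;"| 100 || style="text-align:right;"| 37,521.7 || Anna's hummingbird
|-
| GCA_003957565.2_bTaeGut1_v1.p || style="text-align:right;"| 1,058,012,133 || style="text-align:right;"| 135 || style="text-align:right;"| 312 || style="text-align:right;"| 2,334,250 || style="text-align:right;"| 1 || style="text-align:right;"| 535,815 || style="text-align:right;"| 200 || style="text-align:right;"| 7,481.6 || zebra finch
|-
| GCA_004027225.1_bStrHab1_v1.p || style="text-align:right;"| 1,165,639,803 || style="text-align:right;"| 100 || style="text-align:right;"| 362 || style="text-align:right;"| 27,568,500 || style="text-align:right;"| 1 || style="text-align:right;"| 9,491,740 || style="text-align:right;"| 500 || style="text-align:right;"| 76,156.1 || owl parrot
|-
| GCA_004115265.2_mRhiFer1_v1.p || style="text-align:right;"| 2,075,785,400 || style="text-align:right;"| 135 || style="text-align:right;"| 158 || style="text-align:right;"| 7,546,100 || style="text-align:right;"| 1 || style="text-align:right;"| 833,261 || style="text-align:right;"| 499 || style="text-align:right;"| 47,760.1 || greater horseshoe bat
|-
| GCA_007364275.1_fArcCen1 || style="text-align:right;"| 932,947,025 || style="text-align:right;"| 189 || style="text-align:right;"| 744 || style="text-align:right;"| 16,903,500 || style="text-align:right;"| 1 || style="text-align:right;"| 1,459,710 || style="text-align:right;"| 100 || style="text-align:right;"| 22,719.7 || flier cichlid
|-
| GCA_007399415.1_rGopEvg1_v1.p || style="text-align:right;"| 2,298,564,209 || style="text-align:right;"| 383 || style="text-align:right;"| 562 || style="text-align:right;"| 29,560,300 || style="text-align:right;"| 1 || style="text-align:right;"| 3,059,780 || style="text-align:right;"| 100 || style="text-align:right;"| 52,598.4 || Goodes thornscrub tortoise
|-
| GCA_007474595.1_mLynCan4_v1.p || style="text-align:right;"| 2,408,900,816 || style="text-align:right;"| 67 || style="text-align:right;"| 848 || style="text-align:right;"| 2,778,740 || style="text-align:right;"| 1 || style="text-align:right;"| 378,177 || style="text-align:right;"| 100 || style="text-align:right;"| 3,276.8 || Canada lynx
|-
| GCA_900324465.2_fAnaTes1.2 || style="text-align:right;"| 555,641,398 || style="text-align:right;"| 50 || style="text-align:right;"| 266 || style="text-align:right;"| 3,606,500 || style="text-align:right;"| 25 || style="text-align:right;"| 482,892 || style="text-align:right;"| 100 || style="text-align:right;"| 13,558.3 || climbing perch
|-
| GCA_900324485.2_fMasArm1.2 || style="text-align:right;"| 591,935,101 || style="text-align:right;"| 122 || style="text-align:right;"| 238 || style="text-align:right;"| 13,231,300 || style="text-align:right;"| 2 || style="text-align:right;"| 607,827 || style="text-align:right;"| 300 || style="text-align:right;"| 55,593.9 || zig-zag eel
|-
| GCA_901699155.1_bStrTur1.1 || style="text-align:right;"| 1,178,928,410 || style="text-align:right;"| 357 || style="text-align:right;"| 894 || style="text-align:right;"| 3,876,370 || style="text-align:right;"| 2 || style="text-align:right;"| 156,350 || style="text-align:right;"| 100 || style="text-align:right;"| 4,336.0 || turtle dove
|-
| GCA_901709675.1_fSynAcu1.1 || style="text-align:right;"| 324,331,233 || style="text-align:right;"| 87 || style="text-align:right;"| 43 || style="text-align:right;"| 13,700 || style="text-align:right;"| 100 || style="text-align:right;"| 500 || style="text-align:right;"| 200 || style="text-align:right;"| 318.6 || greater pipefish
|-
| GCA_901765095.1_aMicUni1.1 || style="text-align:right;"| 4,685,923,413 || style="text-align:right;"| 1,080 || style="text-align:right;"| 2,452 || style="text-align:right;"| 47,437,300 || style="text-align:right;"| 1 || style="text-align:right;"| 2,824,530 || style="text-align:right;"| 100 || style="text-align:right;"| 19,346.4 || tiny Cayenne caecilian
|-
| GCF_004115215.1_mOrnAna1.p.v1 || style="text-align:right;"| 1,858,552,590 || style="text-align:right;"| 305 || style="text-align:right;"| 522 || style="text-align:right;"| 15,159,800 || style="text-align:right;"| 1 || style="text-align:right;"| 1,622,070 || style="text-align:right;"| 100 || style="text-align:right;"| 29,041.8 || platypus
|-
| GCF_004126475.1_mPhyDis1_v1.p || style="text-align:right;"| 2,117,764,065 || style="text-align:right;"| 141 || style="text-align:right;"| 831 || style="text-align:right;"| 22,966,500 || style="text-align:right;"| 1 || style="text-align:right;"| 1,480,060 || style="text-align:right;"| 500 || style="text-align:right;"| 27,637.1 || pale spear-nosed bat
|-
| GCF_900246225.1_fAstCal1.2 || style="text-align:right;"| 880,445,564 || style="text-align:right;"| 249 || style="text-align:right;"| 490 || style="text-align:right;"| 1,186,660 || style="text-align:right;"| 19 || style="text-align:right;"| 211,000 || style="text-align:right;"| 25 || style="text-align:right;"| 2,421.8 || eastern happy
|-
| GCF_900634415.1_fCotGob3.1 || style="text-align:right;"| 609,391,784 || style="text-align:right;"| 322 || style="text-align:right;"| 445 || style="text-align:right;"| 2,553,980 || style="text-align:right;"| 9 || style="text-align:right;"| 500,417 || style="text-align:right;"| 100 || style="text-align:right;"| 5,739.3 || channel bull blenny
|-
| GCF_900634625.1_fParRan2.1 || style="text-align:right;"| 551,012,959 || style="text-align:right;"| 156 || style="text-align:right;"| 1,523 || style="text-align:right;"| 10,561,500 || style="text-align:right;"| 1 || style="text-align:right;"| 153,246 || style="text-align:right;"| 4,304 || style="text-align:right;"| 6,934.7 || Indian glassy fish
|-
| GCF_900634775.1_fGouWil2.1 || style="text-align:right;"| 937,150,793 || style="text-align:right;"| 441 || style="text-align:right;"| 1,160 || style="text-align:right;"| 14,353,500 || style="text-align:right;"| 1 || style="text-align:right;"| 2,682,550 || style="text-align:right;"| 100 || style="text-align:right;"| 12,373.7 || blunt-snouted clingfish
|-
| GCF_900700375.1_fDenClu1.1 || style="text-align:right;"| 567,401,054 || style="text-align:right;"| 460 || style="text-align:right;"| 464 || style="text-align:right;"| 4,648,390 || style="text-align:right;"| 13 || style="text-align:right;"| 518,837 || style="text-align:right;"| 100 || style="text-align:right;"| 10,018.1 || denticle herring
|-
| GCF_900747795.1_fErpCal1.1 || style="text-align:right;"| 3,811,038,701 || style="text-align:right;"| 1,885 || style="text-align:right;"| 5,614 || style="text-align:right;"| 238,295,000 || style="text-align:right;"| 3 || style="text-align:right;"| 1,733,650 || style="text-align:right;"| 5,046 || style="text-align:right;"| 42,446.5 || reedfish
|-
| GCF_900963305.1_fEcheNa1.1 || style="text-align:right;"| 544,229,245 || style="text-align:right;"| 38 || style="text-align:right;"| 140 || style="text-align:right;"| 599,356 || style="text-align:right;"| 13 || style="text-align:right;"| 248,230 || style="text-align:right;"| 100 || style="text-align:right;"| 4,281.1 || live sharksucker
|-
| GCF_900964775.1_fSclFor1.1 || style="text-align:right;"| 784,563,014 || style="text-align:right;"| 72 || style="text-align:right;"| 145 || style="text-align:right;"| 41,360 || style="text-align:right;"| 13 || style="text-align:right;"| 14,756 || style="text-align:right;"| 100 || style="text-align:right;"| 285.2 || Asian bonytongue
|-
| GCF_901000725.2_fTakRub1.2 || style="text-align:right;"| 384,126,662 || style="text-align:right;"| 128 || style="text-align:right;"| 402 || style="text-align:right;"| 3,688,530 || style="text-align:right;"| 10 || style="text-align:right;"| 428,276 || style="text-align:right;"| 100 || style="text-align:right;"| 9,175.5 || torafugu
|-
| GCF_901001135.1_aRhiBiv1.1 || style="text-align:right;"| 5,319,239,201 || style="text-align:right;"| 1,330 || style="text-align:right;"| 3,573 || style="text-align:right;"| 33,955,800 || style="text-align:right;"| 4 || style="text-align:right;"| 771,063 || style="text-align:right;"| 100 || style="text-align:right;"| 9,503.5 || two-lined caecilian
|}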