BoG2013VariationPoster: Difference between revisions

(→‎Common Gene Haplotype Alleles: tried to make methods more clear, added notes about scoring and the reference variant, added scoring section)
Line 20: Line 20:
See the [http://hgwdev-demo3.cse.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Rhead&hgS_otherUserSessionName=BoG2013VariationPoster development version]. Click on any protein-coding gene in the '''UCSC Genes''' track and scroll to the '''Common Gene Haplotype Alleles''' section.  (The feature is currently implemented only on GRCh37/hg19 protein-coding genes.)


For each protein-coding gene in the UCSC Genes track, variant data from the 2,184 phased chromosomes in the [http://www.1000genomes.org/ 1000 Genomes Project] have been distilled into distinct haplotype alleles. Each haplotype allele is generated from GRCh37/hg19 reference DNA, with 1000 Genomes Project DNA variants spliced in, then translated into amino acids.
Phase 1 of the [http://www.1000genomes.org/ 1000 Genomes Project] included 1,092 individual genomes.  For each protein-coding gene in the UCSC Genes track, variant data from the 2,184 phased chromosomes (two per individual, for autosomes) have been distilled into distinct haplotype alleles: distinct sets of variants, each found on at least one of the 1000 Genomes subject chromosomes.


===Usage tips===


* By default, only non-synonymous, common (occurring in at least 1% of haplotype alleles) variants are displayed.  Including all variants in the display will generate the list of all haplotypes found in 1000 Genomes participants, though many of these haplotypes may have no protein coding effect.  Including all variants will also update haplotype and homozygous frequency calculations.
* By default, only non-synonymous, common variants are displayed.  Common variants occur in at least 1% of 1000 Genomes subject chromosomes.  Including all variants in the display will generate the list of all haplotypes found in 1000 Genomes participants, though many of these haplotypes may have no protein coding effect.  Note that haplotype and homozygous frequency calculations depend upon which variants are included.


* By default, only common (occurring with a frequency of more than 1%) haplotype alleles are displayed.
* By default, only common haplotype alleles are displayed.  Common haplotypes occur in at least 1% of 1000 Genomes subject chromosomes.


* If the reference variant is present among the haplotype alleles generated from the 1000 Genomes data, it will be labeled as such in the "Reference  Variants" column.
* There may be no "reference haplotype" (made of entirely reference variants) represented in the 1000 Genomes data. If there is, it will be marked as "reference" in the table of haplotypes.


* When the full sequence is displayed, columns with variants are highlighted by green vertical lines.  The effects of variants are highlighted by bolded red letters.  Synonymous changes are only evident when DNA bases are displayed.
* When the full sequence is displayed, columns with variants are highlighted by green vertical lines.  The effects of variants are highlighted by bolded red letters.  Synonymous changes are only evident when DNA bases are displayed.  Each haplotype allele sequence is generated from GRCh37/hg19 reference DNA, with 1000 Genomes Project DNA variants spliced in, then translated into amino acids.


* All columns are sortable.
* All columns are sortable.  Sorting on a variant while the full sequence is displayed will highlight that variant with a vertical blue line.


* Hovering your mouse over numbers in the "haplotype frequency" and "homozygous frequency" columns will show you the actual count of alleles (e.g., "370 of 2184").
* Hovering your mouse over numbers in the "haplotype frequency" and "homozygous frequency" columns will show you the actual count of alleles (e.g., "N=370 of 2184").


* Hovering your mouse over some buttons displays hints.


* Clicking on variants in the summary section takes you to the corresponding track details pages of the [http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=tgpPhase1 1000G Ph1 Vars] track.
* Clicking on non-reference variants in the summary section takes you to the corresponding track details pages of the [http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=tgpPhase1 1000G Ph1 Vars] track.


* Clicking the "Display distribution" button will show the distribution of each haplotype allele among major population groups.  Optionally display the distribution of each allele among the [http://www.1000genomes.org/about#ProjectSamples groups defined by the 1000 Genomes Project].
* Clicking the "Display population" button will show the distribution of each haplotype allele among major population groups.  Optionally display the distribution of each allele among the [http://www.1000genomes.org/about#ProjectSamples groups defined by the 1000 Genomes Project].


* By default, scoring is hidden.  Three types of scores are provided to help users find haplotype alleles that occur more or less frequently than expected or that have unusual distributions in populations.  See definitions below.
Line 46: Line 46:
===Scoring definitions===


* '''Hap score''':  
* '''Hap score''': The haplotype score is the normalized (-log10) probability of finding exactly N subject chromosomes with this haplotype, given the proportions of individual variants.  The score is normalized by dividing by the total number of variants.  Normalization allows comparing the scores between genes with many variants and those with few.  The score will be positive if the haplotype is more frequent than expected by chance and negative if less frequent.


* '''Hom score''':  
* '''Hom score''': The homozygous score is the (-log10) probability of finding exactly N individuals with this haplotype on both chromosomes, given the actual frequency of the haplotype in subject chromosomes.  The score will be positive if the haplotype is homozygous in more individuals than expected and negative if it is homozygous in fewer.  Negative values might suggest that the haplotype is deleterious when homozygous.


* '''Pop score''' (only visible when population distributions are displayed):
* '''Pop score''' (only visible when population distributions are displayed): The population skew score is the variance between population groups divided by N, the number of occurrences of the haplotype.  The most frequently occurring haplotypes will potentially have larger scores, but if N is small, a skew in population distribution is not unexpected.


==How to get help==

Revision as of 00:50, 27 April 2013

This page contains links related to the UCSC Genome Browser poster presented by Brooke Rhead at Biology of Genomes 2013 [1].

Poster: New variation resources at the UCSC Genome Browser

This poster presents a first look at two new UCSC Genome Browser features for assessing variation. Both features will be released to the public website in the coming months.

Variant Annotation Integrator

See the development version.

In order to assist researchers in annotating and prioritizing thousands of variant calls from sequencing projects, we are developing the Variant Annotation Integrator (VAI) and anticipate a first public release by the end of June 2013. There are several existing tools that can annotate variant calls with predicted functional effects on protein-coding genes and regulatory regions, for example Ensembl's Variant Effect Predictor (VEP). However, these tools are usually restricted to one or two sources of gene annotations and a limited set of additional annotation sources. The VAI will offer much broader choices from the full UCSC database and user-provided custom tracks.

The first release of the VAI will include a simple user interface for selecting variants to annotate as well as the most commonly used annotation sources: protein-coding genes, regulatory regions, predictions from tools such as SIFT and PolyPhen2 provided by the Database of Non-Synonymous Functional Predictions (dbNSFP), and already-discovered variants from dbSNP. The simple user interface will also provide several options for filtering variants based on annotations. A link to an advanced user interface will enable sophisticated users to add annotation sources from the full database.

Common Gene Haplotype Alleles

See the development version. Click on any protein-coding gene in the UCSC Genes track and scroll to the Common Gene Haplotype Alleles section. (The feature is currently implemented only on GRCh37/hg19 protein-coding genes.)

Phase 1 of the 1000 Genomes Project included 1,092 individual genomes. For each protein-coding gene in the UCSC Genes track, variant data from the 2,184 phased chromosomes (two per individual, for autosomes) have been distilled into distinct haplotype alleles: distinct sets of variants, each found on at least one of the 1000 Genomes subject chromosomes.
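
The distillation step described above can be sketched in Python. This is a minimal illustration, not the browser's actual code: the function name `summarize_haplotypes`, the input layout (each subject mapped to a pair of phased haplotypes, each haplotype a tuple of alleles in site order), and the definition of homozygous frequency as the fraction of individuals carrying the same haplotype on both chromosomes are all assumptions made for the example. The 1% cutoff is the "common" threshold mentioned in the usage tips.

```python
from collections import Counter


def summarize_haplotypes(subjects, min_freq=0.01):
    """Collapse phased chromosomes into distinct haplotype alleles.

    subjects: dict of subject ID -> (hap_a, hap_b); each haplotype is a
    tuple of alleles in site order (hypothetical layout for this sketch).
    Returns one row per haplotype whose frequency is at least min_freq,
    most frequent first.
    """
    chrom_counts = Counter()  # chromosomes carrying each haplotype
    hom_counts = Counter()    # individuals carrying it on both chromosomes
    for hap_a, hap_b in subjects.values():
        chrom_counts[hap_a] += 1
        chrom_counts[hap_b] += 1
        if hap_a == hap_b:
            hom_counts[hap_a] += 1

    n_individuals = len(subjects)
    n_chroms = 2 * n_individuals  # e.g. 1,092 individuals -> 2,184 chromosomes
    rows = []
    for hap, count in chrom_counts.most_common():
        freq = count / n_chroms
        if freq >= min_freq:
            rows.append({
                "haplotype": hap,
                "count": count,        # the "N" in "N=370 of 2184"
                "of": n_chroms,
                "hap_freq": freq,
                "hom_freq": hom_counts[hap] / n_individuals,
            })
    return rows


# Toy input: two subjects, two variant sites.
toy = {
    "s1": (("A", "G"), ("A", "G")),
    "s2": (("A", "G"), ("C", "G")),
}
for row in summarize_haplotypes(toy):
    print(row)
```

Because the counts are taken over whichever variant sites are included in each tuple, adding or dropping sites changes which chromosomes count as the same haplotype, which is why the frequencies depend on the variants selected for display.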

Usage tips

  • By default, only non-synonymous, common variants are displayed. Common variants occur in at least 1% of 1000 Genomes subject chromosomes. Including all variants in the display will generate the list of all haplotypes found in 1000 Genomes participants, though many of these haplotypes may have no protein coding effect. Note that haplotype and homozygous frequency calculations depend upon which variants are included.
  • By default, only common haplotype alleles are displayed. Common haplotypes occur in at least 1% of 1000 Genomes subject chromosomes.
  • There may be no "reference haplotype" (made of entirely reference variants) represented in the 1000 Genomes data. If there is, it will be marked as "reference" in the table of haplotypes.
  • When the full sequence is displayed, columns with variants are highlighted by green vertical lines. The effects of variants are highlighted by bolded red letters. Synonymous changes are only evident when DNA bases are displayed. Each haplotype allele sequence is generated from GRCh37/hg19 reference DNA, with 1000 Genomes Project DNA variants spliced in, then translated into amino acids.
  • All columns are sortable. Sorting on a variant while the full sequence is displayed will highlight that variant with a vertical blue line.
  • Hovering your mouse over numbers in the "haplotype frequency" and "homozygous frequency" columns will show you the actual count of alleles (e.g., "N=370 of 2184").
  • Hovering your mouse over some buttons displays hints.
  • Clicking on non-reference variants in the summary section takes you to the corresponding track details pages of the 1000G Ph1 Vars track.
  • Clicking the "Display population" button will show the distribution of each haplotype allele among major population groups. Optionally display the distribution of each allele among the groups defined by the 1000 Genomes Project.
  • By default, scoring is hidden. Three types of scores are provided to help users find haplotype alleles that occur more or less frequently than expected or that have unusual distributions in populations. See definitions below.

Scoring definitions

  • Hap score: The haplotype score is the normalized (-log10) probability of finding exactly N subject chromosomes with this haplotype, given the proportions of individual variants. The score is normalized by dividing by the total number of variants. Normalization allows comparing the scores between genes with many variants and those with few. The score will be positive if the haplotype is more frequent than expected by chance and negative if less frequent.
  • Hom score: The homozygous score is the (-log10) probability of finding exactly N individuals with this haplotype on both chromosomes, given the actual frequency of the haplotype in subject chromosomes. The score will be positive if the haplotype is homozygous in more individuals than expected and negative if it is homozygous in fewer. Negative values might suggest that the haplotype is deleterious when homozygous.
  • Pop score (only visible when population distributions are displayed): The population skew score is the variance between population groups divided by N, the number of occurrences of the haplotype. The most frequently occurring haplotypes will potentially have larger scores, but if N is small, a skew in population distribution is not unexpected.
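
The three definitions above leave several details open, so the following Python sketch fills them in with explicitly labeled assumptions rather than the browser's confirmed method. It assumes that "the probability of finding exactly N" is a binomial probability, that the expected haplotype frequency is the product of the per-variant allele frequencies (sites treated as independent), that the expected homozygous rate is the square of the haplotype frequency, and that the population "variance" is the population variance of per-group counts. All function names are hypothetical.

```python
import math
import statistics


def log10_binom_pmf(k, n, p):
    """log10 of the binomial probability of exactly k successes in n trials."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    natural_log = log_choose + k * math.log(p) + (n - k) * math.log1p(-p)
    return natural_log / math.log(10)


def signed_score(observed, n, p_expected, divisor=1):
    """-log10 P(exactly observed), signed by direction of the deviation."""
    surprise = -log10_binom_pmf(observed, n, p_expected) / divisor
    # Positive when observed at least as often as expected, else negative.
    return surprise if observed >= n * p_expected else -surprise


def hap_score(n_hap, n_chroms, variant_freqs):
    """variant_freqs: frequency of the allele this haplotype carries at
    each variant site (assumed independent for the expected frequency)."""
    if not variant_freqs:
        raise ValueError("at least one variant is required")
    p_expected = math.prod(variant_freqs)
    # Normalize by the number of variants, as the definition describes.
    return signed_score(n_hap, n_chroms, p_expected, len(variant_freqs))


def hom_score(n_hom, n_individuals, hap_freq):
    """Expected homozygous rate assumed to be hap_freq squared."""
    return signed_score(n_hom, n_individuals, hap_freq ** 2)


def pop_score(counts_by_group):
    """Population variance of per-group counts divided by total count N."""
    counts = list(counts_by_group.values())
    return statistics.pvariance(counts) / sum(counts)


print(hap_score(370, 2184, [0.5, 0.5]))
print(hom_score(0, 1092, 0.5))
print(pop_score({"group1": 4, "group2": 0}))
```

Working in log space with `math.lgamma` avoids the underflow that a direct binomial calculation would hit at sample sizes such as 2,184 chromosomes.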

How to get help

Other posters about the UCSC Genome Browser

  • Using the UCSC Genome Browser to evaluate putative genetic variants. Hinrichs AS et al. Biology of Genomes, 2012. genomewiki page .pptx, PDF
  • Visually integrating genomic data in the UCSC Genome Browser. Hinrichs AS et al. HGV 2011. genomewiki page .pptx, PDF
  • UCSC Genome Browser Data Hubs. Zweig AS et al. Biology of Genomes, 2011. PDF
  • Genome-wide ENCODE Data at UCSC. Rosenbloom KR et al. ASHG, 2010. PPT
  • UCSC Genome Browser Tool Suite. Hinrichs AS et al. Genomics of Common Disease, 2008: .ppt, PDF
  • More Presentations