Finding nearby genes: Difference between revisions
Cath Tyner (talk | contribs) m (→Introduction) |
Cath Tyner (talk | contribs) m (→Alternatives) |
||
Line 21: | Line 21: | ||
* [http://main.g2.bx.psu.edu/ Galaxy] has a "Fetch closest non-overlapping feature" tool | * [http://main.g2.bx.psu.edu/ Galaxy] has a "Fetch closest non-overlapping feature" tool | ||
* the BedTools include a tool [http://bedtools.readthedocs.io/en/latest/content/example-usage.html#bedtools-closest "closestBed"] | * the BedTools include a tool [http://bedtools.readthedocs.io/en/latest/content/example-usage.html#bedtools-closest "closestBed"] | ||
* use the [https://genome.ucsc.edu/goldenpath/help/multiRegionHelp.html "Multi-Region tool"] to remove, or "slice out" intergenic regions. | |||
===Script for knownGene on hg18=== | ===Script for knownGene on hg18=== |
Revision as of 20:56, 26 May 2017
Introduction
If you are interested in a certain genomic position, or reference point, and you want to find a sample of nearby genes upstream and downstream from this position, you can create a script by copying one of the examples below. These scripts will find the nearest transcripts (upstream and downstream) from your reference point, and report the gene name also. The last two scripts (for hg19 & hg38) will also report the distance from the nearby transcripts and the reference point.
- Open your editor on the command line and create a script in your bin directory.
- E.g.,
vi closestGene.sh
- Paste one of the scripts below into your closestGene.sh file
- Run the script.
closestGene.sh
- Scripts assume that you have MySQL access. See more about the UCSC Genome Browser MySQL database.
Alternatives
- Galaxy has a "Fetch closest non-overlapping feature" tool
- the BedTools include a tool "closestBed"
- use the "Multi-Region tool" to remove, or "slice out" intergenic regions.
Script for knownGene on hg18
View table schema for knownGene, hg18
#!/bin/sh # given position chr1:710000-720000 # find a sample of genes near this upstream and downstream C=chr1 S=710000 E=720000 echo "three upstream genes from ${C}:${S}-${E}" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \ 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}' ORDER BY e.txEnd DESC limit 3;' hg18 echo "three downstream genes from ${C}:${S}-${E}" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \ 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit 3;' hg18
This produces the output:
three upstream genes from chr1:710000-720000 +------+--------+--------+------------+----------+ | chr1 | 690107 | 703869 | uc001abo.1 | BC006361 | | chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 | | chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 | +------+--------+--------+------------+----------+ three downstream genes from chr1:710000-720000 +------+--------+--------+------------+----------+ | chr1 | 752926 | 778860 | uc001abp.1 | BC102012 | | chr1 | 752926 | 778860 | uc001abq.1 | BC042880 | | chr1 | 752926 | 779603 | uc001abr.1 | CR601056 | +------+--------+--------+------------+----------+
Script for refGene on hg19
Here is a script for the gene set refGene on hg19. For this example, the output can be seen in this session, where the custom track labeled, "closest" are the regions in the MySQL output (the 10 closest transcripts upstream, and the 10 closest transcripts downstream). The other custom track, labeled, "distanceCheck" is derived from the last column in the SQL output, the number of bp that each transcript is from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.
View the table schema for refGene, hg19
#!/bin/sh # for gene set refGene # given position chr1:991973-991973 # find a sample of genes near this upstream and downstream # Input your assembly G=hg19 # Input the chr for reference point C=chr1 # Input start for reference point S=991973 # Input end for reference point E=991973 # Input the number of nearby transcripts to output N=10 # This script uses the gene set refGene. # Any gene set can be used. If a different gene set is used, check that # the field names are the same, they may need updating. To check this, # go to the Table Browser, select your gene set, and click the link for # "table schema" to see field names. Older assemblies may use the related # kgXref table for gene alias/gene name. # The last column is the distance from the comparison point. echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" echo "last column is distance from reference point to transcript, ${S} - txEnd" echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd AS "'${S}'-txEnd" FROM refGene e, kgXref j WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" ORDER BY e.txEnd DESC limit 10;' $G echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" echo "last column is distance from reference point to transcript, ${E} - txStart" echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart AS "'${E}'-txStart" FROM refGene e, kgXref j WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit 10;' $G
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txEnd is the 5' end, the transcription start site +-------+---------+--------+--------+--------------+------------+--------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txEnd | +-------+---------+--------+--------+--------------+------------+--------------+ | chr1 | 955502 | 991499 | + | NM_198576 | AGRN | 474 | | chr1 | 948846 | 949919 | + | NM_005101 | ISG15 | 42054 | | chr1 | 934341 | 935552 | - | NM_021170 | HES4 | 56421 | | chr1 | 934343 | 935552 | - | NM_001142467 | HES4 | 56421 | | chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | chr1 | 901876 | 910484 | + | NM_001160184 | PLEKHN1 | 81489 | | chr1 | 895966 | 901099 | + | NM_198317 | KLHL17 | 90874 | | chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | | chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | +-------+---------+--------+--------+--------------+------------+--------------+ closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set last column is distance from reference point to transcript, 991973 - txStart Note: for reverse - strand items, txStart is the 3' end, not transcription start site +-------+---------+---------+--------+--------------+------------+----------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txStart | +-------+---------+---------+--------+--------------+------------+----------------+ | chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1072396 | 1079434 | + | NR_038869 | LOC254099 | -80423 | | chr1 | 1102483 | 1102578 | + | NR_029639 | MIR200B | -110510 | | chr1 | 1103242 | 1103332 | + | NR_029834 | MIR200A | -111269 | | chr1 | 1104384 | 1104467 | + | NR_029957 | MIR429 | -112411 | | chr1 | 1109285 | 1133313 | + | NM_001130045 | TTLL10 | -117312 | +-------+---------+---------+--------+--------------+------------+----------------+
Script for ncbiRefSeq on hg38
Here is a script for the gene set ncbiRefSeq on hg38.
Note that the last column in the SQL output is the distance, or the number of bp that each transcript is, from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.
View table schema for ncbiRefSeq, hg38
#!/bin/sh # for gene set ncbiRefSeq # given position chr1:991973-991973 # find a sample of genes near this upstream and downstream # Input your assembly G=hg38 # Input the chr for reference point C=chr1 # Input start for reference point S=991973 # Input end for reference point E=991973 # Input the number of nearby transcripts to output N=10 # Any gene set can be used. If a different gene set is used, check that # the field names are the same, they may need updating. To check this, # go to the Table Browser, select your gene set, and click the link for # "table schema" to see field names. Older assemblies may use the related # kgXref table for gene alias/gene name. # The last column is the distance from the comparison point. echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set" echo "last column is distance from reference point to transcript, ${S} - txEnd" echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd AS "'${S}'-txEnd" FROM ncbiRefSeq e, ncbiRefSeqLink j WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" ORDER BY e.txEnd DESC limit '${N}';' $G echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set" echo "last column is distance from reference point to transcript, ${E} - txEnd" echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart AS "'${E}'-txStart" FROM ncbiRefSeq e, ncbiRefSeqLink j WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit '${N}';' $G
This produces the output:
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txEnd is the 5' end, the transcription start site +-------+---------+--------+--------+----------------+---------+--------------+ | chrom | txStart | txEnd | strand | name | name | 991973-txEnd | +-------+---------+--------+--------+----------------+---------+--------------+ | chr1 | 975198 | 982117 | - | NM_001291367.1 | PERM1 | 9856 | | chr1 | 975198 | 982117 | - | NM_001291366.1 | PERM1 | 9856 | | chr1 | 975198 | 982093 | - | XM_017002583.1 | PERM1 | 9880 | | chr1 | 975198 | 982021 | - | XM_017002584.1 | PERM1 | 9952 | | chr1 | 975197 | 981657 | - | XM_017002585.1 | PERM1 | 10316 | | chr1 | 966496 | 975108 | + | NM_032129.2 | PLEKHN1 | 16865 | | chr1 | 966496 | 975108 | + | NM_001160184.1 | PLEKHN1 | 16865 | | chr1 | 965819 | 974587 | + | XM_006710944.3 | PLEKHN1 | 17386 | | chr1 | 965819 | 974587 | + | XM_017002476.1 | PLEKHN1 | 17386 | | chr1 | 965819 | 974587 | + | XM_017002474.1 | PLEKHN1 | 17386 | +-------+---------+--------+--------+----------------+---------+--------------+ closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txStart is the 3' end, not the transcription start site +-------+---------+---------+--------+----------------+--------------+----------------+ | chrom | txStart | txEnd | strand | name | name | 991973-txStart | +-------+---------+---------+--------+----------------+--------------+----------------+ | chr1 | 998961 | 1000172 | - | NM_021170.3 | HES4 | -6988 | | chr1 | 998961 | 1001052 | - | XM_005244771.4 | HES4 | -6988 | | chr1 | 998963 | 1000172 | - | NM_001142467.1 | HES4 | -6990 | | chr1 | 1013466 | 1014540 | + | NM_005101.3 | ISG15 | -21493 | | chr1 | 1020101 | 1056119 | + | XM_011541429.2 | AGRN | -28128 | | chr1 | 1020101 | 1056119 | + | XR_946650.2 | AGRN | -28128 | | chr1 | 1020101 | 1056119 | + | XM_005244749.3 | AGRN | -28128 | | chr1 | 1020122 | 1056119 | + | NM_198576.3 | AGRN | -28149 | | chr1 | 1020122 | 1056119 | + | NM_001305275.1 | AGRN | -28149 | | chr1 | 1059706 | 1066441 | + | XR_001737601.1 | LOC100288175 | -67733 | +-------+---------+---------+--------+----------------+--------------+----------------+