Finding nearby genes
Introduction
If you are interested in a certain genomic position, or reference point, and you want to find a sample of nearby genes upstream and downstream from this position, you can create a script by copying one of the examples below. These scripts will find the nearest transcripts (upstream and downstream) from your reference point, and report the gene name also. The last two scripts (for hg19 & hg38) will also report the distance from the nearby transcripts and the reference point.
- Open your editor on the command line and create a script in your bin directory.
- E.g.,
vi closestGene.sh
- Paste one of the scripts below into your closestGene.sh file
- Of course, make sure your script has the proper permissions to be executable:
chmod +x closestGene.sh
- Run the script.
closestGene.sh
- Scripts assume that you have MySQL access (installed MySQL and a .hg.conf file). See more about the UCSC Genome Browser MySQL database.
Alternatives
Galaxy
- Galaxy has a "Fetch closest non-overlapping feature" tool.
- Use the Table Browser, select a track, and click on the link, "Galaxy" (next to "output format") to send output to Galaxy.
- Use the "Fetch closest non-overlapping feature for every interval" under the "Operate on Genomic Intervals" menu in Galaxy
- For questions using Galaxy, contact Galaxy help: https://biostar.usegalaxy.org/
- Galaxy posts that may help: post1 post 2
BedTools
- The BedTools include a tool "closestBed"
Multi-Region
- Use the "Multi-Region tool" to remove, or "slice out" intergenic regions in the browser, allowing you to visualize a region with a "gene-only" (or exon-only) view. Currently, the multi-region option does not provide a way to download the gene-only or exon-only regions you are viewing in the browser.
Template MySQL Query
All of the below example scripts are just specialized or slightly modified versions of the following template MySQL command, where all of the variables within ${} are customizable parameters:
mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \ table1.chrom, table1.${chromStart}, table1.${chromEnd}, table1.strand, table1.name, table2.name as geneSymbol from ${tblName1} table1,\ ${tblName2} table2 where table1.name = table2.id AND table1.chrom='${chrom}' AND \ ((table1.${chromStart} >= ${refStart} - ${range} AND table1.${chromStart} <= ${refEnd} + ${range}) OR \ (table1.${chromEnd} >= ${refStart} - ${range} AND table1.${chromEnd} <= ${refEnd} + ${range})) \ order by table1.${chromEnd} desc " $db
The optional paramters are explained below, where the value after the '=' sign indicates an example value:
chromStart="txStart" # field name of the transcript start for the primary table chromEnd="txEnd" # field name of the transcript end for the primary table tblName1="ncbiRefSeqCurated" # primary table name that stores the transcript coordinates tblName2="ncbiRefSeqLink" # optional secondary table with geneSymbol information chrom="chr1" # chromosome of interest range="10000" # optional range outside of interest point refStart="166167154" # start coordinate of range of interest refEnd="166167602" # end coordinate of range of interest db="hg38" # database of interest
The above query, with the above example values, finds all transcripts in the ncbiRefSeqCurated table within 10kb of chr1:166167154-166167602, which is an example enhancer region:
+-------+-----------+-----------+--------+----------------+------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | +-------+-----------+-----------+--------+----------------+------------+ | chr1 | 166055917 | 166166755 | - | NR_135199.1 | FAM78B | | chr1 | 166055917 | 166166755 | - | NM_001320302.1 | FAM78B | | chr1 | 166069298 | 166166755 | - | NM_001017961.4 | FAM78B | +-------+-----------+-----------+--------+----------------+------------+
Both the example query and the example parameters are intended to be directly pasted into a bash shell and/or modified to suit your needs. The example scripts below this page are all essentially variations on the above information, specialized for specific applications (downstream or upstream only, etc).
Examples
"Nearest gene" script for knownGene on hg18
- This script will find the closest transcripts to a reference point region.
- View table schema for knownGene, hg18
#!/bin/sh # given position chr1:710000-720000 # find a sample of genes near this upstream and downstream C=chr1 S=710000 E=720000 echo "three upstream genes from ${C}:${S}-${E}" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \ 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}' ORDER BY e.txEnd DESC limit 3;' hg18 echo "three downstream genes from ${C}:${S}-${E}" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \ 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit 3;' hg18
This produces the output:
three upstream genes from chr1:710000-720000 +------+--------+--------+------------+----------+ | chr1 | 690107 | 703869 | uc001abo.1 | BC006361 | | chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 | | chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 | +------+--------+--------+------------+----------+ three downstream genes from chr1:710000-720000 +------+--------+--------+------------+----------+ | chr1 | 752926 | 778860 | uc001abp.1 | BC102012 | | chr1 | 752926 | 778860 | uc001abq.1 | BC042880 | | chr1 | 752926 | 779603 | uc001abr.1 | CR601056 | +------+--------+--------+------------+----------+
"Nearest gene" script for refGene on hg19
- This script will find the closest transcripts to a reference point region for the gene set refGene on hg19.
- For this example, the output can be seen in this session, where the custom track labeled, "closest" are the regions in the MySQL output (the 10 closest transcripts upstream, and the 10 closest transcripts downstream). The other custom track, labeled, "distanceCheck" is derived from the last column in the SQL output, the number of bp that each transcript is from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.
#!/bin/sh # for gene set refGene # given position chr1:991973-991973 # find a sample of genes near this upstream and downstream # Input your assembly G=hg19 # Input the chr for reference point C=chr1 # Input start for reference point S=991973 # Input end for reference point E=991973 # Input the number of nearby transcripts to output N=10 # This script uses the gene set refGene. # Any gene set can be used. If a different gene set is used, check that # the field names are the same, they may need updating. To check this, # go to the Table Browser, select your gene set, and click the link for # "table schema" to see field names. Older assemblies may use the related # kgXref table for gene alias/gene name. # The last column is the distance from the comparison point. echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" echo "last column is distance from reference point to transcript, ${S} - txEnd" echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd AS "'${S}'-txEnd" FROM refGene e, kgXref j WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" ORDER BY e.txEnd DESC limit 10;' $G echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" echo "last column is distance from reference point to transcript, ${E} - txStart" echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart AS "'${E}'-txStart" FROM refGene e, kgXref j WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit 10;' $G
This produces the output:
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txEnd is the 5' end, the transcription start site +-------+---------+--------+--------+--------------+------------+--------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txEnd | +-------+---------+--------+--------+--------------+------------+--------------+ | chr1 | 955502 | 991499 | + | NM_198576 | AGRN | 474 | | chr1 | 948846 | 949919 | + | NM_005101 | ISG15 | 42054 | | chr1 | 934341 | 935552 | - | NM_021170 | HES4 | 56421 | | chr1 | 934343 | 935552 | - | NM_001142467 | HES4 | 56421 | | chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | chr1 | 901876 | 910484 | + | NM_001160184 | PLEKHN1 | 81489 | | chr1 | 895966 | 901099 | + | NM_198317 | KLHL17 | 90874 | | chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | | chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | +-------+---------+--------+--------+--------------+------------+--------------+ closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set last column is distance from reference point to transcript, 991973 - txStart Note: for reverse - strand items, txStart is the 3' end, not transcription start site +-------+---------+---------+--------+--------------+------------+----------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txStart | +-------+---------+---------+--------+--------------+------------+----------------+ | chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1072396 | 1079434 | + | NR_038869 | LOC254099 | -80423 | | chr1 | 1102483 | 1102578 | + | NR_029639 | MIR200B | -110510 | | chr1 | 1103242 | 1103332 | + | NR_029834 | MIR200A | -111269 | | chr1 | 1104384 | 1104467 | + | NR_029957 | MIR429 | -112411 | | chr1 | 1109285 | 1133313 | + | NM_001130045 | TTLL10 | -117312 | +-------+---------+---------+--------+--------------+------------+----------------+
"Nearest gene" script for ncbiRefSeq on hg38
- * This script will find the closest transcripts to a reference point region for the gene set ncbiRefSeq on hg38.
- Note that the last column in the SQL output is the distance, or the number of bp that each transcript is, from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.
#!/bin/sh # for gene set ncbiRefSeq # given position chr1:991973-991973 # find a sample of genes near this upstream and downstream # Input your assembly G=hg38 # Input the chr for reference point C=chr1 # Input start for reference point S=991973 # Input end for reference point E=991973 # Input the number of nearby transcripts to output N=10 # Any gene set can be used. If a different gene set is used, check that # the field names are the same, they may need updating. To check this, # go to the Table Browser, select your gene set, and click the link for # "table schema" to see field names. Older assemblies may use the related # kgXref table for gene alias/gene name. # The last column is the distance from the comparison point. echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set" echo "last column is distance from reference point to transcript, ${S} - txEnd" echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd AS "'${S}'-txEnd" FROM ncbiRefSeq e, ncbiRefSeqLink j WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" ORDER BY e.txEnd DESC limit '${N}';' $G echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set" echo "last column is distance from reference point to transcript, ${E} - txEnd" echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart AS "'${E}'-txStart" FROM ncbiRefSeq e, ncbiRefSeqLink j WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit '${N}';' $G
This produces the output:
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txEnd is the 5' end, the transcription start site +-------+---------+--------+--------+----------------+---------+--------------+ | chrom | txStart | txEnd | strand | name | name | 991973-txEnd | +-------+---------+--------+--------+----------------+---------+--------------+ | chr1 | 975198 | 982117 | - | NM_001291367.1 | PERM1 | 9856 | | chr1 | 975198 | 982117 | - | NM_001291366.1 | PERM1 | 9856 | | chr1 | 975198 | 982093 | - | XM_017002583.1 | PERM1 | 9880 | | chr1 | 975198 | 982021 | - | XM_017002584.1 | PERM1 | 9952 | | chr1 | 975197 | 981657 | - | XM_017002585.1 | PERM1 | 10316 | | chr1 | 966496 | 975108 | + | NM_032129.2 | PLEKHN1 | 16865 | | chr1 | 966496 | 975108 | + | NM_001160184.1 | PLEKHN1 | 16865 | | chr1 | 965819 | 974587 | + | XM_006710944.3 | PLEKHN1 | 17386 | | chr1 | 965819 | 974587 | + | XM_017002476.1 | PLEKHN1 | 17386 | | chr1 | 965819 | 974587 | + | XM_017002474.1 | PLEKHN1 | 17386 | +-------+---------+--------+--------+----------------+---------+--------------+ closest downstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txStart is the 3' end, not the transcription start site +-------+---------+---------+--------+----------------+--------------+----------------+ | chrom | txStart | txEnd | strand | name | name | 991973-txStart | +-------+---------+---------+--------+----------------+--------------+----------------+ | chr1 | 998961 | 1000172 | - | NM_021170.3 | HES4 | -6988 | | chr1 | 998961 | 1001052 | - | XM_005244771.4 | HES4 | -6988 | | chr1 | 998963 | 1000172 | - | NM_001142467.1 | HES4 | -6990 | | chr1 | 1013466 | 1014540 | + | NM_005101.3 | ISG15 | -21493 | | chr1 | 1020101 | 1056119 | + | XM_011541429.2 | AGRN | -28128 | | chr1 | 1020101 | 1056119 | + | XR_946650.2 | AGRN | -28128 | | chr1 | 1020101 | 1056119 | + | XM_005244749.3 | AGRN | -28128 | | chr1 | 1020122 | 1056119 | + | NM_198576.3 | AGRN | -28149 | | chr1 | 1020122 | 1056119 | + | NM_001305275.1 | AGRN | -28149 | | chr1 | 1059706 | 1066441 | + | XR_001737601.1 | LOC100288175 | -67733 | +-------+---------+---------+--------+----------------+--------------+----------------+