Finding nearby genes

From genomewiki

Introduction

If you are interested in a genomic position (a reference point) and want to find nearby genes upstream and downstream of it, you can create a script by copying one of the examples below. These scripts find the nearest transcripts upstream and downstream of your reference point and report the gene name for each. The last two scripts (for hg19 and hg38) also report the distance between each nearby transcript and the reference point.

  • Open your editor on the command line and create a script in your bin directory.
  • E.g.,
vi closestGene.sh
  • Paste one of the scripts below into your closestGene.sh file
  • Make sure your script has the proper permissions to be executable:
chmod +x closestGene.sh
  • Run the script.
closestGene.sh
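
The final step above assumes that the directory holding closestGene.sh is listed in your PATH. A minimal sketch for checking this follows; the function name on_path and the BIN_DIR location are illustrative assumptions, not part of the original instructions.

```sh
#!/bin/sh
# Succeed if directory $1 is one of the entries in $PATH.
on_path() {
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumption: the script was saved in $HOME/bin; adjust to your setup.
BIN_DIR="${HOME}/bin"

if on_path "${BIN_DIR}"; then
    echo "run it as: closestGene.sh"
else
    echo "run it as: ${BIN_DIR}/closestGene.sh"
fi
```

If the directory is not on your PATH, calling the script by its full path (or as ./closestGene.sh from inside that directory) works as well.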

Alternatives

Galaxy

BedTools

Multi-Region

  • Use the "Multi-Region tool" to remove, or "slice out" intergenic regions in the browser, allowing you to visualize a region with a "gene-only" (or exon-only) view. Currently, the multi-region option does not provide a way to download the gene-only or exon-only regions you are viewing in the browser.

Template MySQL Query

All of the example scripts below are specialized or slightly modified versions of the following template MySQL command, where all of the variables within ${} are customizable parameters:

mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \
   table1.chrom, table1.${chromStart}, table1.${chromEnd}, table1.strand, table1.name, table2.name as geneSymbol from ${tblName1} table1,\
    ${tblName2} table2 where table1.name = table2.id AND table1.chrom='${chrom}' AND \
        ((table1.${chromStart} >= ${refStart} - ${range} AND table1.${chromStart} <= ${refEnd} + ${range}) OR \
        (table1.${chromEnd} >= ${refStart} - ${range} AND table1.${chromEnd} <= ${refEnd} + ${range})) \
  order by table1.${chromEnd} desc " $db

The parameters are explained below, where the value after the '=' sign indicates an example value:

chromStart="txStart"          # field name of the transcript start for the primary table
chromEnd="txEnd"              # field name of the transcript end for the primary table
tblName1="ncbiRefSeqCurated"  # primary table name that stores the transcript coordinates
tblName2="ncbiRefSeqLink"     # optional secondary table with geneSymbol information
chrom="chr1"                  # chromosome of interest
range="10000"                 # optional range outside of interest point
refStart="166167154"          # start coordinate of range of interest
refEnd="166167602"            # end coordinate of range of interest
db="hg38"                     # database of interest

The above query, with the above example values, finds all transcripts in the ncbiRefSeqCurated table within 10kb of chr1:166167154-166167602, which is an example enhancer region:

+-------+-----------+-----------+--------+----------------+------------+
| chrom | txStart   | txEnd     | strand | name           | geneSymbol |
+-------+-----------+-----------+--------+----------------+------------+
| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B     |
| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B     |
| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B     |
+-------+-----------+-----------+--------+----------------+------------+
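
Note that the template condition only matches transcripts whose start or end coordinate falls inside the search window, so a transcript that spans the entire window would not be listed. The following sketch compares that condition with a general interval-overlap test using shell arithmetic. The function names and the spanning coordinates 166000000-166300000 are hypothetical, chosen only for illustration.

```sh
#!/bin/sh
# Search window from the example: chr1:166167154-166167602 with a 10 kb range
lo=$((166167154 - 10000))
hi=$((166167602 + 10000))

# Template condition: transcript start OR transcript end lies inside [lo, hi]
template_match() {
    { [ "$1" -ge "$lo" ] && [ "$1" -le "$hi" ]; } ||
    { [ "$2" -ge "$lo" ] && [ "$2" -le "$hi" ]; }
}

# General overlap condition: transcript and window share at least one position
overlap_match() {
    [ "$1" -le "$hi" ] && [ "$2" -ge "$lo" ]
}

# Report which conditions a transcript (start=$1, end=$2) satisfies
describe() {
    t=no; o=no
    if template_match "$1" "$2"; then t=yes; fi
    if overlap_match "$1" "$2"; then o=yes; fi
    echo "$1-$2 template=$t overlap=$o"
}

describe 166055917 166166755   # FAM78B transcript from the table above
describe 166000000 166300000   # hypothetical transcript spanning the window
```

In the template, the overlap form would correspond to replacing the two OR-ed clauses with table1.${chromStart} <= ${refEnd} + ${range} AND table1.${chromEnd} >= ${refStart} - ${range}.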

Both the example query and the example parameters are intended to be pasted directly into a bash shell and modified to suit your needs (set the parameters first, then run the query, so that the shell can substitute them). The example scripts further down this page are all essentially variations on the above information, specialized for specific applications (downstream only, upstream only, etc.).
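
As one way to keep the parameters and the query together, the template can be wrapped in a small shell function. This is only a sketch: the function name build_query is an assumption, and the window edges are precomputed in the shell rather than inside the SQL, which should give the same result.

```sh
#!/bin/sh
# Example parameter values from above
chromStart="txStart"
chromEnd="txEnd"
tblName1="ncbiRefSeqCurated"
tblName2="ncbiRefSeqLink"
chrom="chr1"
range="10000"
refStart="166167154"
refEnd="166167602"
db="hg38"

# Print the SQL for the current parameter values (no database access here).
build_query() {
    lo=$((refStart - range))   # left edge of the search window
    hi=$((refEnd + range))     # right edge of the search window
    cat <<EOF
select table1.chrom, table1.${chromStart}, table1.${chromEnd}, table1.strand,
       table1.name, table2.name as geneSymbol
from ${tblName1} table1, ${tblName2} table2
where table1.name = table2.id and table1.chrom='${chrom}'
  and ((table1.${chromStart} >= ${lo} and table1.${chromStart} <= ${hi})
    or (table1.${chromEnd} >= ${lo} and table1.${chromEnd} <= ${hi}))
order by table1.${chromEnd} desc
EOF
}

# Show the query that would be sent
build_query

# To run it against the public server, uncomment the next line:
# mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "$(build_query)" "$db"
```

Printing the query before running it makes it easy to check that every ${} parameter was substituted as intended.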

Examples

"Nearest gene" script for knownGene on hg18

#!/bin/sh

# given position chr1:710000-720000
# find a sample of genes near this upstream and downstream
C=chr1
S=710000
E=720000

echo "three upstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
ORDER BY e.txEnd DESC limit 3;' hg18

echo "three downstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 3;' hg18

This produces the output:

three upstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
| chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
| chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
+------+--------+--------+------------+----------+
three downstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
| chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
| chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
+------+--------+--------+------------+----------+

"Nearest gene" script for refGene on hg19

  • This script will find the closest transcripts to a reference point region for the gene set refGene on hg19.
  • For this example, the output can be seen in this session, where the custom track labeled "closest" shows the regions in the MySQL output (the 10 closest transcripts upstream and the 10 closest transcripts downstream). The other custom track, labeled "distanceCheck", is derived from the last column in the SQL output: the number of bp between each transcript and the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached. That point may be the 5' end or the 3' end depending on strand orientation.
#!/bin/sh

# for gene set refGene
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg19
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# This script uses the gene set refGene.
# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.

echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G

echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G

This produces the output:

closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+--------------+------------+--------------+
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txEnd |
+-------+---------+--------+--------+--------------+------------+--------------+
| chr1  |  955502 | 991499 | +      | NM_198576    | AGRN       |          474 |
| chr1  |  948846 | 949919 | +      | NM_005101    | ISG15      |        42054 |
| chr1  |  934341 | 935552 | -      | NM_021170    | HES4       |        56421 |
| chr1  |  934343 | 935552 | -      | NM_001142467 | HES4       |        56421 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_001160184 | PLEKHN1    |        81489 |
| chr1  |  895966 | 901099 | +      | NM_198317    | KLHL17     |        90874 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
+-------+---------+--------+--------+--------------+------------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not transcription start site
+-------+---------+---------+--------+--------------+------------+----------------+
| chrom | txStart | txEnd   | strand | name         | geneSymbol | 991973-txStart |
+-------+---------+---------+--------+--------------+------------+----------------+
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1072396 | 1079434 | +      | NR_038869    | LOC254099  |         -80423 |
| chr1  | 1102483 | 1102578 | +      | NR_029639    | MIR200B    |        -110510 |
| chr1  | 1103242 | 1103332 | +      | NR_029834    | MIR200A    |        -111269 |
| chr1  | 1104384 | 1104467 | +      | NR_029957    | MIR429     |        -112411 |
| chr1  | 1109285 | 1133313 | +      | NM_001130045 | TTLL10     |        -117312 |
+-------+---------+---------+--------+--------------+------------+----------------+
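
The "closest" custom track mentioned above can be built from the query result. The sketch below assumes the rows are tab-separated without a header line (for example, from the mysql client's -N and -B options) and relies on UCSC gene tables storing 0-based, half-open coordinates, the same convention as BED. The function name to_bed_track and the description text are illustrative.

```sh
#!/bin/sh
# Convert tab-separated rows of
#   chrom, txStart, txEnd, strand, name, geneSymbol, ...
# into a BED6 custom track: chrom, start, end, name, score, strand.
# $1 is the track name to put on the "track" line.
to_bed_track() {
    printf 'track name="%s" description="nearest transcripts"\n' "$1"
    awk -F '\t' -v OFS='\t' '{ print $1, $2, $3, $5, 0, $4 }'
}

# Demo with the first upstream row from the output above
printf 'chr1\t955502\t991499\t+\tNM_198576\tAGRN\t474\n' | to_bed_track closest
```

The resulting text can be pasted into the Genome Browser's custom track form. The score column is set to 0 because the query does not provide one.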

"Nearest gene" script for ncbiRefSeq on hg38

  • This script will find the closest transcripts to a reference point region for the gene set ncbiRefSeq on hg38.
  • Note that the last column in the SQL output is the distance: the number of bp between each transcript and the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached. That point may be the 5' end or the 3' end depending on strand orientation.
#!/bin/sh

# for gene set ncbiRefSeq
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg38
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.


echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G


echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G

This produces the output:

closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+----------------+---------+--------------+
| chrom | txStart | txEnd  | strand | name           | name    | 991973-txEnd |
+-------+---------+--------+--------+----------------+---------+--------------+
| chr1  |  975198 | 982117 | -      | NM_001291367.1 | PERM1   |         9856 |
| chr1  |  975198 | 982117 | -      | NM_001291366.1 | PERM1   |         9856 |
| chr1  |  975198 | 982093 | -      | XM_017002583.1 | PERM1   |         9880 |
| chr1  |  975198 | 982021 | -      | XM_017002584.1 | PERM1   |         9952 |
| chr1  |  975197 | 981657 | -      | XM_017002585.1 | PERM1   |        10316 |
| chr1  |  966496 | 975108 | +      | NM_032129.2    | PLEKHN1 |        16865 |
| chr1  |  966496 | 975108 | +      | NM_001160184.1 | PLEKHN1 |        16865 |
| chr1  |  965819 | 974587 | +      | XM_006710944.3 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002476.1 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002474.1 | PLEKHN1 |        17386 |
+-------+---------+--------+--------+----------------+---------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not the transcription start site
+-------+---------+---------+--------+----------------+--------------+----------------+
| chrom | txStart | txEnd   | strand | name           | name         | 991973-txStart |
+-------+---------+---------+--------+----------------+--------------+----------------+
| chr1  |  998961 | 1000172 | -      | NM_021170.3    | HES4         |          -6988 |
| chr1  |  998961 | 1001052 | -      | XM_005244771.4 | HES4         |          -6988 |
| chr1  |  998963 | 1000172 | -      | NM_001142467.1 | HES4         |          -6990 |
| chr1  | 1013466 | 1014540 | +      | NM_005101.3    | ISG15        |         -21493 |
| chr1  | 1020101 | 1056119 | +      | XM_011541429.2 | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XR_946650.2    | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XM_005244749.3 | AGRN         |         -28128 |
| chr1  | 1020122 | 1056119 | +      | NM_198576.3    | AGRN         |         -28149 |
| chr1  | 1020122 | 1056119 | +      | NM_001305275.1 | AGRN         |         -28149 |
| chr1  | 1059706 | 1066441 | +      | XR_001737601.1 | LOC100288175 |         -67733 |
+-------+---------+---------+--------+----------------+--------------+----------------+
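
Because the reported distance is strand agnostic, it can differ from the distance to the transcription start site (TSS). The following sketch post-processes tab-separated rows (for example, from the mysql client's -N and -B options) and appends the TSS and the signed value reference point minus TSS. It uses the raw table coordinates, taking txStart as the TSS for + strand items and txEnd for - strand items. The function name tss_distance is an assumption made for illustration.

```sh
#!/bin/sh
REF=991973   # reference point from the example above

# Read rows of chrom, txStart, txEnd, strand, name, geneSymbol
# and append two columns:
#   tss      = txStart for + strand, txEnd for - strand
#   ref-tss  = reference point ($1) minus tss, same sign convention as above
tss_distance() {
    awk -F '\t' -v OFS='\t' -v ref="$1" '{
        tss = ($4 == "+") ? $2 : $3
        print $0, tss, ref - tss
    }'
}

# Demo with two rows from the output above
printf 'chr1\t966496\t975108\t+\tNM_032129.2\tPLEKHN1\n' | tss_distance "$REF"
printf 'chr1\t998961\t1000172\t-\tNM_021170.3\tHES4\n'   | tss_distance "$REF"
```

For these two rows the TSS-based values are 25477 and -8199, compared with 16865 and -6988 in the strand-agnostic columns above.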