Finding nearby genes: Difference between revisions

From genomewiki
Jump to navigationJump to search
Line 2: Line 2:
If you are interested in a certain genomic position, or reference point, and you want to find a sample
If you are interested in a certain genomic position, or reference point, and you want to find a sample
of nearby genes upstream and downstream from this position, you can create a script by copying
of nearby genes upstream and downstream from this position, you can create a script by copying
one of the examples below.  
one of the examples below. These scripts will find the nearest transcripts (upstream and downstream)
from your reference point, and report the gene name also. The last two scripts (for hg19 & hg38) will
also report the distance from the nearby transcripts and the reference point.  


* Open your editor on the command line and create a script in your bin directory.  
* Open your editor on the command line and create a script in your bin directory.  

Revision as of 20:36, 26 May 2017

Introduction

If you are interested in a certain genomic position, or reference point, and you want to find a sample of nearby genes upstream and downstream from this position, you can create a script by copying one of the examples below. These scripts will find the nearest transcripts (upstream and downstream) from your reference point, and report the gene name also. The last two scripts (for hg19 & hg38) will also report the distance from the nearby transcripts and the reference point.

  • Open your editor on the command line and create a script in your bin directory.
  • E.g.,
vi closestGene.sh
  • Paste one of the scripts below into your closestGene.sh file
  • Run the script.
closestGene.sh

Alternatives

  • Galaxy has a "Fetch closest non-overlapping feature" tool
  • the BedTools include a tool "closestBed"

Script for knownGene on hg18

View table schema for knownGene, hg18

#!/bin/sh

# given position chr1:710000-720000
# find a sample of genes near this upstream and downstream
C=chr1
S=710000
E=720000

echo "three upstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
ORDER BY e.txEnd DESC limit 3;' hg18

echo "three downstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 3;' hg18

This produces the output:

three upstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
| chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
| chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
+------+--------+--------+------------+----------+
three downstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
| chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
| chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
+------+--------+--------+------------+----------+


Script for refGene on hg19

Here is a script for the gene set refGene on hg19. For this example, the output can be seen in this session, where the custom track labeled, "closest" are the regions in the MySQL output (the 10 closest transcripts upstream, and the 10 closest transcripts downstream). The other custom track, labeled, "distanceCheck" is derived from the last column in the SQL output, the number of bp that each transcript is from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.

View the table schema for refGene, hg19

#!/bin/sh

# for gene set refGene
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg19
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# This script uses the gene set refGene.
# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.

echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit 10;' $G

echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 10;' $G
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+--------------+------------+--------------+
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txEnd |
+-------+---------+--------+--------+--------------+------------+--------------+
| chr1  |  955502 | 991499 | +      | NM_198576    | AGRN       |          474 |
| chr1  |  948846 | 949919 | +      | NM_005101    | ISG15      |        42054 |
| chr1  |  934341 | 935552 | -      | NM_021170    | HES4       |        56421 |
| chr1  |  934343 | 935552 | -      | NM_001142467 | HES4       |        56421 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_001160184 | PLEKHN1    |        81489 |
| chr1  |  895966 | 901099 | +      | NM_198317    | KLHL17     |        90874 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
+-------+---------+--------+--------+--------------+------------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not transcription start site
+-------+---------+---------+--------+--------------+------------+----------------+
| chrom | txStart | txEnd   | strand | name         | geneSymbol | 991973-txStart |
+-------+---------+---------+--------+--------------+------------+----------------+
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1072396 | 1079434 | +      | NR_038869    | LOC254099  |         -80423 |
| chr1  | 1102483 | 1102578 | +      | NR_029639    | MIR200B    |        -110510 |
| chr1  | 1103242 | 1103332 | +      | NR_029834    | MIR200A    |        -111269 |
| chr1  | 1104384 | 1104467 | +      | NR_029957    | MIR429     |        -112411 |
| chr1  | 1109285 | 1133313 | +      | NM_001130045 | TTLL10     |        -117312 |
+-------+---------+---------+--------+--------------+------------+----------------+

Script for ncbiRefSeq on hg38

Here is a script for the gene set ncbiRefSeq on hg38.

Note that the last column in the SQL output is the distance, or the number of bp that each transcript is, from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.

View table schema for ncbiRefSeq, hg38

#!/bin/sh

# for gene set ncbiRefSeq
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg38
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.


echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G


echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${E} - txEnd"
echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G

This produces the output:

closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+----------------+---------+--------------+
| chrom | txStart | txEnd  | strand | name           | name    | 991973-txEnd |
+-------+---------+--------+--------+----------------+---------+--------------+
| chr1  |  975198 | 982117 | -      | NM_001291367.1 | PERM1   |         9856 |
| chr1  |  975198 | 982117 | -      | NM_001291366.1 | PERM1   |         9856 |
| chr1  |  975198 | 982093 | -      | XM_017002583.1 | PERM1   |         9880 |
| chr1  |  975198 | 982021 | -      | XM_017002584.1 | PERM1   |         9952 |
| chr1  |  975197 | 981657 | -      | XM_017002585.1 | PERM1   |        10316 |
| chr1  |  966496 | 975108 | +      | NM_032129.2    | PLEKHN1 |        16865 |
| chr1  |  966496 | 975108 | +      | NM_001160184.1 | PLEKHN1 |        16865 |
| chr1  |  965819 | 974587 | +      | XM_006710944.3 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002476.1 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002474.1 | PLEKHN1 |        17386 |
+-------+---------+--------+--------+----------------+---------+--------------+
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txStart is the 3' end, not the transcription start site
+-------+---------+---------+--------+----------------+--------------+----------------+
| chrom | txStart | txEnd   | strand | name           | name         | 991973-txStart |
+-------+---------+---------+--------+----------------+--------------+----------------+
| chr1  |  998961 | 1000172 | -      | NM_021170.3    | HES4         |          -6988 |
| chr1  |  998961 | 1001052 | -      | XM_005244771.4 | HES4         |          -6988 |
| chr1  |  998963 | 1000172 | -      | NM_001142467.1 | HES4         |          -6990 |
| chr1  | 1013466 | 1014540 | +      | NM_005101.3    | ISG15        |         -21493 |
| chr1  | 1020101 | 1056119 | +      | XM_011541429.2 | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XR_946650.2    | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XM_005244749.3 | AGRN         |         -28128 |
| chr1  | 1020122 | 1056119 | +      | NM_198576.3    | AGRN         |         -28149 |
| chr1  | 1020122 | 1056119 | +      | NM_001305275.1 | AGRN         |         -28149 |
| chr1  | 1059706 | 1066441 | +      | XR_001737601.1 | LOC100288175 |         -67733 |
+-------+---------+---------+--------+----------------+--------------+----------------+