Step 1: Finding a Specific Gene or Protein to Study
Searching with words that describe or label the sequence
The initial search option is a keyword search against the text of the data records. It therefore suffers from the same limitations as all keyword searches, such as sensitivity to misspellings and synonyms. Most genes and gene products can be described by several text strings. In this example, we will try to find an enzyme in the folate biosynthesis pathway that has several common names but one specific EC number. The gene that encodes the target enzyme has been named independently by several groups working on different organisms. Any of the following terms may be used to describe the target enzyme:
- 7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase
- hydroxymethylpterin pyrophosphokinase
- HPPK
- pyrophosphokinase
- sulD
- folK
- folate biosynthesis
- EC 2.7.6.3
- 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase
Please open the NMPDR home page in a new tab or window, or download a PDF of this tutorial. Use your favorite strategy to compose a keyword search. Some of these terms will return no hits, while others will return hundreds. Neither outcome is useful. As with all keyword searches, only an appropriate subset of the above terms will return the record of interest.
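To make this limitation concrete, the sketch below runs a naive, case-insensitive substring search over three annotation strings. The record IDs and texts are hypothetical, written for illustration only and not taken from NMPDR, and the real NMPDR search may index and match text differently.

```python
# Hypothetical annotation strings, invented for illustration (not NMPDR records).
RECORDS = {
    "gene_A": "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine "
              "pyrophosphokinase (EC 2.7.6.3)",
    "gene_B": "Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)",
    "gene_C": "Thiamin pyrophosphokinase (EC 2.7.6.2)",
}


def keyword_search(term, records=RECORDS):
    """Return the IDs of records whose text contains the term, ignoring case."""
    needle = term.lower()
    return sorted(rid for rid, text in records.items() if needle in text.lower())


for term in ("pyrophosphokinase", "HPPK", "EC 2.7.6.3"):
    print(term, "->", keyword_search(term))
```

In this toy data set, the general term "pyrophosphokinase" matches every record, the abbreviation "HPPK" matches none because no annotation happens to contain it, and only the EC number isolates the single record of interest.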
There are three ways to limit the scope of a keyword search to a subset of NMPDR core organisms. First, if you start on an organism summary page, simple keyword searches are limited to that group of organisms. Try, for example, searching for "EC 2.7.6.3" from the Campylobacter page, linked in the left column.
Second, from the Organism Data Summaries navigation link, you may select any organism and view its statistics page, which contains a search box that is limited to the chosen organism. The statistics page also provides direct links to lists of genes that have or have not been included in subsystems by NMPDR curators.
Third, the "Search" option in the left navigation panel provides access to an advanced search form for genes, which includes a list of genomes for limiting your keyword search. Select a single genome by clicking, or multiple genomes by clicking while holding the Control or Shift key.
Searching the sequence data directly with BLAST
The BLAST entry page is also available from the homepage or search link. BLAST uses a DNA or amino acid sequence as the query term instead of one or more text strings.
Suppose you did not know the EC number for our example enzyme, HPPK, and a search with your first choice of common name returned no usable results, but you do have the amino acid sequence of the E. coli version:
>E.coli K12 HPPK
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA
LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK
NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW
Copy the sequence above and paste it into the box on the advanced search page. From the scrolling menu, choose any organism of interest to BLAST against. Multiple genomes may be selected by holding the Control or Shift key as you click. Buttons are also provided for selecting all NMPDR focus genomes or all of the supporting genomes. Since this is an amino acid sequence, set the tool to blastp, then click the button labeled "BLAST." The results table is ranked by score, with the most significant hits at the top, so the top entry is the most likely to be the target protein.
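The choice of tool depends on whether the query is protein or nucleotide. As a rough illustration, the sketch below parses a single FASTA record and guesses which tool fits a protein database. The function names `parse_fasta` and `suggest_blast_tool` and the 0.9 threshold are assumptions made for this example; they are not part of NMPDR or BLAST.

```python
def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    header = lines[0][1:].strip()
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def suggest_blast_tool(sequence, threshold=0.9):
    """Guess blastx (nucleotide query) or blastp (protein query).

    Heuristic, assumed for this sketch: a sequence made up almost entirely
    of A, C, G, T, U, or N is treated as nucleotide.
    """
    if not sequence:
        raise ValueError("empty sequence")
    nucleotide_like = sum(ch in "ACGTUN" for ch in sequence.upper())
    return "blastx" if nucleotide_like / len(sequence) >= threshold else "blastp"


record = """>E.coli K12 HPPK
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA
"""
header, seq = parse_fasta(record)
print(header, "->", suggest_blast_tool(seq))
```

The heuristic can misjudge very short or unusual sequences, so treat its answer as a hint and confirm the sequence type yourself before submitting the search.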
You may also use a nucleotide sequence to find your gene of interest:
>E.coli K12 HPPK gene
atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat
gctgccctgaaagcattaggcgatatccctgaaagccacattcttaccgtttcttcgttt
taccgcaccccaccgctggggccgcaagatcaacccgattacttaaacgcagccgtggcg
ctggaaacctctcttgcacctgaagagctactcaatcacacacagcgtattgaattgcag
caaggtcgcgtccgcaaagctgaacgctggggaccacgcacgctggatctcgacatcatg
ctgtttggtaatgaagtgataaatactgaacgcctgaccgttccgcactacgatatgaag
aatcgtggatttatgctgtggccgctgtttgaaatcgcgccggagttggtgtttcctgat
ggggagatgttgcgtcaaatcttacatacaagagcatttgacaaattaaacaaatggtaa
Select the blastx tool, which translates the nucleotide sequence and compares the result to proteins in the database to find matching genes.
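The translation step can be illustrated with a minimal sketch that applies the standard genetic code to a single reading frame. This is a simplification: blastx itself translates the query in all six reading frames and then scores alignments, which this example does not attempt. The function name `translate` is chosen for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 1 of a DNA string, stopping at the first stop codon."""
    seq = "".join(dna.split()).upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 60 nucleotides of the E. coli HPPK gene shown above.
first_line = "atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat"
print(translate(first_line))
```

Run on the first 60 nucleotides, the sketch prints MTVAYIAIGSNLASPLEQVN, which matches the first 20 residues of the amino acid sequence given earlier.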
The gene product that you want to study may be located in the NMPDR either by searching for one or more text strings in a keyword search, or by searching directly with the protein or nucleic acid sequence using BLAST. The results of either search are presented in a table with links to the GBrowse environment, which allows you to walk the chromosome surrounding the gene, and to the NMPDR environment for comparative analysis of genomes.
Please see the next lesson, Navigating NMPDR, for an explanation of the tools available on the NMPDR protein page.