Step 1: Finding a Specific Gene or Protein to Study
Searching with words that describe or label the sequence
Simple keyword searching
The initial search option, presented on the home page and in the header of data pages as a text box with a "Go" button, is a keyword search against the text of the data records. It therefore suffers from the same limitations as all keyword searches, such as sensitivity to misspellings and synonyms. Most genes and gene products can be described by several text strings. In this example, we will try to find an enzyme in the folate biosynthesis pathway that has several common names but one specific EC number. The gene that encodes the target enzyme has been given different names by groups working on different organisms. Any of the following terms may be used to describe the target enzyme:
- 7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase
- hydroxymethylpterin pyrophosphokinase
- HPPK
- pyrophosphokinase
- sulD
- folK
- folate biosynthesis
- EC 2.7.6.3
- 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase
Use your favorite strategy to compose a keyword search in the box below (or on the NMPDR home page). Some of these terms return no hits, while others return hundreds. Neither outcome is useful. A new search form is presented at the bottom of the search results table so that you may revise your search. As with all keyword searches, an appropriate subset of the above terms will return the record of interest. (Use the back button on your browser to resume this tutorial.)
Keywords can include gene IDs (gi|16802272), gene names (folK), EC numbers (2.7.6.3), genus (Vibrio), species (vulnificus), words contained in subsystem names (synthesis), functional assignments (pyrophosphokinase), and subsystem classes (cofactors). You may also use attributes like iedb, virulence, and essential. A list of protein-encoding genes that match all of the keywords will be returned.
To search for genes matching only some of the keywords, surround the optional words with parentheses. For example, 2.7.6.3 4.1.2.25 would match only bifunctional genes associated with both EC numbers 2.7.6.3 and 4.1.2.25, while (2.7.6.3) (4.1.2.25) would match the bifunctional genes as well as all single function genes with either of those EC numbers. Use a minus sign to exclude genes matching a particular keyword. For example, pyrophosphokinase -2-amino-4-hydroxy-6-hydroxymethyldihydropteridine would match all pyrophosphokinases acting on substrates other than 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine.
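The matching rules described above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the NMPDR implementation: the function names, the annotation strings, and the use of case-insensitive substring matching are all assumptions made for the example. It also assumes that when optional (parenthesized) words are present, at least one of them must match.

```python
def parse_query(query):
    """Split a query into required, optional, and excluded keywords."""
    required, optional, excluded = [], [], []
    for token in query.split():
        if token.startswith("(") and token.endswith(")"):
            optional.append(token[1:-1].lower())      # (word) is optional
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())        # -word is excluded
        else:
            required.append(token.lower())            # plain word is required
    return required, optional, excluded


def matches(record_text, query):
    """Return True if the record text satisfies the query."""
    text = record_text.lower()
    required, optional, excluded = parse_query(query)
    if any(word in text for word in excluded):
        return False
    if not all(word in text for word in required):
        return False
    # Assumption: if optional words are given, at least one must match.
    return not optional or any(word in text for word in optional)


# Hypothetical annotation strings, for illustration only.
RECORDS = {
    "bifunctional": "dihydroneopterin aldolase (EC 4.1.2.25) / "
                    "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine "
                    "pyrophosphokinase (EC 2.7.6.3)",
    "hppk_only": "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine "
                 "pyrophosphokinase (EC 2.7.6.3)",
    "aldolase_only": "dihydroneopterin aldolase (EC 4.1.2.25)",
    "other_kinase": "thiamin pyrophosphokinase (EC 2.7.6.2)",
}


def search(query):
    """Return the sorted names of all records that match the query."""
    return sorted(name for name, text in RECORDS.items() if matches(text, query))


if __name__ == "__main__":
    for q in ("2.7.6.3 4.1.2.25",
              "(2.7.6.3) (4.1.2.25)",
              "pyrophosphokinase -2-amino-4-hydroxy-6-hydroxymethyldihydropteridine"):
        print(q, "->", search(q))
```

Running the script prints the records selected by each of the three example queries from the paragraph above, which makes it easy to compare the all-keywords, any-keyword, and exclusion behaviors side by side.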
Restricting keyword search to selected organisms
There are several ways to limit the scope of a keyword search to organisms of interest to you. First, you may simply include the organism genus and/or species and/or strain name among the keywords entered in the simple keyword search box.
Second, if you start on one of the five NMPDR organism summary pages, simple keyword searches are automatically limited to that group of organisms. Try, for example, searching for EC 2.7.6.3 from the Campylobacter page, linked in the left navigation bar.
Third, from the Organism Data Summaries navigation link, you may select any single organism and view its statistics page, which contains a search box that is limited to that organism. The statistics page also provides direct links to lists of genes that have or have not been included in subsystems by NMPDR curators.
Finally, the "Genes" option below the simple keyword search box provides access to an advanced search form, which is also accessible from the "Advanced" button that appears in the short form at the bottom of keyword search results. The advanced search options include a list of genomes for limiting your keyword search and a menu of subsystems that may be used to restrict your keyword search.
Genomes are grouped in the genomes list with the NMPDR focus organisms listed first, followed by the Archaea (blue), Bacteria (pink), and Eukarya (yellow). Within groups, genomes are alphabetized. Select a single genome directly by clicking on its name in the list box. To select multiple genomes, hold down the CTRL key while clicking. To select a range of genomes, hold down the SHIFT key while clicking. Selected genomes appear in the box below the buttons as they are selected.
It is also possible to select all genomes whose name includes text you type into the form. For example, if you type pneumoniae into the box and click the "Select genomes containing" button, all genomes that contain "pneumoniae" in the name will be selected, including species of Streptococcus and Chlamydophila, as well as Mycoplasma hyopneumoniae. You can also type an NCBI taxonomy ID into the box: 171101 will select Streptococcus pneumoniae R6.
Four buttons are also provided: Select All selects all the genomes, Clear All de-selects all the genomes, Select NMPDR selects all the NMPDR focus genomes, and Select Supporting selects all except the NMPDR focus genomes.
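The "Select genomes containing" behavior amounts to a case-insensitive substring match on the genome name, plus an exact match on the taxonomy ID. The sketch below illustrates that idea only; the genome list, the field names, and the function are hypothetical, and taxonomy IDs that the tutorial does not state are left as None rather than guessed.

```python
# Hypothetical genome list; only the ID given in the tutorial is filled in.
GENOMES = [
    {"name": "Streptococcus pneumoniae R6", "tax_id": "171101"},
    {"name": "Chlamydophila pneumoniae", "tax_id": None},
    {"name": "Mycoplasma hyopneumoniae", "tax_id": None},
    {"name": "Vibrio vulnificus", "tax_id": None},
]


def select_genomes(text, genomes=GENOMES):
    """Select genomes whose name contains the text or whose tax ID equals it."""
    needle = text.strip().lower()
    if not needle:
        return []
    return [g["name"] for g in genomes
            if needle in g["name"].lower() or needle == g["tax_id"]]


if __name__ == "__main__":
    print(select_genomes("pneumoniae"))
    print(select_genomes("171101"))
```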
Searching the sequence data directly
BLAST -- Sequence alignment searching
The BLAST family of tools uses local sequence alignments to search for matching sequences in the database. BLAST uses a DNA or amino acid sequence as the query term instead of one or more keywords.
Suppose you did not know the EC number for our example enzyme, HPPK, and a search with your first choice of common name returned no usable results. However, you have the amino acid sequence of the E. coli version:
>E.coli K12 HPPK
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA
LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK
NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW
Copy the sequence above and paste it into the sequence box on the sequence search page. Since this is an amino acid sequence, set the tool to blastp. From the scrolling menu, choose any organism of interest to BLAST against. Multiple genomes may be selected by holding down the CTRL or SHIFT key as you click. Buttons are also provided for selecting all NMPDR focus genomes, or all of the supporting genomes. Click the button labeled "BLAST." The table of BLAST results is ranked by score, with the most significant hits at the top. The top entry in the table is most likely to be the target protein.
You may also use a nucleotide sequence to find your gene of interest:
>E.coli K12 HPPK gene
atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat
gctgccctgaaagcattaggcgatatccctgaaagccacattcttaccgtttcttcgttt
taccgcaccccaccgctggggccgcaagatcaacccgattacttaaacgcagccgtggcg
ctggaaacctctcttgcacctgaagagctactcaatcacacacagcgtattgaattgcag
caaggtcgcgtccgcaaagctgaacgctggggaccacgcacgctggatctcgacatcatg
ctgtttggtaatgaagtgataaatactgaacgcctgaccgttccgcactacgatatgaag
aatcgtggatttatgctgtggccgctgtttgaaatcgcgccggagttggtgtttcctgat
ggggagatgttgcgtcaaatcttacatacaagagcatttgacaaattaaacaaatggtaa
If you are interested in finding many orthologs of the query sequence, select the blastx tool, which translates the nucleotide sequence and compares the result to proteins in the database to find matching genes.
If you want to find the data page for the exact sequence you entered, select the blastn tool, which matches the query (input) nucleotide sequence against nucleotide sequences in the database. Because nucleotide sequences use an alphabet of only four characters and the genetic code is degenerate, blastn finds shorter matching sequences than blastx finds with the same query.
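The effect of degeneracy can be seen with a small translation sketch. The code below is an illustration only and is unrelated to how BLAST itself works: it translates DNA with the standard genetic code and compares the first seven codons of the E. coli gene above with a hypothetical variant that uses different codons for the same amino acids. The variant sequence and the function names are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA in frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3          # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# First seven codons of the E. coli gene shown above.
original = "atgacagtggcgtatattgcc"
# Hypothetical variant with synonymous codon changes.
synonymous = "atgaccgtcgcctacatcgca"

if __name__ == "__main__":
    print(translate(original), translate(synonymous))
    print("nucleotide identity:", identity(original, synonymous))
    print("protein identity:", identity(translate(original), translate(synonymous)))
```

The two DNA strings differ at several positions yet translate to the same peptide, which is why a protein-level comparison (blastx) can recognize related genes that a nucleotide-level comparison (blastn) matches only in short stretches.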
Scan -- Sequence pattern, or motif, searching
Another way to search for proteins or genes is to make use of known sequence patterns, or motifs, that are characteristic of a functional group of proteins. For example, a signature of HPPK enzymes has been defined by ProSite as this: [KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2). Such a sequence is more commonly written in the text of a journal article, for example, as: (KRHD)X(GA)(PSAE)RXXD(LIV)D(LIVM)(LIVM).
The abstract instruction conveyed by the pattern is, "One of either lysine or arginine or histidine or aspartate, followed by any single amino acid, followed by either glycine or alanine, then one of these four, then arginine, then any two amino acids, then aspartate, then one of these three, then aspartate, then one of these four, then one of the same four again." All of the following three examples convey the same instruction:
any(KRHD) x any(GA) any(PSAE) RxxD any(LIV) D any(LIVM) any(LIVM)
any(KRHD) 1...1 any(GA) any(PSAE) R 2...2 D any(LIV) D any(LIVM) any(LIVM)
(K | (R | (H | D))) X (G | A) (P | (S | (A | E))) RXXD (L | (I | V)) D (L | (I | (V | M))) (L | (I | (V | M)))
The word "any" must be in lower case letters to indicate a choice because those three letters stand for amino acids when presented in all caps. A space should separate elements of the pattern. The letter "X" is the wild card and specifies any of the 20 amino acids. The choice of any amino acid may also be indicated by the minimum and maximum number of amino acids allowed, separated by three dots. For example, both "XX" and "2...2" mean any two amino acids, whereas "2...4" means any two, three, or four amino acids. The third way to indicate a choice is by the use of nested parentheses and the symbol "|", commonly used as "or" in computer science. This is not a lower-case letter L nor an upper-case letter i. It is sometimes called a pipe, and is usually typed as Shift+\ on the keyboard.
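If you start from a ProSite pattern, rewriting it in the "any(...)" form by hand is error-prone. The following sketch automates the conversion for the simple elements used in this example: single residues, bracketed choices, the x wild card, and a fixed repeat count in parentheses. The function name is hypothetical, and other ProSite features (ranges such as x(2,4), exclusions in braces, and anchors) are deliberately not handled.

```python
import re

# One ProSite element: a bracketed choice or a single letter,
# optionally followed by a fixed repeat count such as (2).
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)\))?")


def prosite_to_scan(pattern):
    """Convert a simple ProSite pattern to the space-separated any(...) form."""
    elements = []
    for part in pattern.split("-"):
        m = ELEMENT.fullmatch(part)
        if m is None:
            raise ValueError(f"unsupported pattern element: {part!r}")
        symbol, count = m.group(1), int(m.group(2) or 1)
        if symbol.lower() == "x":
            elements.append(f"{count}...{count}")                 # wild card run
        elif symbol.startswith("["):
            elements.extend([f"any({symbol[1:-1]})"] * count)     # choice
        else:
            elements.extend([symbol] * count)                     # fixed residue
    return " ".join(elements)


if __name__ == "__main__":
    print(prosite_to_scan("[KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2)"))
```

For the HPPK signature, the result is the second of the three equivalent patterns listed above.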
Try copying any of the three patterns into the sequence box on the sequence search page. Since this is an amino acid sequence, select protScan from the tool menu. Use the genomes list to select organisms to search, then click the Scan button.
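To check a pattern against a single sequence before running a scan, the same signature can be written as a regular expression. This is a local sanity check under the assumption that a regular expression is an adequate stand-in for the pattern; it is not the protScan tool, and the function name is hypothetical. The sequence is the E. coli HPPK protein given earlier in this lesson.

```python
import re

# E. coli K12 HPPK amino acid sequence from the BLAST example.
HPPK = ("MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA"
        "LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK"
        "NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW")

# [KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2) as a regular expression.
SIGNATURE = re.compile(r"[KRHD].[GA][PSAE]R.{2}D[LIV]D[LIVM]{2}")


def find_signature(protein):
    """Return (1-based start position, matched text) for each hit."""
    return [(m.start() + 1, m.group()) for m in SIGNATURE.finditer(protein.upper())]


if __name__ == "__main__":
    for position, hit in find_signature(HPPK):
        print(position, hit)
```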
Nucleic acid patterns may be used as input with the dnaScan tool. Pattern rules for spacing are similar to those for amino acid patterns. Limited options in degenerate positions are indicated using the IUB single-letter code for degenerate sequences:
| Code | Bases represented (mnemonic) |
| --- | --- |
| M | A or C (aMino) |
| R | A or G (puRine) |
| W | A or T (Weak, 2 H-bonds) |
| S | C or G (Strong, 3 H-bonds) |
| Y | C or T (pYrimidine) |
| K | G or T (Keto) |
| V | A or C or G (not T; V > T) |
| H | A or C or T (not G; H > G) |
| D | A or G or T (not C; D > C) |
| B | C or G or T (not A; B > A) |
| N | A or C or G or T |
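The degenerate codes in the table map directly onto character classes, so a nucleotide pattern can be tested locally by expanding it into a regular expression. The sketch below shows one way to do this; it is not the dnaScan tool, the function names are hypothetical, and it searches only the strand given, not the reverse complement.

```python
import re

# IUB codes from the table above, expressed as regex character classes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]",
    "V": "[ACG]", "H": "[ACT]", "D": "[AGT]", "B": "[CGT]",
    "N": "[ACGT]",
}


def iub_to_regex(pattern):
    """Expand a degenerate DNA pattern into a regular expression string."""
    return "".join(IUB[base] for base in pattern.upper())


def find_sites(pattern, dna):
    """Return 0-based start positions of non-overlapping matches in dna."""
    regex = re.compile(iub_to_regex(pattern))
    return [m.start() for m in regex.finditer(dna.upper())]


if __name__ == "__main__":
    print(iub_to_regex("GANTC"))
    # First 12 bases of the E. coli HPPK gene shown earlier.
    print(find_sites("RTGACA", "atgacagtggcg"))
```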
Summary
The gene product that you want to study may be located in the NMPDR by searching for one or more text strings in a keyword search, or by searching directly for the protein or nucleic acid sequence using BLAST or Scan. The results of these searches are presented in a table with links to the GBrowse environment, which will allow you to walk the chromosome surrounding the gene, and to the NMPDR environment for comparative analysis of genomes. Pattern scan results are presented with a link to a Context viewer instead of GBrowse.
Please see the next lesson, Navigating NMPDR, for an explanation of the tools available on the NMPDR protein page.