Frequently Asked Questions
- Can I blast multiple sequences at once against more than one genome?
- How do I save or download data?
- What is a feature?
- What is a subsystem?
- Is there an instruction manual?
- What practical use can a bench scientist make of comparative genomics?
- What's new in the latest release of the NMPDR?
- How do I find what I'm looking for?
- What genomes are supported in the NMPDR?
- What can I do in the GBrowse environment?
- What can I do in the NMPDR environment?
- What is meant by "functional coupling"?
- How can I find pathogenicity islands and prophages?
- What similarities are shown?
- What about links to other tools?
- How do I use the signature genes tool to search for genes that discriminate between two sets of organisms?
- Can I blast multiple sequences at once against more than one genome?
- How do I save or download data?
- To save the table of search results without the columns headed "GBrowse" and "NMPDR Protein Page," simply click on the link that reads "Click here to download the full search results." This will save all results, not just the first 50 displayed, as a tab-delimited text file that may be opened as a spreadsheet.
- To save the table of search results including active links to GBrowse and NMPDR Protein Pages, the procedure depends on your choice of browser. From Firefox or Safari, save the page as a complete or archive web page. When you open the local copy, the buttons will open GBrowse or NMPDR protein pages. From Internet Explorer, you may copy the results table and paste it into Excel; the buttons will open links to NMPDR. Only the results displayed will be saved, so you may want to reload the page with more results displayed per page.
- To save protein or gene sequences, just copy the displayed FASTA-formatted text and paste it into a local file.
- To save the protein context graphic, just select or point to it and save it as an image. This is true also for the compare regions and pins displays.
- To save information about homologous sets of genes in the commentary for compare regions or pins, save the page as you would for the search results.
- To save the results of the signature genes tool, save the page as a complete or archive web page. Alternatively, you may copy the results table and paste it into Excel (this works best from IE).
- To download whole annotated genomes for each of the focus organisms in modified GFF3 format, see the downloads page. The formatted GFF3 files contain rows of records, each with nine tab-delimited fields: seqid, source, type, start, end, score, strand, phase, and attributes. The "score" and "phase" fields are not in use, so in each row, those fields contain the "." character. Each row describes a feature, which is a region on the DNA located between start and end nucleotide coordinates. To describe a protein-encoding gene, two rows are used to record two features at the same location: gene and CDS. FASTA formatted gene and protein sequences follow the tab-delimited table of feature annotations.
- What is a feature?
- What is a subsystem?
- Is there an instruction manual?
- What practical use can a bench scientist make of comparative genomics?
- Well-defined open problems (knowledge gaps) revealed by comparative genome analysis of various subsystems. The major types of such problems, "missing genes" or "functionally coupled hypotheticals," are listed at Help on How to Pick problem Types.
- Hypotheses, testable predictions pertaining to these problems.
- Records of experimental follow-up and comments on any of the suggested hypotheses, in a range from "intend to test" to "proven right/wrong."
- In specific functional context. For example, in general, we avoid recording questions like: "what is the function of this hypothetical protein?" On the other hand, "missing gene" questions of a form: "what gene encodes this enzyme in otherwise complete pathway?" are highly valued.
- Within the realm of comparative genome analysis. We realize that many interesting problems of biology do not fall in this category.
- Tractable. This requirement may filter out many problems (e.g., related to complex regulatory systems, etc.) that may not be addressed by conjectures amenable to straightforward experimental verification.
- What's new in the NMPDR?
- Another new Listeria genome along with 28 other supporting genomes. The new Listeria is a nonpathogenic species found in soil, water, food, and sewage.
- Ten new Listeria genomes and 15 new supporting genomes!
- Virtual structural proteomes are provided for one representative strain of each of the five groups of NMPDR focus pathogens. The virtual structural proteome is a list of all proteins with orthologous crystal structures identified as BLASTP hits against PDB.
- A new search engine with different, more flexible forms for submitting queries to the database.
- Help on how to save or download data is integrated into the NMPDR pages and presented as a new FAQ! Updated help documents and more references to published studies that used our resources are also provided.
- Essential genes and candidate drug targets derived from these data are presented on separate entry pages. The essential genes page provides entry to data from 14 independent high-throughput assessments of essentiality of genes in 10 different organisms. Essential genes may be searched and analyzed in the NMPDR protein page. The drug targets page provides entry to the first draft of our table of candidates.
Candidate targets are defined as proteins that have experimental evidence of essentiality, an experimentally determined, orthologous crystal structure, and orthologs in the bacterial NIAID Priority Pathogens, and that have been included in a subsystem by our curators. This list will expand to include virulence factors, as well as candidates for vaccine and antitoxin targets. Candidates will then be prioritized, and selected structures will be used for in silico screening. Results of these computerized molecular docking studies will be made available as they are generated.
- New Organism Data Summaries page that serves as an entry point for any of the organisms in NMPDR, both the core set and the supporting genomes. Statistics returned for the selected organism include genome size, protein count, the number of subsystems curated for the organism, and the annotation status of proteins. A phylogenetic tree of all the organisms in the database is also linked from this new page.
- More resources, including pictures of pathogens, added to the resources links.
- Forums for user interaction established at the iLab of University of Illinois. These are linked at the top of each organism summary page. Presently only a bulletin board is implemented, but there is the capacity to add a document center for sharing lab protocols and an inquiry lab module for developing teaching materials. Please leave a note on the bulletin board if you are interested in these functions.
- Sources for obtaining strains and reagents have been added to the resource links.
- Genome annotation status tables appear on each of the organism summary pages. These tables provide links to lists of genes that are assigned functional names with roles in defined biological subsystems, as well as links to those genes assigned functional names but not yet assigned a role in a subsystem. These may be explored to discover conserved, "functional," clustering that may define new subsystems.
- How do I find what I'm looking for?
- By clicking on the GBrowse option, you will get to a GBrowse-based graphical interface displaying the layout of genes on the chromosome.
- By clicking on the NMPDR option, you will get to a page that focuses on the specific protein. You will see a table that lists other proteins in the neighborhood of the target gene, with the target highlighted in green. A graphic of the genetic neighborhood is also presented, with the target gene in green, functionally coupled neighbors of the target in blue, and unrelated neighboring genes in red. Additionally, access to annotations, sequence, subsystems, and comparisons with genes in other genomes is offered.
- What genomes are in the NMPDR?
- What can I do in the GBrowse environment?
- What can I do in the NMPDR environment?
- What is meant by "functional coupling"?
- How can I find pathogenicity islands and prophages?
- What similarities are shown?
- S1 and S2 are from different genomes (G1 ≠ G2),
- S1 is the most similar to S2 of all the sequences in the genome G1
- S2 is the most similar to S1 of all the sequences in the genome G2
- What about links to other tools?
- How does the signature genes tool work?
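The modified GFF3 layout described under the download question above (nine tab-delimited fields per row, followed by FASTA sequences) can be read with a few lines of Python. This is a minimal sketch, not an official NMPDR parser: the `Feature` type, the `read_features` function, and the example rows (including the source value `NMPDR` and the `ID=` attributes) are hypothetical and only illustrate the field order given in the text.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    """One row of the nine-field table described in the text."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str       # "." when not in use
    strand: str
    phase: str       # "." when not in use
    attributes: str


def read_features(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per nine-field row; stop where the FASTA section begins."""
    for raw in lines:
        line = raw.rstrip("\n")
        # Sequences follow the annotation table, so stop at the first FASTA marker.
        if line.startswith(">") or line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip rows that do not match the nine-field layout
        seqid, source, ftype, start, end, score, strand, phase, attrs = fields
        yield Feature(seqid, source, ftype, int(start), int(end),
                      score, strand, phase, attrs)


# Hypothetical example: a gene and its CDS recorded at the same location.
example = [
    "contig1\tNMPDR\tgene\t100\t400\t.\t+\t.\tID=gene1\n",
    "contig1\tNMPDR\tCDS\t100\t400\t.\t+\t.\tID=cds1\n",
    ">gene1\n",
    "ATGAAA\n",
]
features = list(read_features(example))
```

Because a protein-encoding gene is recorded as two rows at the same coordinates, filtering on `feature.type == "CDS"` is one way to count each protein once.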
Yes, to blast more than one sequence at a time, click the blast link from the home page, and paste all your FASTA sequences into the box. It makes no difference whether there is an empty line between the different sequences, just as long as each sequence begins on a new line with a FASTA header.
Then, in the genome selection part of the search form, you may select one or more genomes. Make sure you set the blast tool to blastp if your query sequences are proteins. The tool defaults to blastx, which requires the query sequences to be DNA. If you are blasting more than a handful of sequences, you may want to increase the number of search results per page from the default of 50. Now click the blast button.
Results are returned in order of blast score, which is dependent on protein length, so the result of a lot of sequences blasted against a lot of genomes may be a bit messy. The more work there is to do, the longer the search will take. After the search is complete, there will be a link toward the top of the results page that says "right-click to save url for this search," which you can bookmark and use again to run the same search without having to paste any sequences.
To save the table of results from Firefox or Safari, you can save a local copy as a complete web page; copying and pasting into Excel doesn't work very well. If you do the search in IE, you can copy the table and paste it into Excel with very good results. You can then re-sort the results by functional name or organism.
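Before pasting a multi-sequence query into the form, it can help to check locally that each record starts with a FASTA header and whether the sequences look like DNA (blastx) or protein (blastp). The sketch below is only an illustration: `split_fasta` and `looks_like_dna` are hypothetical helper names, and the 90% threshold is an assumed heuristic, not something the NMPDR blast tool uses.

```python
def split_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines between records make no difference
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_dna(sequence, threshold=0.9):
    """Assumed heuristic: treat a sequence as DNA if it is mostly A/C/G/T/N."""
    if not sequence:
        return False
    dna_letters = sum(1 for c in sequence.upper() if c in "ACGTN")
    return dna_letters / len(sequence) >= threshold


# Hypothetical query with an empty line between the two records.
query = ">seq1\nATGGCGTTAA\n\n>seq2\nMKTLLVAGGS\n"
records = split_fasta(query)
kinds = ["DNA" if looks_like_dna(seq) else "protein" for _, seq in records]
```

A query that mixes DNA and protein records, as this example does, would need to be split into two searches, since a single search uses one blast program for every sequence.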
A feature is anything that can be mapped onto a strand of DNA, and is defined by its start and stop location. A gene is a feature. A protein coding sequence (CDS or PEG) is a feature that, in bacteria, shares the same location on the DNA as its gene, but is represented as an amino acid sequence rather than a nucleotide sequence. Eukaryotic genes also have intron and exon features defined as subsets of the gene feature. Short regulatory elements or functional motifs may also be defined as features. Pathogenicity islands are features that include many genes. Operons are not presently annotated as features in NMPDR.
A subsystem has two components. First is a list of functional roles that are united by any common process or biologically meaningful organizing principle. Second is a spreadsheet, called a populated subsystem, which is a two-dimensional integration of biological functions with genome sequences. In the populated subsystem, functional roles are represented in columns, genomes are represented in rows, and cells of the spreadsheet are populated by the genes responsible for each function. Genes that are clustered on the chromosome share the same background color in the spreadsheet. Gene identification numbers are linked to NMPDR protein context pages. If multiple genes play the same functional role, the variants are named in the table of functional roles. The row number from that table is then appended to the gene number in the spreadsheet to identify which variant is used. Have a look at the Adhesins in Staphylococci as an example.
Subsystems may be accessed from the context page of a member protein by clicking the link in the biological context section of the page. Another way to find subsystems is to pick an organism from the list on the Subsystem Summaries page. After clicking on show subsystems, a metabolic reconstruction, or comprehensive list of subsystems and proteins that perform functional roles, is returned for the chosen genome. Subsystem headers link to populated subsystem (spreadsheet) displays, and proteins link to their respective context pages.
An investigator can learn much by establishing a subsystem of functions in genomes that are known to contain all the required genes, then using the computer to extend the subsystem to genomes about which less is known. NMPDR is used to browse subsystems established by our curators. The SEED may be used by investigators to create their own subsystems.
An interactive user guide is being developed with WebCT at NCSA. Anyone may create a free account and access the materials, which include a PowerPoint presentation about our philosophy of annotation and the handout distributed at tutorial workshops.
The lessons on searching and navigating are also on the WebCT site and will be supplemented with self test questions soon.
HOPS: public depository of Hypotheses and Open Problems identified by Subsystem analysis
Comparative analysis of genomes reveals multiple gaps in our knowledge of basic biochemical and cellular processes. Accurate mapping of the revealed open problems within a framework of specific subsystems and groups of organisms sets the stage for generating hypotheses amenable to experimental testing. In a growing number of cases, predictions of novel genes and pathways revealed by comparative genomics techniques have been successfully verified. Based on this vision, the scope of the HOPS Database is to build and maintain a public repository of:
It is important to emphasize that we aim to restrict the breadth of open problems to those that are:
Likewise, we aim to accumulate predictions that provide a precisely defined and testable functional role, transformation, or interaction. For this reason, "general class" functional predictions (e.g., putative kinase) are not the focus of our effort. By launching this site, we commit to populating it with problems and conjectures emerging from our effort to encode subsystems in the SEED environment, capturing many aspects of the Central Machinery of Life. Our goal is to share this information with the broad scientific community in order to encourage further computational and experimental analysis. Most importantly, we solicit community contributions to all three aspects of the HOPS Database, which is meant to become a joint effort of bioinformaticians and experimentalists. It is important to emphasize that an experimental verification of a single gene, carefully propagated via subsystems-based annotations, will often impact a significant number of genes in a variety of species.
All genomes may be searched from the home page. To limit your search to one of the NMPDR core organism groups, you may start your search from one of the organism summary pages:
On each page you will find a search box where you can search for specific genes or proteins by text. To search by gene or protein sequence (using BLAST), choose the Search option in the left navigation panel. Search results are returned in a table that links to two options:
To understand the full functionality available in the two environments, you will need to take some time to experiment. Feel free to contact us with any questions you have. We are adding help text and smoothing out the interfaces as quickly as possible, largely in response to specific requests and suggestions, so please do take the time to formulate them.
The NMPDR contains two classes of genomes: those pathogens we are being funded to annotate, which we call "core genomes," and "supporting genomes," which include all publicly available genomes for comparative analysis. The table below lists the core genomes, with strain designation and serotype when known, and closely related supporting genomes.
| Campylobacter | Listeria | Staphylococcus | Streptococcus | Vibrio |
|---|---|---|---|---|
| jejuni RM1221 | monocytogenes 1/2a EGD-e | aureus subsp. aureus COL | pneumoniae R6 unencapsulated | cholerae O1 ElTor str. N16961 |
| jejuni subsp. jejuni NCTC 11168 | monocytogenes 1/2a F6854 | aureus subsp. aureus MRSA252 | pneumoniae TIGR4 type 4 | cholerae O1 classical str. O395 |
| jejuni subsp. jejuni 81-176 | monocytogenes 1/2a F6900 | aureus subsp. aureus MSSA476 | pyogenes M1 GAS SF370 | cholerae O139 str. MO10 |
| jejuni subsp. jejuni 260.94 | monocytogenes 1/2a J0161 | aureus subsp. aureus MW2 | pyogenes M1 MGAS 5005 | cholerae non-O1 str. NRT36s |
| jejuni subsp. jejuni 84-25 | monocytogenes 1/2a J2818 | aureus subsp. aureus Mu50 | pyogenes M2 MGAS 10270 | parahaemolyticus RIMD 2210633 |
| jejuni subsp. jejuni CF93-6 | monocytogenes 1/2a 10403S | aureus subsp. aureus N315 | pyogenes M3 SSI-1 | vulnificus CMCP6 |
| jejuni subsp. jejuni HB93-13 | monocytogenes 1/2a1 FSL N3-165 | aureus subsp. aureus NCTC 8325 | pyogenes M3 MGAS 315 | vulnificus YJ016 |
| coli RM2228 | monocytogenes 1/2b FSL J1-194 | aureus subsp. aureus JH1 | pyogenes M4 MGAS 10750 | fischeri ES114 |
| fetus subsp. fetus 82-40 | monocytogenes 1/2b FSL R2-503 | aureus subsp. aureus JH9 | pyogenes M5 Manfredo | splendidus 12B01 |
| lari RM2100 | monocytogenes 4b Aureli 1997 HPB2262 | aureus subsp. aureus USA300 | pyogenes M6 MGAS 10394 | sp. MED222 |
| upsaliensis RM3195 | monocytogenes 4b F2365 | aureus RF122 | pyogenes M12 MGAS 2096 | sp. Ex25 O62 |
| | monocytogenes 4b FSL N1-017 | epidermidis RP62A | pyogenes M12 MGAS 9429 | |
| | monocytogenes 4b H7858 | epidermidis ATCC12228 | pyogenes M18 MGAS 8232 | |
| | monocytogenes 4c FSL J2-071 | haemolyticus JCSC1435 | pyogenes M28 MGAS 6180 | |
| | innocua 6a Clip11262 | saprophyticus ATCC 15305 | agalactiae serotype V 2603V/R | |
| | welshimeri 6b SLCC5334 | | agalactiae A909 | |
| | | | agalactiae NEM316 | |
Walking the chromosome
GBrowse is a tool developed for the Generic Model Organism Database (GMOD) Project, which we have adopted for use in NMPDR. The browser uses multiple tracks laid against the length of the genome, and each track contains specific information about the region on display. The displayed region is, by default, a 20 kbp window of the contig on which the gene of interest is located. The contig is represented by a ruler on which the displayed region is outlined in red. Most complete bacterial genomes will have a single long contig that represents a fully sequenced and closed chromosome. The Vibrios will have at least two contigs, as they have two chromosomes. Some essentially complete genomes are not completely closed, and will be distributed among several contigs of various lengths. All contigs for the selected organism will be listed above the browser.
The scope of the detailed region relative to the contig is represented by a red box on the genome numberline at the top of the display. The extent of the region shown may be increased or decreased by clicking the yellow plus or minus signs, or by making a different selection from the drop down menu. To "walk," or slide the display along the genome, click on the yellow arrows—the single arrow moves half the distance shown, while the double arrow slides the entire increment.
Below the contig overview is the detailed view of features within the selected region. GBrowse opens with the detailed display centered on the gene of interest, showing all genes as blue arrows labeled with functional annotations in pink, and with tracks that display the position of pathogenicity islands and prophages enabled. Genes are linked to their NMPDR context page. Different data tracks may be chosen by the user from selections listed beneath the display. For the NMPDR core organisms, it is possible to select data tracks for closely related genomes, which allows a visual comparison of several strains. The default display is customizable by setting user preferences for track options.
In the NMPDR environment you can visualize a gene, the protein it encodes, the context of the gene on its contig, and a wealth of specific information relating to that gene. The NMPDR protein page presents a table of information related to the target gene and those genes found up- and down-stream on the chromosome. The feature identification number for NMPDR, fid, is listed first. Next are listed the start and stop nucleotide coordinates, length, size of gap (or overlap) between genes, and the orientation on the + or - strand. The functional name assigned to each gene in the NMPDR annotation is listed under "function," while names or numbers assigned to the gene by other sources are listed as aliases with links to external resources such as UniProt, GenBank, and KEGG. A graphic representation of gene context shows the target gene in green, functionally related genes in blue, and unrelated neighbors in red.
Two powerful tools for comparative analysis of functional clustering are linked as buttons on the NMPDR protein page. The "CL" button will open a table of homologs to the target gene that appear in functional clusters in the genomes of the other organisms. The table is ordered by the size of the cluster, that is, by how many genes are functionally linked. The "Pins" button will present a graphic of gene clusters, similar to the context graphic, but this will include homologous regions from many organisms, ordered phylogenetically. The target gene is numbered 1, and clustered homologs share a numerical label.
At the bottom of the NMPDR protein page, links are provided to other sites and to tools that help to analyze the encoded protein. If you have a tool that you would like linked from our site, or a tool that you would like to link into our site, please contact us.
The term functional coupling has been used to indicate that two genes appear to have related functional roles (e.g., the encoded proteins both participate in the same metabolic pathway or they both are components in a single complex). One exciting challenge to bioinformatics is to predict functional coupling. Perhaps the most effective technique for doing so relates to analysis of proximity on the chromosome; when two genes tend to occur fairly close to one another in numerous genomes, it amounts to solid evidence that the roles of the gene products are closely related. For details see "The use of gene clusters to infer functional coupling," Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2896-2901.
Because comparative analysis of gene clusters has begun to play a much larger role in determination of gene function (due to the rapid increase in the number of available genomes), we have computed instances in which genes appear to be functionally coupled and make the inferences accessible from the NMPDR environment. When you are on the NMPDR protein page, which shows a table of the genes that occur in the region around the gene you are focused on, you will see a column labeled "fc-sc." If a number occurs in this column, there is evidence based on clustering that the genes with numbers are functionally coupled to the gene of focus in that number of genomes. The number is actually a link to a table displaying co-occurrences of the two genes.
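The idea behind a per-genome coupling count can be sketched in a few lines. This is a deliberately simplified illustration, not the computation behind the "fc-sc" column (the cited paper describes the actual method): the function name `coupling_count`, the family labels, the coordinates, and the 5 kbp proximity cutoff are all assumptions chosen for the example.

```python
def coupling_count(positions, family_a, family_b, max_gap=5000):
    """Count genomes in which members of two gene families lie close together.

    positions maps genome -> {family: (contig, start, end)} with start < end.
    max_gap is an assumed proximity cutoff in base pairs.
    """
    count = 0
    for families in positions.values():
        a = families.get(family_a)
        b = families.get(family_b)
        if a is None or b is None:
            continue  # one of the two genes is absent from this genome
        contig_a, start_a, end_a = a
        contig_b, start_b, end_b = b
        if contig_a != contig_b:
            continue  # genes on different contigs are not counted as neighbors
        # Distance between the two intervals; negative when they overlap.
        gap = max(start_a, start_b) - min(end_a, end_b)
        if gap <= max_gap:
            count += 1
    return count


# Hypothetical coordinates for two gene families in four genomes.
positions = {
    "G1": {"famA": ("c1", 100, 400), "famB": ("c1", 450, 900)},
    "G2": {"famA": ("c1", 100, 400), "famB": ("c1", 90000, 90500)},
    "G3": {"famA": ("c1", 100, 400), "famB": ("c2", 450, 900)},
    "G4": {"famA": ("c1", 2000, 2600), "famB": ("c1", 1000, 1900)},
}
score = coupling_count(positions, "famA", "famB")
```

Here the two families are neighbors in G1 and G4 only, so the count is 2; a larger count across many genomes is the kind of evidence the text describes.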
Pathogenicity islands and prophages are features that have a start and stop location on the contig, which usually overlap numerous other annotated features, such as the genes that fall into the same region, and sometimes each other. It is possible to get an overview of the pathogenicity islands and prophages on a linear map in the GBrowse environment, and then to zoom in on the region to see the curated CDS that fall within those regions.
The pathogenicity islands, prophages, and CDS:curated tracks are shown in GBrowse by default. Also by default, the detailed view opens to a 20 kbp region, which is too tight to get a good view of the larger features, but tight enough to see the curated names of the genes. By resetting the viewing region to 500 kbp or to 1 Mbp in the Show xxx bp pull-down menu, the large features will be located. Then zoom in by resetting the viewing area to the next smaller increment and clicking the numberline to stay centered on the large feature. The curated names of the CDS won't be visible until you get down to a 20 kbp region, but if you zoom in all at once, you may lose the large feature to the left or right of the detailed view. There are scales in both the detailed view and the overview to help find the feature as you zoom in.
NMPDR contains bi-directional best hits (BBH) precomputed using BLASTP. Two sequences S1 and S2 from genomes G1 and G2 are bi-directional best hits if:
The clear disadvantage of BBH is that duplicates or paralogs within the same genome will not be listed in the BBH table. The table will display the annotations, E-values (the number of hits with an equal or better score expected by chance), and links to similar proteins in other organisms.
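The bi-directional best hit rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the NMPDR pipeline: the input is assumed to be two precomputed score tables (G1 queries against G2, and G2 queries against G1), and the function names, sequence IDs, and scores are hypothetical.

```python
def top_hit(hits):
    """Map each query to its highest-scoring subject in the other genome."""
    return {query: max(subjects, key=subjects.get)
            for query, subjects in hits.items() if subjects}


def bidirectional_best_hits(g1_vs_g2, g2_vs_g1):
    """Return sorted (S1, S2) pairs that are each other's best hit."""
    best_in_g2 = top_hit(g1_vs_g2)   # best G2 sequence for each G1 sequence
    best_in_g1 = top_hit(g2_vs_g1)   # best G1 sequence for each G2 sequence
    return sorted((s1, s2) for s1, s2 in best_in_g2.items()
                  if best_in_g1.get(s2) == s1)


# Hypothetical scores: a1 and a2 are paralogs in G1 that both hit b1 in G2.
g1_vs_g2 = {"a1": {"b1": 200.0, "b2": 150.0}, "a2": {"b1": 180.0}}
g2_vs_g1 = {"b1": {"a1": 210.0, "a2": 175.0}, "b2": {"a1": 140.0}}
pairs = bidirectional_best_hits(g1_vs_g2, g2_vs_g1)
```

In this example only `("a1", "b1")` qualifies. The paralog `a2` also matches `b1` well but is dropped, which is the disadvantage noted above.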
When we load data into the NMPDR we collect links for each feature. Some of these links take you to corresponding entries in databases maintained by numerous groups worldwide. Other links take you to tools that will aid in your analysis of the proteins. In either case, you need to ensure that the database or tool is appropriate for the question that you are trying to answer. Not all tools are appropriate for all questions. If you know of a resource that we should be linking features to, please feel free to point this out to us.
Searching for genes that discriminate between two sets of organisms
The motivation for the Signature Genes Tool is to try to locate genes related to a phenotype that is associated with one set of organisms (call this set1) but not with another (call this set2).
The search goes through the genes in one organism from set1, selected as the reference genome. For each gene in the reference genome, the tool evaluates the bidirectional best hits of that gene in the genomes from set1 and set2. It tabulates these and constructs a score from 0 to 1. A score of 1 means that the gene has a bidirectional best hit in every genome from set1 and no bidirectional best hits against any genome in set2.
The scores are tabulated, and the best candidate genes are then presented to you as a list of genes to explore. The main shortcoming of the tool relates to our use of bidirectional best hits. If there are paralogs to the gene within genomes, a bidirectional best hit may not exist in a genome that contains several clear homologs. This means that we may miss genes with paralogs, and we may include genes that do not discriminate as well as the score seems to indicate. Treat each gene in the list as a candidate to explore, and nothing more. There is now the option of running the tool using precomputed similarities rather than bidirectional best hits.
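The text gives only the endpoints of the score, not the formula. The sketch below uses one assumed formula that satisfies the stated property (a score of 1 exactly when the gene has a bidirectional best hit in every set1 genome and in no set2 genome): the fraction of set1 genomes with a hit multiplied by the fraction of set2 genomes without one. The function name and genome labels are hypothetical, and the tool's real scoring may differ.

```python
def signature_score(genomes_with_bbh, set1, set2):
    """Assumed score in [0, 1] for one reference gene.

    genomes_with_bbh: set of genomes where the gene has a bidirectional best hit.
    set1, set2: non-empty lists of genome names to compare.
    """
    present_in_set1 = sum(1 for g in set1 if g in genomes_with_bbh) / len(set1)
    absent_from_set2 = sum(1 for g in set2 if g not in genomes_with_bbh) / len(set2)
    return present_in_set1 * absent_from_set2


# Hypothetical genome sets and hit patterns for three reference genes.
set1 = ["S1", "S2"]
set2 = ["T1", "T2"]
perfect = signature_score({"S1", "S2"}, set1, set2)            # hits all of set1, none of set2
partial = signature_score({"S1", "T1"}, set1, set2)            # half of set1, half of set2
shared = signature_score({"S1", "S2", "T1", "T2"}, set1, set2)  # present everywhere
```

Under this assumption a gene found in every genome of both sets scores 0, and ranking reference genes by the score puts the most discriminating candidates first.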