NMPDR Wiki

The Official Documentation Site for the National Microbial Pathogen Data Resource

Annotation Procedure


NMPDR Standard Operating Procedure SOP010


Introduction

This standard operating procedure (SOP) describes the operations followed by NMPDR personnel for annotating gene function in the NMPDR database. Manual annotation of gene function within the NMPDR relates primarily to the creation and maintenance of two categories of data: populated subsystems and FIGfams (protein families derived from the populated subsystems and from analysis of genes occurring in closely related strains). We refer the reader to [1] for a detailed description of subsystems, how they are produced, and how they are intended to be used. The following quote from that paper summarizes the basic concepts:

By the term subsystem, we refer to a collection of functional roles that together implement a specific biological process or structural complex. A subsystem may be thought of as a generalization of the term pathway. Thus, just as glycolysis is composed of a set of functional roles (glucokinase, glucose-6-phosphate isomerase, phosphofructokinase, etc.), a complex like the ribosome or a transport system can be viewed as a collection of functional roles. The genes in each specific organism that includes the subsystem are thought of as implementing those functional roles. In this very general use of the term subsystem we make no distinction between metabolic subsystems (i.e., metabolic pathways) and non-metabolic subsystems.

By the term populated subsystem we refer to a subsystem encapsulated in a matrix in which each column represents a functional role for the subsystem, each row represents a specific genome, and each cell contains those genes from the specific organism that implement the specific functional role.
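The matrix described above can be pictured as a small data structure. The following is only an illustrative sketch, not the SEED implementation: the class name `PopulatedSubsystem`, its methods, and the genome and gene identifiers are all hypothetical.

```python
from collections import defaultdict


class PopulatedSubsystem:
    """Illustrative matrix: rows are genomes, columns are functional roles,
    and each cell holds the genes that implement that role in that genome."""

    def __init__(self, name, roles):
        self.name = name
        self.roles = list(roles)          # column order
        self._cells = defaultdict(set)    # (genome, role) -> set of gene ids

    def add_gene(self, genome, role, gene_id):
        """Place a gene in the cell for (genome, role)."""
        if role not in self.roles:
            raise KeyError(f"{role!r} is not a functional role of {self.name}")
        self._cells[(genome, role)].add(gene_id)

    def cell(self, genome, role):
        """Genes from one genome that implement one role (may be empty)."""
        return set(self._cells.get((genome, role), ()))

    def column(self, role):
        """All genes implementing one role, grouped by genome."""
        return {
            genome: set(genes)
            for (genome, r), genes in self._cells.items()
            if r == role and genes
        }

    def genomes(self):
        """Row labels: every genome with at least one gene placed."""
        return sorted({genome for (genome, _role) in self._cells})


# Hypothetical identifiers, used only to show the shape of the data.
glycolysis = PopulatedSubsystem("Glycolysis", ["glucokinase", "phosphofructokinase"])
glycolysis.add_gene("genomeA", "glucokinase", "genomeA.peg.1")
glycolysis.add_gene("genomeB", "phosphofructokinase", "genomeB.peg.3")
```

A cell may hold more than one gene, which is why each cell is modelled as a set rather than a single value.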

The NMPDR annotators have created hundreds of subsystems which are continually curated. New genomes are integrated into the subsystem collection, new subsystems covering everything from central metabolism to virulence factors are initiated, and the entire collection is made continuously available. The NMPDR works cooperatively with a growing community of annotators building these subsystems.

FIGfams are protein families. Each family is intended to contain a set of globally similar proteins that implement the same function. These families arise from two sources:

  1. Columns from subsystems are used as one source. In this case, our annotators have manually constructed these sets.
  2. A second source comes from forming sets of corresponding genes from a set of very closely related genomes. For example, we have taken all of the Listeria monocytogenes genomes and algorithmically computed correspondences between the genes in those genomes, forming what we call close-strain sets.

The FIGfams have not yet been published and released. At this point they are still being refined and corrected. Efforts involving manual curation are being used to evaluate differences from other protein family collections – most notably the TIGR equivalogs and the PIRSF.

The allocation of annotator time is based on forming a prioritized list of subsystems to be developed, which is under continual review. Annotators must continuously update their existing subsystems as new genomes are added to the collection, and when possible start new subsystems from the prioritized list. Management prioritizes these subsystems based on increasing our coverage of the NMPDR genomes, development of subsystems covering known virulence factors and essential genes, and the degree to which the genes involved can be mapped to orthologs in other organisms (ideally other pathogens).

At this time, approximately 40% of the genes in most of the NMPDR genomes are included in existing subsystems that are continuously, manually curated.

Scope

This SOP applies to the procedures used to manually annotate gene function in the NMPDR. It describes the steps followed by the NMPDR annotators, the decision procedures they apply, and the evidence codes assigned as a result of those decision procedures.

Applicable Regulations and Guidelines

  • NMPDR Contract: Delivery of NMPDR SOPs
  • BRC Metrics: Production of metrics
  • GO: List of GO terms
  • Transaction Logging: NMPDR Logging requirements

Responsibility

This SOP applies to those members of the NMPDR research team involved in annotating data. This includes the following:

  • Principal Investigator
  • Annotation Manager
  • Annotators
  • Bioinformaticians

Definitions

  • Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity of the performance of a specific function.
  • Subsystem: A collection of functional roles that together implement a specific biological process or structural complex.
  • FIGfam: A protein family. Each family is intended to contain a set of globally similar proteins that implement the same function.
  • Annotation: A tuple consisting of a date, annotator name, and textual message.
  • Structured Annotation: An annotation where the text is structured. There are two kinds of structured annotations:
    1. Placement of a gene within a subsystem.
    2. Assignment of a function to a gene.
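As a rough illustration of these definitions, an annotation and its two structured forms could be modelled as below. This is a sketch under stated assumptions: the class and field names are hypothetical, and the SOP does not specify how the structured text is actually encoded.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    """The basic tuple: a date, an annotator name, and a textual message."""
    when: date
    annotator: str
    message: str


@dataclass(frozen=True)
class SubsystemPlacement(Annotation):
    """Structured annotation, kind 1: placement of a gene within a subsystem."""
    gene_id: str      # hypothetical identifier format
    subsystem: str
    role: str


@dataclass(frozen=True)
class FunctionAssignment(Annotation):
    """Structured annotation, kind 2: assignment of a function to a gene."""
    gene_id: str      # hypothetical identifier format
    function: str


# Example with made-up values, shown only to illustrate the fields.
example = FunctionAssignment(
    when=date(2007, 1, 15),
    annotator="annotator1",
    message="Assigned function based on subsystem curation",
    gene_id="genomeA.peg.1",
    function="glucokinase",
)
```

Both structured kinds remain ordinary annotations, so any code that handles the basic tuple also handles them.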

Process Overview

  1. Mark genes for which relevant literature exists.
  2. Use the manually curated subsystems-based annotation if it exists.
  3. Process genes not yet included in subsystems.

Context

Annotators use the SEED annotator machine and the SEED interface to perform annotations.

Annotation Decision Procedure

The process of annotation involves assigning a functional role to a gene. As part of the process, the gene is marked with one or more evidence codes. These are noted in the procedure below.

  1. Mark genes for which relevant literature exists. A semi-automatic procedure exists for attaching relevant literature to genes and evaluating it. Using the tools provided by NCBI, we have attached specific papers to genes from the NMPDR genomes. During curation our annotators can delete connections they consider inappropriate, or they can add references, often under guidance from the user community. The percentage of genes in NMPDR genomes for which connections to publications exist is low. The more common case is that a paper is connected to one or more clear orthologs of a gene from an NMPDR organism. For each gene maintained in our subsystem collection, we have attached relevant papers to the functional roles maintained in the subsystem, and annotators can access these papers as they curate each gene. When papers are connected directly to a gene, the evidence code dlit (for direct literature) is attached to the gene. When no direct references are connected, but connections do exist to the functional role of the containing subsystem, a code of ilit (for indirect literature) is attached to the gene.
  2. Use the manually curated subsystems-based annotation if it exists. If a gene has been manually placed in a subsystem by an annotator, this amounts to an assertion by the annotator that the gene implements one or more functional roles from the subsystem. We attach the evidence code isu if the gene is the only one performing the role, or idu if it is one of many performing the role.
  3. Process genes not yet included in subsystems. At this point, over 50% of the genes remain outside defined subsystems. These genes are processed automatically. For each gene not in a subsystem, the following procedure is used.
    • If the gene occurs within a FIGfam (presumably arising through close-strain sets, since the gene is not included in a subsystem), it is assigned the function associated with the FIGfam and the evidence code ff is attached.
    • If the gene is functionally clustered with other genes, and the score indicates 5 or more co-occurrences within divergent genomes, then it is assigned the function of the similarly-clustered genes. An evidence code of cwh is attached if all the relevant assignments are hypothetical, and cwn otherwise.
    • If the gene has not been assigned a function through its presence in a FIGfam, it will be assigned a function using a simple algorithm which examines the functional roles assigned to similar genes by other project groups. These assertions will be weighted based on similarity strength and confidence in the source of the assertion. Highest confidence is given to assertions for similar genes in subsystems or [1] entries. Lower confidence is given to other sources. Genes in this category are considered to have unreliable assignments.

Note that a single assignment may qualify for more than one of the above cases and would as a result have several evidence codes attached.
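One possible reading of the decision procedure above is sketched below. It is not the SEED implementation: the `GeneEvidence` record and its field names are hypothetical, and the sketch assumes the inputs (literature links, subsystem placement, FIGfam membership, co-occurrence score) have already been computed elsewhere. The subsystem codes are written isu and idu, as they appear in the Summary.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene inputs to the decision procedure."""
    has_direct_literature: bool = False    # papers attached directly to the gene
    role_has_literature: bool = False      # papers attached to its subsystem role
    in_subsystem: bool = False             # manually placed by an annotator
    unique_in_role: bool = True            # only gene performing the role
    in_figfam: bool = False
    cluster_cooccurrences: int = 0         # co-occurrences within divergent genomes
    cluster_all_hypothetical: bool = False # all clustered assignments hypothetical


def evidence_codes(gene):
    """Return the evidence codes for one gene, in procedure order."""
    codes = []

    # Step 1: literature. ilit applies only when no direct reference exists
    # but the functional role of the containing subsystem has papers attached.
    if gene.has_direct_literature:
        codes.append("dlit")
    elif gene.in_subsystem and gene.role_has_literature:
        codes.append("ilit")

    # Step 2: manually curated subsystem placement.
    if gene.in_subsystem:
        codes.append("isu" if gene.unique_in_role else "idu")
        return codes

    # Step 3: automatic processing of genes outside subsystems.
    if gene.in_figfam:
        codes.append("ff")
    if gene.cluster_cooccurrences >= 5:
        codes.append("cwh" if gene.cluster_all_hypothetical else "cwn")

    # Genes reaching this point with no step 3 code fall back to the
    # similarity-based assignment, which the SOP treats as unreliable.
    return codes
```

The sketch applies step 3 only to genes outside subsystems, following the wording of that step; a gene can still collect several codes, for example ff together with cwn.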

Summary

We rank our confidence in assignments based on the attached evidence codes. Our guidelines to users are as follows:

  1. Genes with codes icw and/or isu are considered most reliable. An additional dlit (or to a lesser extent ilit) increases confidence.
  2. Genes with idu and/or ff are the next most reliable. Again, dlit and ilit increase confidence.
  3. Genes with cwn are far less reliable, but the functional clustering with a non-hypothetical can be viewed as a suggestive clue.
  4. Finally, genes with just cwh are also considered very unreliable, but the clustering should be viewed as a significant clue.
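The guidelines above can be turned into a simple ordering of assignments. The sketch below is one interpretation only: the function names and the numeric tiers are hypothetical, and treating literature codes as a tie-breaker within a tier is an assumption rather than something the SOP specifies.

```python
def confidence_tier(codes):
    """Map a gene's evidence codes to a tier: 1 is most reliable, 4 least.

    Returns None when no ranked code is present.
    """
    codes = set(codes)
    if codes & {"icw", "isu"}:
        return 1
    if codes & {"idu", "ff"}:
        return 2
    if "cwn" in codes:
        return 3
    if "cwh" in codes:
        return 4
    return None


def rank_key(codes):
    """Sort key: lower sorts first, meaning higher confidence.

    Assumption: within a tier, dlit outranks ilit, which outranks no literature.
    """
    codes = set(codes)
    tier = confidence_tier(codes)
    if "dlit" in codes:
        literature = 0
    elif "ilit" in codes:
        literature = 1
    else:
        literature = 2
    return (tier if tier is not None else 5, literature)


# Hypothetical genes with their attached codes, ordered by confidence.
assignments = {
    "geneA": ["cwh"],
    "geneB": ["isu"],
    "geneC": ["ff", "cwn"],
    "geneD": ["isu", "dlit"],
}
ordered = sorted(assignments, key=lambda gene: rank_key(assignments[gene]))
```

A gene carrying codes from more than one tier is placed in the most reliable tier it qualifies for.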

References

1. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Ruckert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005 Oct 7;33(17):5691-702.
