NMPDR Wiki

The Official Documentation Site for the National Microbial Pathogen Data Resource

Computing GO and PFAM Attributes

NMPDR Standard Operating Procedure SOP007

Introduction

This standard operating procedure (SOP) describes the steps required to compute GO and PFAM attributes for genes in newly added NMPDR genomes.

Scope

This SOP covers the procedures used to assign GO and PFAM attributes to genes in the SEED database. These attributes are incorporated automatically into the next NMPDR release.

Applicable Regulations and Guidelines

  • NMPDR Contract: Delivery of NMPDR SOPs
  • BRC Metrics: Production of metrics
  • GO: List of GO terms
  • Transaction Logging: NMPDR Logging requirements

Responsibility

This SOP applies to those members of the NMPDR research team involved in managing the attributes of NMPDR data. This includes the following:

  • Annotators
  • Bioinformaticians

Definitions

  • Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity in the performance of a specific function.
  • Subsystem: A collection of functional roles that together implement a specific biological process or structural complex.
  • FigFam: A protein family. Each family is intended to contain a set of globally similar proteins that implement the same function.
  • Annotation: A tuple consisting of a date, annotator name, and textual message.
  • Structured Annotation: An annotation whose text is structured. There are two kinds of structured annotations:
    1. Placement of a gene within a subsystem.
    2. Assignment of a function to a gene.

Process Overview

  1. Establish the computation environment.
  2. Identify genomes to be processed.
  3. Run the automated process.
  4. Install the new attributes.


Computation Procedure

Establishing the Computation Environment

  1. Log in to bio-ppc-1.mcs.anl.gov.
  2. Change to the bash shell.
  3. Source the FIG environment: source /home/username/FIGdisk/config/fig-user-env.sh
  4. Change to the ~mkubal/Domain_Analysis directory.

Identifying Genomes to be Processed

  1. Create a text file named nmpdr_genomes_to_be_processed.txt. Each line should contain the genome ID of one newly added NMPDR genome.
  2. Run the Perl script submit_nmpdr_genomes_to_pipeline.pl. Processing takes approximately 4 hours per genome, depending on the load on the cluster.
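
Because each genome takes hours to process, it may be worth checking the input file before submitting it. The following is a minimal sketch, not part of the official procedure. The function name check_genome_list is hypothetical, and the check assumes that genome IDs have the form taxon.version (for example 83333.1); adjust the pattern if your IDs differ.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the genome list before submitting it to the pipeline.
# Assumption: a genome ID is two runs of digits joined by a dot, e.g. 83333.1.

# Report every malformed line (with its line number) on stderr.
# Returns 0 if the file is non-empty and every line is a well-formed ID.
check_genome_list() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    # grep -v selects lines that do NOT match; it exits 0 when it finds any.
    if grep -vnE '^[0-9]+\.[0-9]+$' "$file" >&2; then
        return 1
    fi
    return 0
}

# Example use (commented out so that sourcing this file has no side effects):
# check_genome_list nmpdr_genomes_to_be_processed.txt \
#     && perl submit_nmpdr_genomes_to_pipeline.pl
```

Blank lines and trailing whitespace are reported as malformed, which catches the most common editing mistakes.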

Running the Automated Process

  1. Run the Perl script perl_parse_pfam_by_genome.pl.
  2. Change to the ~mkubal/Domain_Analysis/NMPDR_Results directory.
  3. Concatenate the GO results into a single file.
  4. Run the Perl script prepare_go_for_bruce.pl to generate an attribute file for the GO data.
  5. Concatenate the PFAM results into a single file.
  6. Run the Perl script prepare_pf_for_bruce.pl to generate an attribute file for the PFAM data.

The commands to perform the above tasks are shown below.

 perl perl_parse_pfam_by_genome.pl
 cd ~mkubal/Domain_Analysis/NMPDR_Results
 cat *_go_* >go_input.txt 
 cat *_pfam_* >pfam_input.txt 
 perl prepare_go_for_bruce.pl >go_attributes.tbl
 perl prepare_pf_for_bruce.pl >pf_attributes.tbl
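
A plain cat with a glob that matches nothing leaves an empty input file behind, and the error is easy to miss. The following is a hedged sketch of a more defensive concatenation step. The helper name concat_results is hypothetical and is not part of the NMPDR tooling; the file patterns and output names are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: concatenate per-genome result files, failing loudly if none are found.

# concat_results PATTERN OUTPUT
# Concatenates every regular file in the current directory that matches
# PATTERN into OUTPUT. Returns non-zero, and leaves OUTPUT untouched, if
# nothing matched or the combined result is empty.
concat_results() {
    local pattern="$1"
    local output="$2"
    local tmp="$output.tmp.$$"
    local found=0
    local f
    : > "$tmp" || return 1
    # $pattern is deliberately unquoted so that the shell expands the glob.
    for f in $pattern; do
        # Skip an unmatched literal pattern, the output file and the temp file.
        if [ -f "$f" ] && [ "$f" != "$output" ] && [ "$f" != "$tmp" ]; then
            cat "$f" >> "$tmp" || { rm -f "$tmp"; return 1; }
            found=$((found + 1))
        fi
    done
    if [ "$found" -eq 0 ] || [ ! -s "$tmp" ]; then
        echo "no non-empty files match $pattern" >&2
        rm -f "$tmp"
        return 1
    fi
    mv "$tmp" "$output"
}

# Example use, mirroring the SOP commands (commented out on purpose):
# cd ~mkubal/Domain_Analysis/NMPDR_Results
# concat_results '*_go_*' go_input.txt     && perl prepare_go_for_bruce.pl > go_attributes.tbl
# concat_results '*_pfam_*' pfam_input.txt && perl prepare_pf_for_bruce.pl > pf_attributes.tbl
```

Writing to a temporary file first means that a failed run does not overwrite an input file left by an earlier, successful run.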

Installing the New Attributes

Copy the .tbl files generated in the previous step (go_attributes.tbl and pf_attributes.tbl) to the disks/nmpdr/attributes directory on nmpdr-1.nmpdr.org.
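
The SOP does not say which tool to use for the copy. The following sketch assumes that scp is acceptable and that your account can log in to nmpdr-1.nmpdr.org; both are assumptions. The function name install_attributes and the DRY_RUN variable are hypothetical. The destination path is taken verbatim from the text above, so scp resolves it relative to the remote home directory unless you change it to an absolute path.

```shell
#!/usr/bin/env bash
# Sketch: copy the attribute files to the NMPDR server, with a dry-run mode.
# Assumptions: scp is the transfer tool; host and directory come from the SOP.

DEST_HOST="nmpdr-1.nmpdr.org"
DEST_DIR="disks/nmpdr/attributes"   # as written in the SOP (a relative path)

# Refuse to copy if either attribute file is missing or empty.
# With DRY_RUN=1 the scp command is printed instead of executed.
install_attributes() {
    local f
    for f in go_attributes.tbl pf_attributes.tbl; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scp go_attributes.tbl pf_attributes.tbl $DEST_HOST:$DEST_DIR/"
    else
        scp go_attributes.tbl pf_attributes.tbl "$DEST_HOST:$DEST_DIR/"
    fi
}

# Example use (commented out so that sourcing this file copies nothing):
# cd ~mkubal/Domain_Analysis/NMPDR_Results
# DRY_RUN=1; install_attributes   # inspect the command first
# DRY_RUN=0; install_attributes   # then perform the copy
```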
