Computing GO and PFAM Attributes
From NMPDR Wiki
NMPDR Standard Operating Procedure SOP007
Introduction
This standard operating procedure (SOP) describes the steps required to compute GO and PFAM attributes for genes in the SEED database.
Scope
This SOP applies to the procedures to assign GO and PFAM attributes to genes in the SEED database. These attributes are incorporated automatically into the next NMPDR release.
Applicable Regulations and Guidelines
| NMPDR Contract | Delivery of NMPDR SOPs |
| BRC Metrics | Production of metrics |
| GO | List of GO terms |
| Transaction Logging | NMPDR Logging requirements |
Responsibility
This SOP applies to those members of the NMPDR research team involved in managing the attributes of NMPDR data. This includes the following:
- Annotators
- Bioinformaticians
Definitions
- Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity of the performance of a specific function.
- Subsystem: A collection of functional roles that together implement a specific biological process or structural complex.
- FigFam: Protein families. Each family is intended to contain a set of globally similar proteins that implement the same function.
- Annotation: A tuple consisting of a date, annotator name, and textual message.
- Structured Annotation: An annotation where the text is structured. There are two kinds of structured annotations:
  - Placement of a gene within a subsystem.
  - Assignment of a function to a gene.
Process Overview
- Establish the computation environment.
- Identify genomes to be processed.
- Run the automated process.
- Install the new attributes.
Computation Procedure
Establishing the Computation Environment
- Log in to bio-ppc-1.mcs.anl.gov.
- Change to the bash shell.
- Source the FIG environment: source /home/username/FIGdisk/config/fig-user-env.sh
- Change to the ~mkubal/Domain_Analysis directory.
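The environment setup above can be guarded so that a missing or unreadable environment file is reported before any pipeline script runs. The following is a minimal sketch, not part of the original procedure: the function name load_fig_env is hypothetical, and the path passed to it should be the environment file named in the step above.

```shell
# Hypothetical helper (not part of the NMPDR tooling): source an
# environment file only if it exists and is readable.
load_fig_env() {
    env_file="$1"
    if [ ! -r "$env_file" ]; then
        echo "cannot read environment file: $env_file" >&2
        return 1
    fi
    # Load the settings into the current shell.
    . "$env_file"
}
```

A possible use, with FIG_ENV set to the environment file path: load_fig_env "$FIG_ENV" && cd ~mkubal/Domain_Analysis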
Identifying Genomes to be Processed
- Create a text file named nmpdr_genomes_to_be_processed.txt. Each line should contain the genome ID of one newly added NMPDR genome.
- Run the Perl script submit_nmpdr_genomes_to_pipeline.pl. Depending on the load on the cluster, this takes approximately 4 hours per genome.
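Because each genome takes hours to process, it may be worth checking the genome list before submitting it. The sketch below is an illustration only: the function name check_genome_list is hypothetical, and it assumes genome IDs have the form taxon.version (for example 83333.1), which this SOP does not state explicitly.

```shell
# Hypothetical helper: verify that a genome list is non-empty and that
# every line looks like <taxon>.<version>, e.g. 83333.1 (assumed format).
check_genome_list() {
    list_file="$1"
    if [ ! -s "$list_file" ]; then
        echo "missing or empty: $list_file" >&2
        return 1
    fi
    # grep -v prints the lines that do NOT match and succeeds if any exist,
    # so a successful grep here means the list contains a bad line.
    if grep -n -v -E '^[0-9]+\.[0-9]+$' "$list_file" >&2; then
        return 1
    fi
    return 0
}
```

For example: check_genome_list nmpdr_genomes_to_be_processed.txt && perl submit_nmpdr_genomes_to_pipeline.pl. Blank lines and trailing whitespace are reported as bad lines under this check.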
Running the Automated Process
- Run the Perl script perl_parse_pfam_by_genome.pl.
- Change to the ~mkubal/Domain_Analysis/NMPDR_Results directory.
- Concatenate the GO results into a single file.
- Run the Perl script prepare_go_for_bruce.pl to generate an attribute file for the GO data.
- Concatenate the PFAM results into a single file.
- Run the Perl script prepare_pf_for_bruce.pl to generate an attribute file for the PFAM data.
The commands to perform the above tasks are shown below.
perl perl_parse_pfam_by_genome.pl
cd ~mkubal/Domain_Analysis/NMPDR_Results
cat *_go_* >go_input.txt
cat *_pfam_* >pfam_input.txt
perl prepare_go_for_bruce.pl >go_attributes.tbl
perl prepare_pf_for_bruce.pl >pf_attributes.tbl
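If no result files match a pattern, a plain cat with redirection still creates an empty input file, and the problem may only surface later. A guarded version of the concatenation step might look like the sketch below. The function name concat_results is hypothetical, and the sketch assumes result file names contain no whitespace and that the output name does not itself match the pattern.

```shell
# Hypothetical helper: concatenate every file in the current directory
# that matches a glob pattern into one output file, and fail if nothing
# matches instead of writing an empty file.
concat_results() {
    pattern="$1"   # for example '*_go_*' (quote it when calling)
    out="$2"       # for example go_input.txt
    # Expand the pattern into this function's positional parameters.
    set -- $pattern
    if [ ! -e "$1" ]; then
        echo "no files match pattern: $pattern" >&2
        return 1
    fi
    cat "$@" > "$out"
}
```

Possible use in the results directory: concat_results '*_go_*' go_input.txt and concat_results '*_pfam_*' pfam_input.txt, each followed by the corresponding prepare script only if the call succeeds.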
Installing the New Attributes
Copy the .tbl files generated by the previous step (go_attributes.tbl and pf_attributes.tbl) to the disks/nmpdr/attributes directory on nmpdr-1.nmpdr.org.
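The SOP does not say which tool to use for the copy. The sketch below assumes scp is available and that the account has access to the server; the function name install_attributes and the DRY_RUN variable are hypothetical. It refuses to copy missing or empty files, and with DRY_RUN=1 it prints the command instead of running it.

```shell
# Hypothetical helper: copy attribute files to a destination with scp
# (an assumed transfer tool), after checking that each file is non-empty.
install_attributes() {
    dest="$1"
    shift
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run without contacting the server.
        echo scp "$@" "$dest"
    else
        scp "$@" "$dest"
    fi
}
```

A dry run could look like: DRY_RUN=1 install_attributes nmpdr-1.nmpdr.org:disks/nmpdr/attributes/ go_attributes.tbl pf_attributes.tbl. The destination here repeats the directory as written in this SOP; with scp, a path without a leading slash is resolved relative to the remote home directory, so confirm the intended location before copying.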
